
D-Lib Magazine
October 1999

Volume 5 Number 10

ISSN 1082-9873

Multilingual Information Discovery and AccesS (MIDAS)

A Joint ACM DL'99 / ACM SIGIR'99 Workshop


Douglas Oard
University of Maryland
oard@glue.umd.edu
Carol Peters
Consiglio Nazionale delle Ricerche
carol@iei.pi.cnr.it
 
Miguel Ruiz
University of Iowa
mruiz@cs.uiowa.edu
Robert Frederking
Carnegie Mellon University
ref@cs.cmu.edu
 
Judith Klavans
Columbia University
klavans@cs.columbia.edu
Páraic Sheridan
TextWise LLC
paraic@textwise.com

Introduction

The development of technologies that enable access to information regardless of geographic or language barriers is a key factor for truly global sharing of knowledge. Users of internationally distributed information networks need tools that allow them to find, retrieve and understand relevant information, in whatever language and form it may have been stored. This pressure has created a convergence of interests from diverse research communities around a topic that we refer to as Multilingual Information Discovery and AccesS (MIDAS). This multidisciplinary enterprise has drawn the attention of researchers with backgrounds in fields such as information retrieval, natural language processing, machine translation, summarization, speech processing, document image understanding, and human-computer interaction. Sharing ideas, approaches and methodologies between those communities should accelerate research progress, but crafting effective cross-community venues is a challenging task. The conjunction of the ACM conferences on digital libraries and information retrieval provided a perfect opportunity to assemble a group with a broad range of perspectives. The aim of the MIDAS Workshop, held in Berkeley on Saturday August 14, 1999, was thus not only to provide a forum in which researchers could describe ongoing work and present new results, but also to see the issues from other perspectives and to discuss unresolved questions that might be best addressed with approaches that draw from more than one research tradition. The workshop was attended by about 50 participants from academic, government and industrial institutions in North America, Europe and Asia. It was co-chaired by Douglas Oard (College of Library and Information Services, University of Maryland, USA) and Carol Peters (Istituto di Elaborazione della Informazione, National Research Council, Italy).

This was not the first meeting to seek to bring together researchers from different communities around this topic. A first workshop on cross-language information retrieval -- at SIGIR ’96 -- brought together people with experience in information retrieval, machine translation, computational linguistics and digital libraries (Grefenstette, 1998). Later that year, the U.S. Defense Advanced Research Projects Agency organized a similar meeting between people with experience in machine translation and speech processing. Participants from each of those workshops came together at Stanford for an AAAI symposium in 1997 (Hull and Oard, 1997). Two working groups have been jointly funded by the U.S. National Science Foundation and the European Commission to examine multilingual information access and management from the perspectives of digital libraries and computational linguistics (Klavans and Schäuble, 1998; Hovy et al., 1999), and presentations on aspects of the problem now routinely appear at major conferences on digital libraries, information retrieval, machine translation and computational linguistics.

In opening the workshop, Carol Peters explained that the MIDAS workshop aimed to take the discussion a step further by:

Addressing such issues requires a mixture of theory, experience and insight, so the workshop schedule was planned to provide room for all three. Each of the first four sessions was built around one of the three questions identified above (the first point being the focus of two sessions). Each session included a keynote to help set the stage, and three sessions included a research presentation illustrating some aspect of the state of the art. The goal was to prime the discussion with one or two talks on each subject, but to preserve as much time as possible for an exchange of opinions between the panel and the audience. The final session consisted of a panel discussion in which representatives from governmental and nongovernmental organizations talked about the directions that the global research agenda should take.

In this report, we summarize the key issues raised in each session. Miguel Ruiz (University of Iowa, USA) and Anne Diekema (TextWise LLC, USA) served as rapporteurs. The workshop program, position papers, research papers, and slides from most of the presentations can be found at <http://www.clis.umd.edu/conferences/midas.html>.

Multilingual Access to Electronic Texts

The first session, led by session chair Judith Klavans (Center for Research on Information Access, Columbia University, USA), addressed topics ranging from historical perspectives on developments in cross-language information retrieval (CLIR) to user needs. The discussion was provocative, and included questions on the role of different tasks in CLIR, the range of user abilities, and the need for cross-fertilization between fields.

The keynote speaker in this session, Eduard Hovy (Information Sciences Institute, University of Southern California, USA), focused on the need for integration between the machine translation (MT) and information retrieval (IR) communities in order to tackle the multilingual retrieval problem. He identified three generations of multilingual systems. In generation 0, multilingual systems were built by performing MT and IR in a sort of pipelined process: CLIR = MT | IR. Generation 1, which includes systems that are being built today, interleaves the functionality of both communities more tightly to overcome their respective weaknesses, thus building more efficient and robust systems. The generation 2 systems that Hovy envisions will build new modules that mix MT and IR functions in such a way that the border between them cannot be easily recognized. Hovy identified four sources of inaccuracy in CLIR: query underspecification, query translation, unfocused full document retrieval, and full document translation. He outlined two potential solutions: (1) have the user help when translating the query and selecting relevant documents, and (2) have IR and MT compensate for each other’s weaknesses. In Hovy’s opinion, IR is unfocused because it seeks to provide documents rather than answers. In this respect, Natural Language Processing (NLP) techniques can help by analyzing sentence structure and context. Hovy dedicated the major part of his talk to discussing some of the important steps in CLIR and showing how component technologies could be mixed and matched.

The keynote was followed by a research paper presented by William Ogden (Computing Research Lab, New Mexico State University, USA). Ogden described experiences at NMSU with the development and evaluation of systems for cross-language retrieval, summarization and translation. He posed two fundamental questions concerning user/system interaction: (1) who are the users? and (2) what are their goals? Ogden suggested two illustrative scenarios to provide a framework for considering these questions: (1) an advanced foreign language reader, who may simply need help with cross-language query formulation, and (2) a monolingual user who wishes to pose a query in a single language and then understand the answers contained in any retrieved document regardless of language. He used this discussion of user needs in terms of the tasks involved to highlight the need for task-based evaluation. Ogden then described results from several experiments that evaluated the usefulness of different kinds of visual presentations for two tasks: (1) helping users judge the relevance of documents they cannot read, and (2) helping users reformulate a query. He showed a document thumbnail view of retrieval results in which a user can see the spatial distribution of key terms in each document at a glance. The goal of new presentation technologies of this type is to reduce the time spent in repeating and refining the viewing of results. Results of evaluation showed that users were able to recognize relevant documents using thumbnails alone, rather than titles. They could retrieve useful information just by knowing where the key terms in a document occurred. Ogden concluded with a demonstration of a summarization system in which named entities are used to help select the best sentences.

Mun-Kew Leong (Kent Ridge Digital Labs, Singapore) made the first of the two short panel presentations in this session, drawing an analogy between multilingual information and database normalization. He suggested that several of the steps involved in CLIR could be seen as normalization, for example: (1) normalize the queries -- by translating the keywords, (2) normalize the data -- by translating the collection, or (3) normalize the index space -- through techniques such as Cross-Language Latent Semantic Indexing. The overarching message was that the IR community might learn from the database community, in which the normalization process is a key to less ambiguous semantics.

W. Bruce Croft (Center for Intelligent Information Retrieval, University of Massachusetts, USA) summarized some research results from his group. He reported that dictionary-based query translation, combined with some low-level but robust disambiguation, can be a successful CLIR approach, achieving results close to monolingual IR (90% of monolingual retrieval effectiveness by commonly used measures). A comment was made regarding the difficulty of obtaining useful machine-readable dictionaries. Croft observed that rapid incorporation of new languages for which limited language resources exist will be a key issue in this regard. He suggested other priorities for CLIR research as well, including summarization and visualization of results, claiming that such topics are even more important for CLIR than in monolingual IR since there can be wide variation in user abilities and user needs.
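Dictionary-based query translation with simple disambiguation can be illustrated with a minimal sketch. The report does not say which disambiguation technique Croft's group used, so the co-occurrence heuristic below is an assumption chosen only for illustration, and the toy dictionary, toy corpus and function names are all hypothetical.

```python
from itertools import product

# Hypothetical toy Spanish-to-English dictionary (not from the report).
TOY_DICT = {
    "banco": ["bank", "bench"],
    "dinero": ["money"],
}

# Hypothetical target-language documents, used only to count co-occurrences.
TOY_CORPUS = [
    "the bank holds money for its customers",
    "she sat on the bench in the park",
    "money was deposited at the bank",
]


def cooccurrence(a, b, corpus):
    """Count the documents that contain both words."""
    return sum(1 for doc in corpus if a in doc.split() and b in doc.split())


def translate_query(terms, dictionary, corpus):
    """Choose one translation per term so that the chosen words co-occur most often."""
    # Terms missing from the dictionary are passed through unchanged.
    candidates = [dictionary.get(term, [term]) for term in terms]
    best, best_score = None, -1
    for combo in product(*candidates):
        # Score a combination by summing pairwise co-occurrence counts.
        score = sum(
            cooccurrence(combo[i], combo[j], corpus)
            for i in range(len(combo))
            for j in range(i + 1, len(combo))
        )
        if score > best_score:
            best, best_score = list(combo), score
    return best


print(translate_query(["banco", "dinero"], TOY_DICT, TOY_CORPUS))
```

In this toy setting the ambiguous term "banco" resolves to "bank" because "bank" and "money" appear together in the sample documents while "bench" and "money" do not. Exhaustive enumeration of combinations is only practical for short queries; it is used here for clarity rather than efficiency.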

Much of the discussion in the first session focused on the differing perspectives of the IR community, which has found statistical methods to be particularly useful, and the NLP community, which has embraced a mixture of both statistical and rule-based symbolic techniques. It was claimed that the integration of NLP and IR does not merely add complexity, but rather multiplies it because present techniques in each field are quite "fragile," requiring extensive tuning to perform well in specific applications. It was thus generally agreed that continued communication between the two communities will be needed if we are to make progress in multilingual information discovery and access. There was, however, less agreement regarding Hovy’s claim that substantial unrealized potential exists for exploiting NLP in IR systems -- the crucial question that was posed is "where is the evidence that NLP techniques beyond those presently used (e.g., stemming) would actually improve retrieval?" The tenability of manual query refinement was also challenged by an observation that systems that rely on such approaches have not always been well received in monolingual applications.

Handling Multilingual Speech, Document Images, and Video OCR

The second session focused on access to multilingual information in formats other than character-coded text. An underlying theme, posed by session chair Robert Frederking (Language Technologies Institute, Carnegie Mellon University, USA), was "does information access using modalities other than electronic text involve anything beyond retrieval from degraded text?" This led to a discussion of issues including the cost of adding new languages to a multilingual system, automatic topic labeling for foreign language video, the state of the art in document image processing technology, and CLIR from recorded speech.

The keynote speaker was Peter Schäuble (Eurospider Information Technology, Switzerland). Schäuble presented an overview of the processing stages and resources that are needed to produce a state-of-the-art multilingual information retrieval system. His conclusion was that system development using the present methodology would require about ten person-years of effort for each additional language, allocated as follows: 64 person-months for resource acquisition and processing, 32 person-months for software development, and 18 person-months for system tuning. Covering the thirty languages proposed for the U.S. Translingual Information Detection, Extraction and Summarization (TIDES) program would thus require a total of 300 person-years if we were to keep working in the same way that we are now. Schäuble’s presentation stimulated considerable discussion, both about cost factors that might grow less rapidly as new languages are added (evaluation, for example) and about which components would be most amenable to lower-cost techniques.
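The arithmetic behind these estimates can be checked with a short sketch. The person-month figures and the count of thirty languages come from the keynote as reported above; the function and variable names are hypothetical.

```python
MONTHS_PER_YEAR = 12

# Per-language effort in person-months, as reported in the keynote.
EFFORT_PERSON_MONTHS = {
    "resource acquisition and processing": 64,
    "software development": 32,
    "system tuning": 18,
}


def person_years_per_language(effort=EFFORT_PERSON_MONTHS):
    """Convert the summed person-month estimates into person-years."""
    return sum(effort.values()) / MONTHS_PER_YEAR


def program_person_years(num_languages, years_per_language):
    """Total effort if every language costs the same amount."""
    return num_languages * years_per_language


exact = person_years_per_language()
print(exact)                            # unrounded per-language estimate
print(program_person_years(30, 10))     # using the rounded "about ten" figure
print(program_person_years(30, exact))  # using the unrounded figure
```

The three components sum to 114 person-months, or 9.5 person-years, which is the basis for "about ten." The 300 person-year total uses the rounded figure; the unrounded figure gives 285.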

The research paper by Alexander Hauptmann (School of Computer Science, Carnegie Mellon University, USA) explained how automatic topic labeling for multilingual broadcast news has been implemented in the Informedia project. Informedia has long included topic-labeling for English video, and Hauptmann explained how this capability has been extended to foreign-language stories. English topic labels were thought to be more useful than title translation for identifying relevant stories in a foreign language because title translations are error-prone due to their telegraphic style and the resulting lack of adequate context to constrain word choice. Hauptmann explained that the probabilities of successful speech recognition, machine translation and information retrieval combine in a multiplicative way, but that exploitation of redundant information in different modalities (including metadata) can help to overcome this limitation. He gave a brief demonstration of the system, showing results obtained using Croatian television news stories. As a final note, Hauptmann observed that the k-nearest-neighbors approach that is presently employed may not be well suited for use with topics that change significantly over time because the training data would need to be updated periodically.
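The multiplicative effect that Hauptmann described, and the way redundancy can offset it, can be made concrete with a small sketch. The stage probabilities below are invented for illustration and are not Informedia results; treating the stages and the alternative evidence sources as independent is also an assumption.

```python
import math


def pipeline_success(stage_probs):
    """Probability that every stage of a cascade succeeds, assuming independence."""
    return math.prod(stage_probs)


def at_least_one(source_probs):
    """Probability that at least one independent evidence source succeeds."""
    return 1.0 - math.prod(1.0 - p for p in source_probs)


# Hypothetical success rates for speech recognition, translation and retrieval.
speech_path = pipeline_success([0.8, 0.7, 0.9])

# Hypothetical success rate for a second modality, such as metadata.
metadata_path = 0.5

print(speech_path)
print(at_least_one([speech_path, metadata_path]))
```

With these made-up numbers, three reasonably good stages yield a cascade that succeeds only about half the time, while adding a second, independent source of evidence raises the chance that at least one path succeeds to roughly three in four.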

The first panelist to speak was Henry Baird (Information Sciences and Technologies Lab, Xerox Palo Alto Research Center, USA). Baird described the state of the art in document image processing technology, including language recognition, two-dimensional layout analysis, recognition of document structure, and Optical Character Recognition (OCR). He pointed out that while OCR can perform well, it is subject to unpredictable, catastrophic failure, and illustrated the point by describing the disappointing results of early experience with the integration of off-the-shelf OCR and machine translation systems in the US Army Research Laboratory’s FALCON system. He observed that combining OCR with information retrieval seems to work better because the "bag of words" approach used in IR is quite robust. He reported that monolingual OCR systems are presently available for a relatively small number of languages and that, with the exception of a system for English and Japanese, there is almost no ability to handle mixed-language documents. Rapidly retargeting an OCR system to a new language is presently quite challenging, largely due to the monolithic structure of current OCR technology in which language-specific constraints are deeply enmeshed with other code. On the brighter side, language identification in document images turns out to be a quite tractable problem, much easier and more effective than might initially have been expected.

George Doddington (National Institute of Standards and Technology, USA) described the U.S. Topic Detection and Tracking (TDT) initiative. TDT has adopted an evaluation-driven research paradigm in which participants work with broadcast news (newswire stories and automatic transcriptions of radio and television broadcasts), attempting to answer the question "What’s New?" (detection) and to track topics over time. This year, TDT-3 is using English and Mandarin Chinese data for five tasks:

A recurring observation throughout the session was that there was a pressing need for the integration of technologies in order to achieve progress in multilingual multimodal system development.

Multilingual Metadata

The aim of the third session, chaired by Shigeo Sugimoto (University of Library and Information Science, Japan), was to explore issues regarding multilingual metadata in order to understand their relevance to multilingual information discovery and access. The Dublin Core effort (OCLC, 1999) has spawned a community interested in metadata standards in a multilingual environment, and that community has in turn built a working relationship with people interested in multilingual metadata content. This area is still very much in an initial exploratory stage, and thus no research paper was presented.

The keynote speaker was John Kunze (National Library of Medicine, USA), who focused on metadata standards. Kunze described the current state of the Dublin Core standard, reporting that it is presently described in sixteen languages, with about twelve more on the way (Baker, 1998). He observed that the fifteen core elements of the "standard" have been adapted and tailored by each organization that has adopted it, and offered a worse-is-better hypothesis in which implementation simplicity takes priority over completeness and consistency as a way of explaining this. Kunze stressed the importance of document surrogates (such as those provided by metadata) as a basis for information discovery, but noted that whether the Dublin Core in particular is useful for this purpose still remains to be established.

The panel session was opened by José Luis Borbinha (Biblioteca Nacional, Portugal), who offered a perspective on metadata as a device for mediating between an information provider and an information seeker. This led him to focus on metadata content, in contrast to the standards issues addressed by Kunze. Borbinha believes that there should be movement towards normalization of metadata content, with a more consistent use of controlled vocabularies for names, institutions, subject fields, etc. He also introduced the question of how we can evaluate the utility of metadata, a topic that received some attention in the next session.

Clifford Lynch (Coalition for Networked Information, USA) observed that with multilingual metadata we are in essence talking about cross-cultural retrieval. This is far more than mere cross-language searching, he claimed, and he offered as an example the fact that places and institutions in another country might be organized in ways that are not amenable to simple translation. Lynch views metadata as an encoding system that uses a quasi-natural language structure to create document surrogates that are critical for resource discovery. One particular problem noted by Lynch is that, in some cases, validity checks are needed to preclude abuse when the metadata is provided by the same source as the information it describes.

Evaluation of Multilingual Systems

Information retrieval and machine translation have developed quite different evaluation methodologies, and multilingual information discovery and access lies squarely at the intersection of the two. The goals for this session, chaired by Páraic Sheridan (TextWise LLC, USA), were to identify important issues for the evaluation of multilingual systems by considering the way in which IR systems are presently evaluated and then exploring the extent to which evaluation methodologies from related fields can be usefully incorporated.

Donna Harman (National Institute of Standards and Technology (NIST), USA) presented the keynote talk in which she described the evaluation of multilingual systems within the context of the Text Retrieval Conferences (TREC) (NIST, 1999). Spanish documents and queries were introduced in TREC-3, followed by Chinese in TREC-4. Retrieval systems that had been designed for English transferred remarkably well to monolingual applications in these languages, requiring only relatively minor modifications such as the replacement of English word stems with Chinese character bigrams. The Chinese evaluation task turned out to be "too easy," however, with most systems performing so well that it was almost impossible to distinguish between the results of systems that used widely different techniques. That test collection is therefore of relatively little value as a resource for further experimentation. In TREC-6 a Cross-Language Retrieval task was introduced with documents and queries in English, French, and German, and Italian was added in TREC-7. Through this experience, NIST has learned valuable lessons about designing an evaluation that includes multiple languages. Query formulation and relevance judgments are now performed by groups in Europe in order to involve native speakers. And as issues related to the cultural aspects of cross-language topics have arisen, it has become clear that topic translation must be more conceptual than literal. This approach has implications for the comparative evaluation of systems, but it ensures that the topic statements are more natural in each of the languages.
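The substitution of character bigrams for word stems mentioned above can be sketched briefly. This is a generic illustration that assumes overlapping bigrams; it is not the code of any particular TREC system, and the function name is hypothetical.

```python
def char_bigrams(text):
    """Return overlapping two-character index terms, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # A single character is kept as its own term; empty input yields no terms.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


# "信息检索" (information retrieval) needs no word segmentation to be indexed.
print(char_bigrams("信息检索"))
```

Because the bigrams overlap, an index built this way needs no word segmenter, which is one reason an English-oriented retrieval engine can be adapted with relatively minor changes.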

A research paper was presented by Aitao Chen (School of Information Management and Systems, University of California Berkeley, USA). In his talk, he described the automatic construction of a bilingual lexicon using a new Japanese/English CLIR test collection (NACSIS, 1999). Chen’s approach involved using aligned sentences from parallel texts of scientific abstracts and automatically creating a Japanese-English lexicon through statistical identification of word-word translation pairs in the aligned sentences. The lexicon was then used to translate queries from Japanese into English, and the queries were used to retrieve abstracts from an English portion of the test collection. Chen explored both the performance of individual system components and their contribution to overall task performance.
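Statistical identification of word-word translation pairs from aligned sentences can be illustrated with a minimal sketch. The report does not specify the statistic Chen used, so the Dice coefficient below is an assumed stand-in, and the pre-tokenized sentence pairs and function name are hypothetical toy examples rather than data from the NACSIS collection.

```python
from collections import Counter


def build_lexicon(aligned_pairs):
    """Map each source word to the target word with the highest Dice score."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in aligned_pairs:
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        # Count each source/target word pair once per aligned sentence pair.
        pair_count.update((s, t) for s in src_set for t in tgt_set)

    lexicon = {}
    for s in src_count:
        scored = [
            (2 * pair_count[(s, t)] / (src_count[s] + tgt_count[t]), t)
            for t in tgt_count
            if pair_count[(s, t)] > 0
        ]
        # Highest Dice score wins; ties fall back to string order.
        lexicon[s] = max(scored)[1]
    return lexicon


# Hypothetical, already tokenized Japanese/English sentence pairs.
TOY_PAIRS = [
    (["情報", "検索"], ["information", "retrieval"]),
    (["情報", "抽出"], ["information", "extraction"]),
    (["機械", "翻訳"], ["machine", "translation"]),
    (["機械", "学習"], ["machine", "learning"]),
]

lexicon = build_lexicon(TOY_PAIRS)
query = ["情報", "検索"]
print([lexicon.get(term, term) for term in query])
```

The final two lines mirror the use described above: the induced lexicon translates a Japanese query into English terms that can then be run against the English portion of a collection. Real parallel text is far noisier than this toy data, so practical systems typically apply frequency thresholds and keep several candidate translations per word.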

The first contribution from the panel was by John White (Litton PRC, USA). He observed that translation engines are now merely one component of complex multilingual systems, so machine translation evaluation should be viewed from the perspective of the contribution of translation to "downstream" applications. White stated that evaluation of multilingual systems is a difficult task because the evaluation of translation results is necessarily subjective, observing that the evaluation of translation and summarization are similar in this regard because there is no one "right" answer in either case.

Noriko Kando (National Center for Science Information Systems (NACSIS), Japan) briefly described her experiences as the organizer of the first NACSIS Test Collection Information Retrieval (NTCIR) evaluation (NACSIS, 1999). NTCIR was designed to evaluate retrieval in a Japanese environment, with a focus on abstracts of scientific papers. The evaluation included three tasks: ad hoc retrieval of mixed language (Japanese and English) documents using Japanese queries, cross-language retrieval of English documents using Japanese queries, and an information extraction task known as term recognition and role analysis. This is the first year for the NTCIR evaluation, and 23 groups from four countries participated.

The discussion centered on the costs of evaluation, the need to establish satisfactory criteria that depend on the component to be assessed, and the fact that some tasks cannot be evaluated on a "right or wrong" basis. Where possible, the creation of "ground truth" data such as TREC relevance assessments can be an important factor in keeping costs down. By contrast, it was claimed that the subjective nature of machine translation evaluation makes it impossible to create useful ground truth for translations that are meant to be read by people: factors of style, quality, utility, and target audience are always involved. Metadata evaluation is an issue about which not much is presently known, at least by that name, but it was observed that the question is reminiscent of the early debates between advocates of controlled vocabularies and free text searching. Early studies in the IR literature showed that retrieval using controlled vocabularies was highly variable, often either a great success or a total failure. Baird reported that monolingual OCR evaluation resources are well established for some languages and that clearly defined metrics exist, but that little evaluation has been done in the context of multilingual document image processing applications. There was general agreement with the observation that evaluation -- whatever the component type being evaluated -- should be primarily task-driven.

The Way Ahead

The final session, chaired by Douglas Oard, brought together a panel with a variety of perspectives on the management of research. The first speaker was Noriko Kando, who identified five layers of technology that should be considered:

Each layer raises specific problems and must be addressed independently before integration is possible.

Hans-Georg Stork (DG XIII/E5, European Commission, Belgium) outlined the funding policies of the European Commission (EC) in the digital library and multilingual information access areas. He described the shift in focus from the Fourth Framework to the Fifth Framework, and the key lines of action within the Fifth Framework. Multilinguality is clearly a priority area for European research -- the Commission gives equal status to each official language and devotes resources to European minority languages as well. In the Fourth Framework, considerable funding was allocated to the development of linguistic resources under the Language Engineering programme. In the Fifth Framework, language resources are expected to be integrated with language technologies in the Information Society Technologies programme, with the greatest concentration of work likely being in two "action lines": Information Access, Filtering, Analysis and Handling, and Human Language Technologies (EC, 1999). The EC and the US National Science Foundation recently released a joint call for proposals to support international collaboration on Multilingual Information Access and Management that generated considerable interest in the research community (NSF, 1999). At the time of the workshop, the proposals received for that call were under review, and a similar joint EC/NSF program for international digital library research is now being considered.

Ronald Larsen (Maryland Applied Information Technology Initiative, until recently with the Defense Advanced Research Projects Agency (DARPA), USA) described the new DARPA program for Translingual Information Detection, Extraction and Summarization (TIDES) (DARPA, 1999). The main goal of TIDES is to dramatically reduce the time it takes to develop integrated systems that apply these technologies to languages for which the required resources do not presently exist. Larsen noted that multilingual information discovery and access will likely require fusion of technologies that have not previously been used together. He also suggested some key points that are considered by funding agencies when evaluating proposals:

Clifford Lynch was the final speaker in the session. He observed that much of our discussion had focused on technologies, and that more discussion about applications -- particularly digital library applications -- would be useful. Lynch also raised issues of modularity, interoperability, and scalability that must be addressed if system performance is to be guaranteed.

The discussion following the presentations focused on collaborative activities, both internationally and across present disciplinary boundaries. The idea of creating some sort of registry for multilingual resources as a way of accelerating research in this field and avoiding duplication of effort was broadly supported. Stork suggested that one or more of the funding agencies would find a proposal to create such a registry attractive if it went significantly beyond what is presently available.

At the end of the session, Oard asked the participants in the workshop to identify problems that they would advise a doctoral student just starting to work in this area to consider. The range of answers revealed the diverse perspectives brought by the participants, and included:

Conclusion

Multilingual Information Discovery and Access is a broad topic, and it is certainly not possible to treat every aspect of it with equal depth in a single day. At the outset we had sought to explore three key issues, and we made some progress with each:

The workshop attracted an extremely broad cross-section of researchers, and important contributions were made both by the speakers and by members of the audience on a number of other points. Among these were:

This MIDAS workshop has been one link in a chain of events that have brought together researchers from around the world and across a broad range of disciplines to consider different aspects of multilingual information discovery and access. The strength of this approach is the ability to include new communities while achieving a sufficient degree of focus to make progress. Because the field lives at the intersection of several disciplines, a sequence of what are essentially "one of a kind" meetings seems to serve the needs of researchers in this area well. Leadership in this field has come from a wide range of sources, and we are eager to see that trend continue. Accordingly, we are interested in discussing the potential for future MIDAS workshops that address some of the opportunities identified above, or perhaps issues that we have not yet foreseen, with people who would be interested in building on our work. We profited from a great deal of valuable advice from those who have gone before, and will gladly share our perspective on what works with the next organizers.

References:

Baker, T., Languages for Dublin Core, D-Lib, December 1998 (available at http://www.dlib.org/dlib/december98/12baker.html).

DARPA Translingual Information Detection, Extraction and Summarization (TIDES) program, 1999 (available at http://www.darpa.mil/ito/ResearchAreas.html).

EC Fifth Framework Program, 1999 (available at http://europa.eu.int/comm/dg13/fifth-framework-programme.htm).

Grefenstette, G. (ed.), Cross-Language Information Retrieval, Kluwer Academic Publishers, Boston, 1998 (some papers available at http://www.rxrc.xerox.com/research/mltt/DMHead/CLIR/SIGIR96CLIR.html).

Hovy, E., Ide, N., Frederking, R., Mariani, J., Zampolli, A. (eds.), Multilingual Information Management: Current Levels and Future Abilities, 1999 (available at http://www.cs.cmu.edu/People/ref/mlim/).

Hull, D. and Oard, D. (eds.), Cross-Language Text and Speech Retrieval: Papers from the 1997 AAAI Spring Symposium, Technical Report SS-97-05, AAAI Press, 1997 (some papers available at http://www.clis.umd.edu/dlrg/filter/sss/papers/).

Klavans, J. and Schäuble, P., Summary Review of the Working Group on Multilingual Information Access, in Report of the Joint US National Science Foundation-European Union Working Groups on Future Developments for Digital Library Research, ERCIM Technical Report, No.98/W004, 1998 (available at http://www.iei.pi.cnr.it/DELOS/NSF/Brussrep.htm).

NACSIS Test Collection Information Retrieval Evaluation, 1999 (available at http://www.rd.nacsis.ac.jp/~ntcadm/index-en.html).

NIST Text Retrieval Conferences, 1999 (available at http://trec.nist.gov).

NSF Multilingual Information Access and Management: Call for International Research Co-operation, 1999 (available at http://www.interact.nsf.gov/cise/html.nsf/html/jointannounce?OpenDocument).

OCLC Dublin Core Metadata Initiative, 1999 (available at http://purl.org/metadata/dublin_core).

Copyright © 1999 Douglas Oard, Carol Peters, Miguel Ruiz, Robert Frederking, Judith Klavans, and Páraic Sheridan



DOI: 10.1045/october99-oard