Multi-Media, Multi-Cultural, and Multi-Lingual Digital Libraries

Or How Do We Exchange Data In 400 Languages?

Christine L. Borgman
Professor and Chair
Department of Library and Information Science
Graduate School of Education & Information Studies
University of California, Los Angeles
Los Angeles, California
cborgman@ucla.edu

D-Lib Magazine, June 1997

ISSN 1082-9873

Introduction
Medium, Culture, and Language
From Local Systems to Global Systems

Language and Character Sets
Library Community Approaches
Summary and Conclusions
References


Introduction

The Internet would not be very useful if communication were limited to textual exchanges between speakers of English located in the United States. Rather, its value lies in its ability to enable people from multiple nations, speaking multiple languages, to employ multiple media in interacting with each other. While computer networks broke through national boundaries long ago, they remain much more effective for textual communication than for exchanges of sound, images, or mixed media -- and more effective for communication in English than for exchanges in most other languages, much less interactions involving multiple languages.

Supporting searching and display in multiple languages is an increasingly important issue for all digital libraries accessible on the Internet. Even if a digital library contains materials in only one language, the content needs to be searchable and displayable on computers in countries speaking other languages. We need to exchange data between digital libraries, whether in a single language or in multiple languages. Data exchanges may be large batch updates or interactive hyperlinks. In any of these cases, character sets must be represented in a consistent manner if exchanges are to succeed. Issues of interoperability, portability, and data exchange (Libicki, 1995) related to multi-lingual character sets have received surprisingly little attention in the digital library community or in discussions of standards for information infrastructure, except in Europe. The landmark collection of papers on Standards Policy for Information Infrastructure (Kahin & Abbate, 1995), for example, contains no discussion of multi-lingual issues except for a passing reference to the Unicode standard (Libicki, 1995, p. 63).

The goal of this short essay is to draw attention to the multi-lingual issues involved in designing digital libraries accessible on the Internet. Many of the multi-lingual design issues parallel those of multi-media digital libraries, a topic more familiar to most readers of D-Lib Magazine. This essay draws examples from multi-media DLs to illustrate some of the urgent design challenges in creating a globally distributed network serving people who speak many languages other than English.

First we introduce some general issues of medium, culture, and language; then we discuss the design challenges in the transition from local to global systems; finally, we address technical matters. The technical issues involve the choice of character sets to represent languages, similar to the choices made in representing images or sound. However, the scale of the language problem is far greater. Standards for multi-media representation are being adopted fairly rapidly, in parallel with the availability of multi-media content in electronic form. By contrast, we have hundreds (and sometimes thousands) of years' worth of textual materials in hundreds of languages, created long before data encoding standards existed. Textual content from past and present is being encoded in language- and application-specific representations that are difficult to exchange without losing data -- if they can be exchanged at all. We illustrate the multi-language DL challenge with examples drawn from the research library community, which typically handles collections of materials in 400 or so languages. These are problems faced not only by developers of digital libraries, but by those who develop and manage any communication technology that crosses national or linguistic boundaries.

Medium, Culture, and Language

Speaking is different from writing, and still images are different from moving images; verbal and graphical communication are yet more different from each other. Speaking in one's native language to people who understand that language is different from speaking through a translator. Language translations, whether oral or written, manual or automatic, cannot be true equivalents due to subtle differences between languages and the cultures in which they originate. Thus the content and effect of messages are inseparable from the form of communication and the language in which they are communicated.

For all of these reasons, we wish to capture DL content in the richest forms possible to assure the maximum potential for communication. We want accurate representations of the original form and minimal distortion of the creators' (author, artist, film maker, engineer, etc.) intentions. At the same time, we want to provide the widest array of searching, manipulation, display, and capture capabilities to those who seek the content, for the searchers or users of these digital libraries may come from different cultures and speak different languages than those of the creators.

Herein lies the paradox of information retrieval: the need to describe the information that one does not have. We have spent decades designing mechanisms to match the expressions of searchers with those of the creators of textual documents (centuries, if manual retrieval systems are considered). This is an inherently insoluble problem due to the richness of human communication. People express themselves in distinctive ways, and their terms often do not match those of the creators and indexers of the information sought, whether human or machine. Conversely, the same terms may have multiple meanings in multiple contexts. In addition, the same text string may retrieve words in multiple languages, adding yet more variance to the results. Better retrieval techniques will narrow the gap between searchers and creators of content, but will never close that gap completely.

Searching for information in multi-media digital libraries is more complex than text-only searching. Consider the many options for describing sounds, images, numeric data sets, and mixed-media objects. We might describe sounds with words, or with other sounds (e.g., playing a tune and finding one like it); we might describe an image with words, by drawing a similar object, or by providing or selecting an exemplar. As Croft (1995) notes in an earlier D-Lib issue, general solutions to multi-media indexing are very difficult, and those that do exist tend to be of limited utility. The most progress is being made in well-defined applications in a single medium, such as searching for music or for photographs of faces.

Cultural issues pervade digital library applications, whether viewing culture at the application level, such as variations in approaches to image retrieval by the art, museum, library, scientific, and public school communities, or on a multi-national scale, such as the differing policies on information access between the United States and Central and Eastern Europe. Designing digital libraries for distributed environments involves complex tradeoffs between tailoring to local cultures and meeting the standards and practices necessary for interoperability with other systems and services (Borgman, et al., 1996).

From Local Systems to Global Systems

The easiest systems to design are those for well-defined applications and well-defined user populations. Under these conditions, designers can build closed systems tailored to a community of users, iteratively testing and refining capabilities. These are rare conditions today, however. More often, we are designing open systems that serve not only a local population, but also remote and perhaps unknown populations. Examples include digital libraries of scholarly materials built by and for one university, then later made openly available on the Internet; business assets databases developed and tested at a local site and then provided to corporate sites around the world; scientific databases designed for research applications, later made available for educational purposes; and library catalogs designed for a local university, later incorporated into national and international databases for resource sharing. Any of these applications could involve content in multiple media and multiple languages.

Design Tradeoffs

Consider how the design issues change from local to distributed systems. In local systems, designers can tailor user interfaces, representation of content, and functional capabilities to the local culture and to the available hardware and software. Input and output parameters are easily specified. If users need to create sounds or to draw, these capabilities can be provided, along with display, capture, and printing abilities in the matching standards. Keyboards can be set to support the local language(s) of input; screens and printers can be set to support the proper display of the local languages as well.

Designers have far less control over digital libraries destined for use in globally distributed environments. Users' hardware and software platforms are typically diverse and rapidly changing. Designers often must specify a minimum configuration or require a minimum version of client software, making tradeoffs between lowering the requirements to reach a larger population and raising them to provide more sophisticated capabilities. The more sophisticated the multi-media or multi-lingual searching capabilities, the higher the requirements are likely to be, and the fewer people are likely to be served.

While good design includes employing applicable standards, determining which standards are appropriate in the rapidly evolving global information infrastructure involves tradeoffs as well. The use of some standards may be legislated by the parent organization or funding agency, and the use of other standards may be a matter of judging which are most stable and which are most likely to be employed in other applications with which the current system needs to exchange data. In the case of character sets for representing text in digital libraries, designers sometimes face a choice between a standard employed within their country to represent their national language and a universal character set in which their national language is more commonly represented in other countries. At present, massive amounts of textual data are being generated in digital form, and represented in formats specific to applications, language, and countries. The sooner the digital library community confronts this tidal wave of "legacy data" in incompatible representations, the more easily this interoperability problem may be solved.

Representation in Digital Form

Although we have been capturing text, images, and sounds in machine-readable forms for several decades, issues of representation became urgent only when we began to access, maintain, exchange, and preserve data in digital form. In information technologies such as film, phonograph records, CD-ROM, and printing, electronic data often are an intermediate format. When the final product was published or produced, the electronic data often were destroyed, and the media (disks, tapes, etc.) reused.

In digital libraries, the perspective changes in two important ways: (1) from static output to dynamic data exchange; and (2) from a transfer mechanism to a permanent archival form. In sound or print recordings, for example, once the record is issued or the book printed, it no longer matters how the content was represented in machine-readable form. In a digital library, the representation matters because the content must be continuously searched, processed, and displayed, and often must be exchanged with other applications on the same and other computers.

When electronic media were viewed only as transfer mechanisms, we made little attempt to preserve the content. Many print publications exist only in paper form, the typesetting tapes used to generate them long since overwritten. Much early television broadcasting was lost, as the recording media were reused or allowed to decay. Now we recognize that digital data must be treated as a permanent form of representation, requiring means to store content in complete and authoritative forms, and to migrate content to new technologies as they appear.

Language and Character Sets

Character set representation is a problem similar to that of representing multi-media objects in digital libraries, yet is more significant due to the massive volume of textual communication and data exchange that takes place on computer networks. Culture plays a role here as well: speakers of all languages wish to preserve their language in its complete and authoritative form. Incomplete or incorrect data exchange results in failures to find information, in failures to authenticate identities or content, and in the permanent loss of information. Handling character sets for multiple languages is a pervasive problem in automation, and one of great concern to libraries, network developers, government agencies, banks, multi-national companies, and others exchanging information over computer networks.

Much to the dismay of the rest of the world, computer keyboards were initially designed for the character set of the English language, containing only 26 letters, 10 digits, and a few special symbols. While variations on the typical English-language keyboard can be used to create words in most other languages, doing so often results in either (1) a loss of data, or (2) encoding characters in a language-specific or application-specific format that is not readily transferable to other systems. We briefly discuss the problems involved in data loss and character encoding, then outline some potential solutions.

Transliteration and Other Forms of Data Loss

Languages written in non-Roman scripts, such as Japanese, Arabic, Chinese, Korean, Persian (Farsi), Hebrew, and Yiddish (the "JACKPHY" languages), and Russian, are transliterated into Roman characters in many applications. Transliteration matches characters or sounds from one language into another; it does not translate meaning. Considerable data loss occurs in transliteration. The process may be irreversible, as variations occur due to multiple transliteration systems for a given language (e.g., Peking vs. Beijing, Mao Tse-tung vs. Mao Zedong (Chinese), Tchaikovsky vs. Chaikovskii (Russian)), and the transliterated forms may be unfamiliar to speakers of that language. Languages written in extensions of the Roman character set, such as French, Spanish, German, Hungarian, Czech, and Polish, are maintained in incomplete form in some applications by omitting diacritics (accents, umlauts, and other language-specific marks) that distinguish their additional characters.

These forms of data loss are similar to those of "lossy" compression of images, in which data are discarded to save storage costs and transmission time while maintaining an acceptable reproduction of the image. Any kind of data loss creates problems in digital libraries. Variant forms of words will not match and sort properly, incomplete words will not exchange properly with digital libraries using complete forms, and incomplete forms may not be adequate for authoritative or archival purposes. The amount of acceptable loss varies by application: far more loss is acceptable in applications such as email, where rapid communication is valued over authoritative form, than in financial or legal records, where authentication is essential.
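In present-day terms, the lossiness of stripping diacritics can be sketched in a few lines of Python (a modern illustration, not part of the systems discussed here): three distinct Hungarian words collapse to a single stripped form, so the operation cannot be reversed.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD), then drop the combining marks -- lossy."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Three different Hungarian words -- "round", "wheel", "(I) ask" --
# all reduce to the same stripped form, so the mapping is irreversible.
for word in ("kerek", "kerék", "kérek"):
    print(word, "->", strip_diacritics(word))  # all print "-> kerek"
```

Because several source words map to one stripped form, no algorithm can restore the original spelling from the stripped text alone, which is exactly why stripped records are inadequate for authoritative or archival purposes.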

Character Encoding

The creation of characters in electronic form involves hardware and software to support input, storage, processing, sorting, display, and printing. The internal representation of each character determines how it is treated by the hardware (keyboard, printer, VDT, etc.) and the application software. Two characters may appear the same on a screen but be represented differently, for example because they occupy different sorting positions in different languages. Conversely, the same key sequence on two different keyboards may produce two different characters, depending upon the internal representation that is generated. Character encoding for digital libraries encompasses all of these aspects, from keyboard input and internal representation through sorting, display, and printing.

Numerous possibilities exist for mismatches and errors in access to digital libraries in distributed environments, considering the vast array of hardware and software employed by DLs and their users and the variety of languages and character encoding systems that may be involved.

Mono-lingual, Multi-lingual, and Universal Character Sets

Many standards and practices exist for encoding characters. Some are language-specific, others are script-specific (e.g., Latin or Roman, Arabic, Cyrillic), and "universal" standards that support most of the world's written languages are now available. Exchanging data among digital libraries that employ different character encoding formats is the crux of the problem.

If mono-lingual DLs all use the same encoding format, such as ASCII for English, data exchange should be straightforward. If mono-lingual DLs use different formats, such as the three encoding formats approved for the Hungarian language by the Hungarian standards office (Számítástechnikai karakterkódok. A grafikus karakter magyar referenciakészlete, 1992), then data exchange encounters problems. Characters generated by a keyboard that is set for one encoding system may not match characters stored under another encoding system; characters with diacritics may display or print incorrectly or not display at all.
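A present-day Python sketch (offered purely as an illustration, not as a description of the systems surveyed) shows how a single stored byte yields different characters depending on which 8-bit encoding the receiving system assumes:

```python
# One stored byte, two different characters, depending on which
# 8-bit encoding the receiving system assumes.
raw = b"\xf5"

latin2 = raw.decode("iso-8859-2")  # Hungarian o with double acute: ő
latin1 = raw.decode("iso-8859-1")  # o with tilde: õ

print(latin2, latin1, latin2 == latin1)  # ő õ False
```

Exchanged in raw form without agreement on the encoding, the byte silently changes meaning in transit; this is the mismatch described above, reduced to its smallest case.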

DLs using the same script-specific formats, such as Latin-2 extended ASCII that encompasses the major European languages, should be able to exchange data with each other. When DLs using Latin-2 attempt to exchange data in those same languages with DLs using language-specific formats, mismatches may occur. Similarly, mismatches may occur when DLs that employ Latin-2 for European languages exchange data with DLs that employ a different multi-lingual set such as the American Library Association character set (The American Library Association Character Set, 1989) commonly used in the United States.

After many years of international discussion on the topic, Unicode appears to be emerging as the preferred standard to support most of the world's written languages. A universal character set offers great promise for solving the data exchange problem. If data in all written languages are encoded in the same format, then data can be exchanged between mono-lingual and multi-lingual digital libraries. Just as the networked world is moving toward hardware platform-independent solutions, adopting Unicode widely would move us toward language-independent solutions to distributed digital libraries and to universal data exchange. Techniques for automatic language translation would be assisted by a common character set standard as well.

Any solution that appears too simple probably is. Major hardware and software vendors are beginning to support Unicode, but it is not yet embedded in much application software. Unicode requires 16 bits to store each character -- twice as much as the 8-bit extended ASCII sets. However, Unicode requires only half as much space as the earlier version of ISO 10646 (32 bits), the competing and more comprehensive universal character set. Unicode emerged as the winner in a long standards battle, eventually merging with ISO 10646, because it was seen as easier to implement and thus more likely to be adopted widely. As storage costs continue to decline, the storage requirements of Unicode will become less of an issue for new applications. Meanwhile, massive amounts of text continue to be generated not only in language- and script-specific encoding standards, but in local and proprietary formats. Any of this text maintained in digital libraries may become "legacy data" that must be mapped to Unicode or some other standard in the future. At present, digital library designers face difficult tradeoffs between the character set standards in use by current exchange partners and the standard likely to be in international use in the future for a broader variety of applications.
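The storage tradeoff can be made concrete with a small Python sketch (a modern illustration: UTF-16 stands in here for Unicode's 16-bit form, and UTF-8, a variable-width Unicode serialization, is shown for comparison):

```python
word = "kerék"  # five characters of Hungarian text

# Compare the storage cost of the same five characters under a
# script-specific 8-bit encoding and two Unicode serializations.
for codec in ("iso-8859-2", "utf-16-le", "utf-8"):
    encoded = word.encode(codec)
    print(f"{codec}: {len(encoded)} bytes")
# iso-8859-2: 5 bytes
# utf-16-le: 10 bytes
# utf-8: 6 bytes
```

The 16-bit form doubles the storage of the 8-bit set, as the text notes, while the variable-width form pays extra bytes only for the non-ASCII characters.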

Library Community Approaches

The international library community began developing large, multi-language digital libraries in the 1960s. Standards for record structure and character sets were established long before the Internet was created, much less Unicode. Hundreds of millions of bibliographic records exist around the world in variations of the MARC (MAchine Readable Cataloging) standard, although in multiple character set encoding formats. OCLC Online Computer Library Center, the world's largest cataloging cooperative, serves more than 17,000 libraries in 52 countries and contains over 30 million bibliographic records with over 500 million records of ownership attached in more than 370 languages (Mitchell, 1994; OCLC Annual Report, 1993; Smith, 1994). OCLC uses the American Library Association (ALA) character set standard, which extends the English-language keyboard to include diacritics from major languages (Agenbroad, 1992; The ALA Character Set, 1989). Text in most other languages is maintained in transliterated form.

The Library of Congress, which contributes its records in digital form to OCLC, RLIN (Research Libraries Information Network, the other major U.S.-based bibliographic utility), and other cooperatives, also does original-script cataloging for the JACKPHY languages mentioned earlier. RLIN pioneered the ability to encode the JACKPHY languages in their original script form for bibliographic records, using available script-specific standards (Aliprand, 1992). Records encoded in full script form are exchanged between the Library of Congress, RLIN, OCLC, other bibliographic utilities in the U.S. and elsewhere, and many digital libraries maintained by research libraries. Catalog cards are printed in script and Romanized forms from these databases, but direct use of the records in script form requires special equipment to create and display characters properly. Records from OCLC, RLIN, and other sources are loaded into the online catalogs of individual libraries, where they usually are searchable only in transliterated forms. Some online catalogs support searching with diacritics, while others support only the ASCII characters. Regardless of the local input and output capabilities, if characters are represented internally in their fullest form, they will be available for more sophisticated uses in the future when the search and display technologies become more widely available.

Libraries always have taken a long-term perspective on preserving and providing access to information. They manage content in many languages and cooperate as an international community to exchange data in digital form. Thus it is not surprising that libraries were among the first institutions to tackle the multi-lingual character set problem. Over the last 30 years, libraries have created massive stores of digital data. Not only do libraries create and maintain new bibliographic records in digital form, but a growing number of the world's major research libraries have converted all of their historical records -- sometimes dating back several hundred years -- into a common record structure. By now, libraries have the expertise and influence to affect future developments in standards for character sets and other factors in data exchange.

The library world is changing, however, as new regions of the world come online. The European Union is promoting Unicode and funding projects to support Unicode implementation in library automation (Brickell, 1997). Automation in Central and Eastern Europe (CEE) has advanced quickly since 1990 (Borgman, in press). A survey of research libraries in six CEE countries, each with its own national language and character set, indicates that a variety of coding systems are in use. As of late 1994, more than half used Latin-2 extended ASCII, one used Unicode, and the rest used a national or system-specific format; none used the ALA character set (Borgman, 1996). The national libraries in these countries are responsible for preserving the cultural heritage of their countries that appears in published form, and thus require that their language be preserved in its most complete and authoritative digital form. Transliterated text or characters stripped of diacritics are not acceptable. Several of these national libraries are now working closely with OCLC, toward the goal of exchanging data in authoritative forms.

As libraries, archives, museums, and other cultural institutions throughout the world become more aware of the need to preserve digital data in archival forms, character set representation becomes a political as well as technical issue. Many agencies are supporting projects to ensure preservation of bibliographic data in digital forms that can be readily exchanged, including the Commission of the European Communities, International Federation of Library Associations, Soros Foundation Open Society Institute Regional Library Program, and the Mellon Foundation (Segbert & Burnett, 1997).

Summary and Conclusions

Massive volumes of text in many languages are becoming available online, whether created initially in digital form or converted from other media. Much of this data will be stored in digital libraries, whether alone or in combination with sounds and images. Digital formats are no longer viewed merely as an intermediate mechanism for transferring data to print, film, tape, or other media. Rather, they have become permanent archival forms for many applications, including digital libraries. DL content is used directly in digital form -- searched, processed, and often reformatted for reuse in other applications. Data are exchanged between DLs, whether in large batch transfers, such as tape loads between bibliographic utilities and online catalogs or electronic funds transfers between financial institutions, or as hyperlinks between DLs distributed across the Internet. In networked environments, searchers speaking many different languages, with many different local hardware and software platforms, may access a single digital library. For all of these reasons, we need to encode characters in a standard form that can support most of the world's written languages.

The first step is for designers of digital libraries to recognize that the multi-lingual character set problem exists. The goal of this essay, and the choice of publication venue, is to bring the problem to the attention of a wider audience than the technical elite who have been grappling with it for many years now. The second step is to take action. The solution will not come overnight, but given the great strides already taken toward platform-independent network applications, and toward standards for exchanging sounds and images, the foundation for progress has been laid.

Designers of networked applications are more aware of interoperability, portability, and data exchange issues than in the past. Experience in migrating data from one application to another provides object lessons in the need to encode data in standard formats. Unicode appears to be the answer for new applications and for mapping legacy data from older applications. However, designers still must weigh factors such as the amount of data currently existing in other formats, the standards in use by other systems with which they must exchange data regularly, the availability of application software that supports Unicode and other universal standards for encoding character sets, and the pace at which conversion will occur. The sooner that the digital library community becomes involved in these discussions, the sooner we will find a multi-media, multi-cultural, and multi-lingual solution to exchanging data in all written languages.

References

Agenbroad, J. E. (1992). Nonromanization: Prospects for Improving Automated Cataloging of Items in Other Writing Systems. Cataloging Forum, Opinion Papers, No. 3. Washington, DC: Library of Congress.

The ALA character set and other solutions for processing the world's information. (1989). Library Technology Reports, 25(2), 253-273.

Aliprand, J.M. (1992). Arabic script on RLIN. Library Hi Tech, 10(4), Issue 40, 59-80.

Borgman, C.L. (1996). Automation is the answer, but what is the question? Progress and prospects for Central And Eastern European Libraries. Journal of Documentation, 52(3), 252-295.

Borgman, C.L. (In press). From acting locally to thinking globally: A brief history of library automation. Library Quarterly. (To appear July, 1997)

Borgman, C.L.; Bates, M.J.; Cloonan, M.V.; Efthimiadis, E.N.; Gilliland-Swetland, A.; Kafai, Y.; Leazer, G.L.; Maddox, A. (1996). Social Aspects Of Digital Libraries. Final Report to the National Science Foundation; Computer, Information Science, and Engineering Directorate; Division of Information, Robotics, and Intelligent Systems; Information Technology and Organizations Program. Award number 95-28808. <http://www.gslis.ucla.edu/DL/>

Bossmeyer, C.; Massil, S.W. (Eds.). (1987). Automated systems for access to multilingual and multiscript library materials : problems and solutions : papers from the pre-conference held at Nihon Daigaku Kaikan Tokyo, Japan, August 21-22, 1986. International Federation of Library Associations and Institutions, Section on Library Services to Multicultural Populations and Section on Information Technology. Munich and New York: K.G. Saur.

Brickell, A. (1997). Unicode/ISO 10646 and the CHASE project. In M. Segbert & P. Burnett (eds.). Proceedings of the Conference on Library Automation in Central and Eastern Europe, Budapest, Hungary, April 10-13, 1996. Soros Foundation Open Society Institute Regional Library Program and Commission of the European Communities, Directorate General XIII, Telecommunications, Information Market and Exploitation of Research, Libraries Programme (DG XIII/E-4). Budapest: Open Society Institute.

Croft, W.B. (1995). What do people want from information retrieval? (The Top 10 Research Issues for Companies that Use and Sell IR Systems). D-Lib Magazine, November. <http://www.dlib.org/november95/11croft.html>

Kahin, B., & Abbate, J. (Eds.). (1995). Standards policy for information infrastructure. Cambridge, MA: MIT Press.

Libicki, M.C. (1995). Standards: The rough road to the common byte. In B. Kahin & J. Abbate (Eds.), Standards policy for information infrastructure (pp. 35-78). Cambridge, MA: MIT Press.

Számítástechnikai karakterkódok. A grafikus karakter magyar referenciakészlete. (1992). Budapest: Magyar Szabványügyi Hivatal. (Character sets and single control characters for information processing. Hungarian Reference version of graphic characters. Budapest: Hungarian Standards Office.)

Mitchell, J. (1994). OCLC Europe: Progress report, March, 1994. European Library Automation Group Annual Meeting, Budapest.

OCLC Online Computer Library Center, Inc. (1993). Furthering access to the world's information (Annual Report 1992/93). Dublin, OH: Author.

Segbert, M. & Burnett, P. (eds.). (1997). Proceedings of the Conference on Library Automation in Central and Eastern Europe, Budapest, Hungary, April 10-13, 1996. Soros Foundation Open Society Institute Regional Library Program and Commission of the European Communities, Directorate General XIII, Telecommunications, Information Market and Exploitation of Research, Libraries Programme (DG XIII/E-4). Budapest: Open Society Institute.

Smith, K.W. (1994). Toward a global library network. OCLC Newsletter, 208, 3.

Copyright ©1997 Christine L. Borgman


cnri.dlib/june97-borgman