Dublin Core in Multiple Languages:
Esperanto, Interlingua, or Pidgin?

Thomas Baker
Asian Institute of Technology
Bangkok, Thailand

Abstract

The experience of artificial languages like Esperanto suggests they need good governance to control divergence in usage, but flexibility to evolve and grow. Language engineers have neglected to consider pidgins --- simplified hybrids invented spontaneously by speakers of different languages. If Dublin Core is pidgin metadata, perhaps it needs an interlingua --- a language-neutral set of elements mediating between richer sets --- for the collective negotiation of meanings and for managing the inevitable tension between simplicity and complexity. Adaptations of Dublin Core in languages other than English would not be mere translations of a canon, but equal participants in an ongoing revision of that canon.

Keywords

Dublin Core, artificial languages, knowledge representation, metadata, multilinguality, ontologies, pidgins and creoles, thesauri.

1. Simple versus complex

The Dublin Core is ``an ongoing effort to form an international consensus on the semantics of a simple description record.''[17] It defines fifteen core elements --- Creator, Title, Publisher, Subject, and so on --- for describing ``document-like objects.'' Designed to be simple enough for people unschooled in the science of cataloging to tag their documents for indexing by Web harvesters such as Alta Vista, it is being adapted for a wide range of applications. Some users see Dublin Core as a simple replacement for richer description formats, such as USMARC for library catalogs. Some want to use it as a simple interface to these richer systems. But others would like to take Dublin Core as a framework within which to construct formats as rich and elaborate as any others available today (``Dublin Core on Steroids'').

1.1. The problem with sub-structure

Dublin Core's community of early adopters falls roughly into two pragmatic camps, Minimalists and Structuralists. Minimalists value the generic simplicity of the fifteen basic categories and propose to use them ``as is'' for coarse-grained indexing and retrieval by Web harvesters. Structuralists want to customize Dublin Core for particular uses and increase the precision of retrieval by specifying narrower semantics for the elements. For example, they want to qualify an element's name to specify that a given Creator is a Composer as opposed to an Author.

In practice, however, Structuralists have proposed qualifiers that in effect extend the semantics of elements rather than just narrowing them. For example, they want to extend the Creator element to include an author's Affiliation. Such an extension could be very useful for customizing the Dublin Core for a particular local use. However, there are logical problems with this: an Affiliation is not a kind of Author; rather, an Affiliation is something an Author has. An uncontrolled proliferation of such qualifiers would muddy the semantics of Dublin Core. And since Web harvesters are unlikely to recognize all of these qualifiers, they would either have to ignore any qualified elements (and not index Affiliation) or else go ahead and index Affiliations as if they were names of Creators. In either case, the precision of retrieval would suffer. We will return below to this problem with sub-structure.
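
To make the problem concrete, the following sketch (in Python, with invented qualifier names --- Dublin Core prescribes no such syntax) shows why a harvester that ignores qualifiers ends up indexing an Affiliation as if it were the name of a Creator:

    # A hypothetical record with qualified elements, sketched as a Python
    # dictionary; the dotted qualifier names are invented for illustration.
    record = {
        "Title": "Dublin Core in Multiple Languages",
        "Creator.Author": "T. Baker",              # narrowing: an Author IS a kind of Creator
        "Creator.Affiliation": "Asian Institute of Technology",
                                                   # extension: something a Creator HAS
        "Subject": "metadata",
    }

    def naive_creator_index(rec):
        """Fold every Creator.* qualifier into the unqualified Creator element."""
        return [value for key, value in rec.items() if key.split(".")[0] == "Creator"]

    print(naive_creator_index(record))
    # ['T. Baker', 'Asian Institute of Technology'] -- the affiliation pollutes Creator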

1.2. Dublin Core in other languages

Until now, the Dublin Core has been defined and implemented only in English. But its fifteen categories are broad enough in scope that they could probably be expressed just as well in any other language, such as Arabic or Japanese. Publisher and Resource Type are not among the core terms found in all languages of the world, as are boy, drink, or water. But as linguists now generally agree, technical terms that have no exact translation equivalents in other languages can in principle be explained with a phrase, or new terms can be defined.[5] There is no reason to think that anything in Dublin Core is inherently untranslatable.

So one might take the Dublin Core in English as the canonical version and simply prepare translations in multiple languages. This has already been done for German and Thai.[13,14] And there are precedents for this among the library standards. The guidelines for International Standard Bibliographic Description (ISBD) are available in many translations. Universal Decimal Classification (UDC) and Dewey Decimal Classification (DDC), which aim at achieving multilingual universality through their language-neutral, numerical notations, have both been translated into many languages --- thirty for DDC, which is used in 135 countries. However, such systems must continually be revised as new knowledge develops. And in practice, this means revising a canonical version, usually in English, and accepting lengthy delays while translations are prepared.

This paper argues that this need not be the model for making Dublin Core operational across multiple languages. Rather than treat local instantiations of Dublin Core in multiple languages as mere translations of a canonical version (``Dublin Core with Sub-Titles''), one could treat them as equal participants in an ongoing process of collective negotiation and revision.

1.3. Is Dublin Core a Holy Grail?

A description model ensuring semantic interoperability across many languages and scientific disciplines, simple enough for non-experts, small enough to be translated into every language, yet flexible enough to be customized locally with qualifiers in Japanese, German, and Thai --- could Dublin Core be the Holy Grail that has long eluded universalists in the library world? It very well could be, were it not for the problem with sub-structure. And one is reminded of Esperanto, the proposal for a universal language which in over one hundred years has yet to become much more than a curiosity. Dublin Core resembles Esperanto inasmuch as both are simplified, universalist linguistic systems advocated by movements that hope to improve international compatibility. Why should Dublin Core succeed where Esperanto has not?

2. Universalist Esperantos

Esperanto is only one of a (conservatively) estimated five hundred artificial languages to have been proposed since the seventeenth century. These languages were for the most part products of two distinct periods: the seventeenth century, mainly in London and Oxford; and after 1875, mainly in Europe and the United States.

2.1. Systems invented a priori

The products of the seventeenth century are collectively labeled a priori because the emphasis was on constructing languages ex novo on the basis of philosophical principles. Latin was in decline, so men of learning were seeking a new universal language for science, trade, and missionary activities. The prospect attracted the likes of Bacon, Descartes, and Leibniz.

Theologians of the day liked to speculate about the divine dialect mankind had shared before the Tower of Babel and about the iconic, almost telepathic communion of angels. Plans were devised whereby mankind once again might praise God in a single tongue. There was much interest in musical notation, numerals, shorthand, and ideograms, which seemed to encode universal concepts independently of language. The languages that resulted from such considerations often used invented symbols or notation. Some were based on comprehensive taxonomies and precise rules, the better to reflect the orderliness of creation. Most were designed to be written, though some could also be pronounced.

John Wilkins (1668) arranged ``all things and notions'' in a large chart under forty main headings, according to which each thing and notion was assigned an artificial word; the spelling of his word for dog reflected its position under beast, viviparous, rapacious, and dog-like.[15] George Dalgarno's philosophical language (1661) differentiated the meanings of artificial words by altering vowels and inserting consonants according to complex rules.

Missionaries returning from Asia reported that speakers of Mandarin and Cantonese could not understand each other's speech, yet shared a common script, which also was used extensively in Korea, Vietnam, and Japan. The notion took root that ``ideographs'' could convey ideas directly to speakers of totally unrelated languages, much as the Arabic numeral 3 is immediately understandable to a Russian and a Spaniard. (This misconception has been soundly refuted --- Chinese characters are basically phonetic and thus no more universal than the Latinate roots shared by European languages.[3])

The urge to create languages a priori did not survive as a movement beyond the seventeenth century, though related proposals have appeared sporadically since then. A ``universal musical language'' of the early nineteenth century, Solresol, constructed word semantics with sequences of diatonic notes that could be whistled, played, or spoken. More recently, Lincos (1960) was devised for ``cosmic intercourse'' with intelligent life in distant galaxies. And Margaret Mead (1968) called on scholars to create a language-neutral script for communicating high-level scientific concepts.[10] In a general sense, one might see formal logic notation, Macintosh icons, signage at airports and sporting events, scientific nomenclature, thesauri, and the composite codes of Dewey Decimal Classification as direct or spiritual descendants of this movement.

2.2. Languages synthesized a posteriori

The languages designed after about 1875 are called a posteriori because they were based largely on the comparison and synthesis of existing natural languages. Most of them used words or roots from Western European languages along with simple syntax and grammar. Their rise in popularity coincided with the expansion of European colonial empires overseas. At the time, linguistic diversity was seen as a cause of international friction, but returning to Latin was not an option, and agreement on English or French seemed politically unlikely, so it was widely believed that the interests of science, progress, and peaceful coexistence would best be served by settling on an international auxiliary language.

The first proposal to achieve much success was Volap\"uk or ``World Speak'' (1879), a German priest's creative synthesis of German, English, and Latin. Morphologically complex, Volap\"uk had over half a million possible verb forms. Its decline in popularity coincided with the rise of Esperanto (1887), a simpler language with a more Slavic flavor to its syntax and spelling. The decades that followed saw the creation of many more such languages --- several of them modifications of Esperanto, such as Ido (1907) and Novial (1928).

Instead of inventing artificial languages, others tried to simplify the grammar and vocabularies of existing natural languages. Kolonialdeutsch (1916) was intended to be sufficient for German masters giving orders to ``natives,'' but not complete enough to allow the latter to eavesdrop or debate among themselves. Basic English (1930) offered a list of 850 English words --- short enough to be printed ``on a single sheet of business notepaper'' yet long enough to express all the ``root ideas'' needed for practical communication. Its author proposed to promote its use through phonograph records and International Basic News on the radio.

2.3. Why artificial languages didn't fly

With the qualified exception of Esperanto, none of the offerings mentioned above are used much today. The a priori systems, in particular, were difficult to memorize, let alone write or pronounce. Many of these systems began with simple principles but quickly became arbitrary and complicated. The systems based on taxonomies could not easily be extended with new concepts and only ever appeared rational to those who shared their philosophies. None of the a priori languages attracted a stable community of users.

The a posteriori languages were a bit more flexible and adaptable. Most were created by a single author, working in isolation, then adopted by a small circle of followers. As the movements grew, early users had noisy disagreements about whether to accommodate new words or constructions. Debates often reflected a conflict between the everyday needs of speakers and the demands of specialists. The Volap\"uk movement split over a conflict between, in effect, Minimalists and Structuralists: its inventor, Johann Martin Schleyer, wanted Volap\"uk to express the full range of semantic distinctions of natural languages, while some of his followers wanted to simplify it so as to improve its chances of adoption as an international auxiliary. The Esperanto movement, founded by the Polish ophthalmologist Ludwik Lejzer Zamenhof, likewise argued over issues such as its use of the circumflex, and factions broke off to promote alternative versions. Umberto Eco concludes that ``Such seems to be the fate of artificial languages: the `word' remains pure only if it does not spread; if it spreads, it becomes the property of the community of its proselytes, and (since the best is the enemy of the good) the result is `Babelization'.''[4]

Artificial languages never succeeded in getting the support of a government, though the League of Nations briefly considered adopting Esperanto. Die-hard exponents of Esperanto still believe that its best chances lie in marketing it only as an auxiliary language, promoting its use in mass media, and forming an international supervisory association to maintain standards, review proposals, and control the language's evolution. As Eco points out, past failures do not mean there will be no attempt to find political consensus for such an auxiliary in the future. Were this to happen, he speculates, success would depend on instituting control from above, though not so tightly as to stifle the auxiliary language's capacity to express new everyday experiences.[9]

One would then need to resolve just how institutional control from above would relate to natural change in usage from below. Two scholars of ``language engineering,'' Donald Laycock and Peter M\"uhlh\"ausler, suggest the path to an answer. Natural languages, they point out, are versatile and open-ended, whereas most invented languages were designed as closed systems, rather strictly governed by rules, lacking in ``linguistic naturalness,'' and ill-suited to change. To achieve success, they argue, language designers should provide for people's propensity to change or create rules, adapt systems, and negotiate meanings. And to make progress along these lines, they conclude, language engineers need to examine how communities of users interact spontaneously to create pidgins.[8]

3. Pidgins as makeshift hybrids

Pidgins are makeshift, hybrid languages that arise when speakers of different languages in regular but superficial contact must work together or conduct trade. They usually have small vocabularies (borrowed largely from a socially dominant group), little inflection, and loose word order. Emphasis is achieved with reduplication and gestures. In the absence of grammatical precision, speakers must sometimes resort to elaborate circumlocutions, and usage is inconsistent between speakers. Historically, pidgin languages have arisen among hired or slave workers on ethnically mixed plantations, though ``pidginization'' occurs continuously in holiday resorts, port cities, and immigrants' workplaces.

3.1. Creoles as complexified pidgins

As a pidgin becomes more valuable to its users, for example to conduct business, it stabilizes, its vocabulary expands, and it becomes flexible enough to be used as a speaker's primary language. As such, it must meet all of a speaker's linguistic needs, so it acquires prepositions, more words, and a syntax less dependent on context. Researchers find that when children are raised using a pidgin as their mother tongue during the critical period before adolescence, they use their instinctive language skills to improvise further grammatical complexity, transforming their parents' crude pidgins into grammatically richer, more expressive ``creoles.'' Creoles are bona fide languages, with subtle grammatical markers and consistent word orders.

Steven Pinker, a linguist in the Chomskian tradition, cites an example. After 1979, the new Sandinista government in Nicaragua created special schools for the country's deaf students. Between classes on lip-reading, the ten-year-old children invented a pidgin sign language on the playground. But when younger pupils, aged four and older, learned this pidgin from their elders, they came to sign more fluently and efficiently. It appears that the younger children improvised and standardized a sign language creole. Pinker concludes that, as with other artificial languages created by theoreticians, ``Educators [...] have tried to invent sign systems, sometimes based on the surrounding spoken language. But these crude codes are always unlearnable, and when deaf children learn from them at all, they do so by converting them into much richer natural languages.''[12]

This process of complexification is not always just the work of children. Tok Pisin is a well-documented example of a pidgin that has become stabilized and extended without undergoing full creolization. Over the past century, it has become a lingua franca for about 1.5 million people and is the primary language of debate in the parliament of Papua New Guinea.

3.2. Dublin Core as pidgin metadata

The Dublin Core element set is certainly not a full language, but it does seem akin in character and function to an artificial language. Indeed, the conditions under which the two arose are in some respects comparable: a posteriori languages became an issue at a time of global economic expansion, as did international librarianship and Dewey Decimal Classification. Dublin Core took hold as an idea in the year when the general public first became aware of the Internet. It is fitting that Dublin Core should have started as a hall conversation at the Second International World-Wide Web Conference in Chicago (1994).

However, the process of proposing, refining, and elaborating Dublin Core has been significantly different from that of Volap\"uk and Esperanto. Unlike Schleyer and Zamenhof, Stuart Weibel has defined his role less as inventor than as facilitator of a process. That process has benefitted from the unprecedented availability of email, Web sites for shared drafts, mailing lists, and cheap air travel, which together have sustained the interaction of several dozen scholars and practitioners and hundreds more interested observers in negotiating the emerging standard.

At one of the early workshops, Ricky Erway defined Dublin Core as a phrase book for the ``virtual tourist'' who needs to browse collections in unfamiliar fields. The metaphor is especially apt because tourists are inclined to pidginize. Dublin Core is like a set of pidgin metadata elements created by natives of different user communities. One might carry the analogy a step further and suggest that the customization of Dublin Core with qualifiers for specific user communities represents its creolization. Or if creolization is too strong a word, since it implies the action of an inbuilt bioprogram for language acquisition, one might speak at least of the pidgin's stabilization and extension, as with Tok Pisin.

Either way, real pidgins are living languages that continually evolve through use in public speech and the mass media. If pidgin metadata is not to be constrained too tightly by its own rules from evolving naturally, it will need a mechanism that supports such collective, ongoing negotiation. This mechanism could resemble an interlingua.

4. Language-neutral interlinguas

An interlingua, in the sense intended here, is a language-neutral construct used to establish the basic semantic relations between concepts in several languages. This is not to be confused with Latino Sine Flexione, also known as Interlingua (1903), a sort of ``pidgin Latin,'' nor with the Interlingua developed between 1924 and 1951 by the International Auxiliary Language Association.[2] Nor is it the same as ``interlanguage,'' a term linguists use for the flawed, hybrid product of an adult's struggles to learn a second language. Somewhat closer in meaning are attempts to identify a small number of core concepts (``features'') that can be used to define all of the words in a language.[5] Such ``parametric'' systems have been proposed for use in machine translation, whereby one might translate an expression from language A into language B by equating both to a universal expression in metalanguage C.[4]

4.1. Interlingua as a link between wordnets

The interlingua I have in mind is that designed to support cross-language retrieval in the project EuroWordNet.[16] The EuroWordNet system has a set of monolingual wordnets (ontologies) containing all of the basic words of Spanish, English, Dutch, and Italian. Internally, each wordnet consists of clusters of words semantically linked to one another by relationships such as ``same as,'' ``kind of,'' and ``part of'' (the full set of relationships makes subtler distinctions, such as near-synonym, sub-event, and role).

To integrate these wordnets into a single system, EuroWordNet considered linking them all pair-wise. However, the number of pair-wise mappings grows with every language added, making the approach hard to scale up to additional languages and a nightmare to maintain. They also considered linking all of the other monolingual wordnets to the English wordnet. However, the lexical configurations and semantic scopes specific to the various languages would have been lost in trying to map them onto any one of the languages. For example, the Italian word dito refers to both fingers and toes. No one language incorporates the subtleties of all the others.

Instead, they decided to link the monolingual wordnets to an interlingua --- a flat, unstructured superset of concepts found in all of the languages. Words are linked to the closest meanings in the interlingua via shared equivalence or near-equivalence relations. Figure 1 illustrates that lions are mammals and that lions have paws and a mane. The Dutch, Spanish, English, and Italian words for mammal, lion, paw, and mane are established as synonyms of each other through their parallel links to the respective concepts in the interlingua.

Within the interlingua, these concepts are not linked semantically among themselves, for their positions in language-specific lexical configurations may differ; no such links could do equal justice to all languages. This design maintains the richness and diversity of the various languages within their respective wordnets, while supporting some useful semantic fuzziness in the links between them (for example, dito is linked to fingers, toes, and fingers-and-toes).
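
The design can be caricatured in a few lines of Python. The concept identifiers and word lists below are invented for illustration and are not drawn from EuroWordNet itself; the point is only that words meet in a flat interlingua rather than being mapped onto one another directly:

    # A flat, unstructured interlingua: concepts with glosses, but no semantic
    # links among them.
    interlingua = {
        "ILI-001": "mammal",
        "ILI-002": "lion",
        "ILI-003": "finger",
        "ILI-004": "toe",
    }

    # Each monolingual wordnet links its words to the nearest interlingua
    # concepts through equivalence or near-equivalence relations.
    wordnets = {
        "en": {"mammal": ["ILI-001"], "lion": ["ILI-002"],
               "finger": ["ILI-003"], "toe": ["ILI-004"]},
        "it": {"mammifero": ["ILI-001"], "leone": ["ILI-002"],
               "dito": ["ILI-003", "ILI-004"]},   # dito covers fingers and toes
    }

    def equivalents(word, source, target):
        """Words in the target wordnet linked to the same interlingua concepts."""
        concepts = set(wordnets[source].get(word, []))
        return sorted(w for w, links in wordnets[target].items()
                      if concepts & set(links))

    print(equivalents("dito", "it", "en"))   # ['finger', 'toe']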

4.2. Dublin Core as a structured interlingua?

One might conceptualize Dublin Core as a similar sort of interlingua (see Figure 2[6]). Only instead of serving as a bridge between wordnets, it would connect both to richer description formats, such as GILS and USMARC, and to Dublin Cores in other languages or in locally customized versions. And its purpose would be different. EuroWordNet seeks to link existing wordnets to an interlingua ``bottom up,'' so structure within the interlingua is not primarily at issue. A Dublin Core interlingua could define a complete system with its own internal logic. Some of its sub-elements would have been invented specifically for Dublin Core; others would be acquired from richer element sets via crosswalks. Defining the semantic relations among these sub-elements within Dublin Core would serve to guide indexing and harvesting.
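
By way of illustration only, such a crosswalk might be represented as a simple mapping. The dotted sub-element name below is hypothetical; the USMARC tags and subfields are the familiar ones for main entry, title, and publisher:

    # A sketch of a crosswalk from USMARC fields to Dublin Core terms.
    # "Creator.Author" is a hypothetical sub-element of the interlingua.
    usmarc_to_dc = {
        "100$a": "Creator.Author",   # Main entry -- personal name
        "245$a": "Title",            # Title statement
        "260$b": "Publisher",        # Name of publisher
    }

    def crosswalk(marc_record):
        """Translate a flat {tag: value} USMARC record into Dublin Core terms."""
        return {usmarc_to_dc[tag]: value
                for tag, value in marc_record.items() if tag in usmarc_to_dc}

    print(crosswalk({"100$a": "Baker, Thomas",
                     "245$a": "Dublin Core in Multiple Languages"}))
    # {'Creator.Author': 'Baker, Thomas', 'Title': 'Dublin Core in Multiple Languages'}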

4.3. Harvesting sub-structure

Some Web harvesters, specifically oriented to Dublin Core, will surely be set up to exploit the extended semantics of richly qualified elements. One might call these Structuralist Harvesters. As shown in Figure 3, they would correctly index T. Baker, baker@gmd.de, Prof., and 1957. In contrast, a Minimalist Harvester might limit itself to sub-structure that represents ``semantic narrowings'' (hyponyms) of the fifteen core elements and focus merely on ensuring that T. Baker is not left out of the Creator field simply because he has been qualified as an Author.

But where would this leave Generic Harvesters? Global services of the Alta Vista sort will probably not want to program their indexing robots to process all of the sub-structure that people will build into their Dublin Core metadata. Would they index T. Baker, baker@gmd.de, Prof., and 1957 all together under Creator? This would pollute the Creator element with undifferentiated email addresses, birthdates, affiliations, and terms of rank. Or would they index only the Author --- one of the top ten most popular qualifiers, let's suppose --- and ignore the rest? In the absence of simple conventions to make such distinctions, Dublin Core metadata could become quite messy.
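The difference between the harvesters can be sketched as two indexing policies applied to one and the same qualified record (again in Python, with invented qualifier names; a Structuralist Harvester would simply keep the full record, indexing each qualified field under its own heading):

    # One qualified record, two indexing policies. Qualifier names are
    # illustrative, not part of any approved Dublin Core scheme.
    record = {
        "Creator.Author":      "T. Baker",
        "Creator.Email":       "baker@gmd.de",
        "Creator.Rank":        "Prof.",
        "Creator.DateOfBirth": "1957",
    }

    NARROWINGS = {"Author"}   # qualifiers accepted as kinds of the core element

    def minimalist(rec):
        """Keep unqualified elements and semantic narrowings; drop the rest."""
        out = {}
        for key, value in rec.items():
            element, _, qualifier = key.partition(".")
            if not qualifier or qualifier in NARROWINGS:
                out.setdefault(element, []).append(value)
        return out

    def generic(rec):
        """Ignore qualifiers entirely and fold everything into the core element."""
        out = {}
        for key, value in rec.items():
            out.setdefault(key.partition(".")[0], []).append(value)
        return out

    print(minimalist(record))   # {'Creator': ['T. Baker']}
    print(generic(record))      # {'Creator': ['T. Baker', 'baker@gmd.de', 'Prof.', '1957']}

The Generic policy shows how undifferentiated email addresses, ranks, and birthdates end up polluting the Creator element.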

4.4. Managing complexification

The EuroWordNet project has a strict update procedure for adding language-specific concepts as new entries in the interlingua. Sites that can find no equivalent for a given word generate a potential new entry, complete with a clear definition in English. Using split-screen navigation tools, someone periodically checks these suggestions for overlap with existing interlingua entries and draws up formal recommendations for revisions. When the interlingua is updated, all sites are supposed to examine the new concepts for additional links to their own. Having the interlingua as their focus, the maintainers of the various wordnets communicate about revisions one-to-all, not many-to-many.

This is not unlike a procedure suggested by John Kunze for managing the evolution of Dublin Core.[7] Kunze envisioned a canonical Core, along with a mechanism for announcing local or experimental extensions and a formal review and approval process for accepting them into the canon. Maintainers of the Core would need to examine proposed additions for overlap and conflict with existing sub-elements. As in EuroWordNet, related terms of significantly different scope could be registered in the interlingua side-by-side. Their multiple definitions would appear as alternates, as in a natural language dictionary. The interlingua would consist of a stable base of approved elements surrounded by an evolving set of elements in less formal use. All implementers of Dublin Core in whatever form or language could participate by posting or proposing new sub-elements.
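
Under stated assumptions --- the element names and the registry interface below are invented here, not prescribed by Kunze or by Dublin Core --- such a registry might be sketched as follows:

    # A sketch of a registry: a stable base of approved elements surrounded by
    # an evolving set of proposals in less formal use. Names are illustrative.
    canon = {"Creator", "Title", "Publisher", "Subject"}   # approved base (abridged)
    proposals = {}                                         # name -> list of definitions

    def propose(name, definition):
        """Post a local or experimental sub-element for later review."""
        if name in canon:
            raise ValueError(name + " is already part of the approved canon")
        # Related terms of different scope may be registered side by side;
        # their definitions accumulate as alternates, as in a dictionary.
        proposals.setdefault(name, []).append(definition)

    def approve(name):
        """A formal review accepts a proposal into the canon."""
        canon.add(name)
        proposals.pop(name, None)

    propose("Creator.Affiliation",
            "The institution with which a creator is affiliated")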

This process could be overseen by something like the Usage Panel of the American Heritage Dictionary, whose 173 writers, critics, and scholars help its editors find a balance between descriptions of actual usage and prescriptions of preferred forms, while evaluating potential entries against ``the fundamental linguistic virtues --- order, clarity, and conciseness.''[11]

Unless the Dublin Core community were to adopt some language-neutral way to express element names, such as the numbers of Dewey Decimal Classification, it would seem expedient to follow EuroWordNet in using English. The glosses for concepts in EuroWordNet's interlingua read like English dictionary definitions, such as ``a finger-like part of vertebrates'' or ``any substance that can be metabolized.''

And in the absence of convincing implementations of, say, SGML tags in multiple languages, it seems practical to name the sub-elements in English too. Or if the Web's future Resource Description Framework were to support it, perhaps each element could be identified with both a universal name (its Dublin Core name, in English) and a local name in the local language. This way, designers of local systems could invent element names for local uses independently of Dublin Core, then fill in the blanks for universal names as matches to Dublin Core were identified.[1]
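
The dual-naming idea might be modelled as below; the German element labels are invented for illustration, and the point is only that the universal slot can be filled in later as matches to Dublin Core are identified:

    # Locally invented element names paired with universal (Dublin Core,
    # English) names; a missing universal name simply means no match has
    # been identified yet. The German labels are illustrative.
    local_elements = [
        {"local": "Verfasser", "universal": "Creator"},
        {"local": "Titel",     "universal": "Title"},
        {"local": "Signatur",  "universal": None},    # no Dublin Core match yet
    ]

    def universal_name(local_name):
        """Look up the Dublin Core name, if any, for a locally named element."""
        for entry in local_elements:
            if entry["local"] == local_name:
                return entry["universal"]
        return None

    print(universal_name("Verfasser"))   # Creator
    print(universal_name("Signatur"))    # None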

Such questions will be resolved in the marketplace of practice. The interlingua could constitute this market's forum --- both a reference model for users and harvesters and the locus of ongoing evolution.

References

[1]Thomas Baker. Metadata Semantics Shared across Languages: Dublin Cores in Languages Other than English. http://www.cs.ait.ac.th/~tbaker/Cores.html, 1997.

[2]David Crystal. Artificial languages. In: The Cambridge Encyclopedia of Language. Cambridge (Eng): Cambridge University Press, pp. 352-356, 1987.

[3]John DeFrancis. The Chinese Language: Fact and Fantasy. Honolulu: University of Hawaii Press, 1984, p. 159.

[4]Umberto Eco. The Search for the Perfect Language. Oxford: Blackwell, 1995, pp. 319, 346.

[5]Joseph H. Greenberg. A New Invitation to Linguistics. Garden City (NY): Anchor Books, 1977, pp. 57, 126.

[6]Jon Knight and Martin Hamilton. Dublin Core Qualifiers, ROADS Project, Department of Computer Studies, Loughborough University, http://www.roads.lut.ac.uk/Metadata/DC-Qualifiers.html, 1997.

[7]John Kunze. A Unified Element Vocabulary for Metadata. http://www.ckm.ucsf.edu/personnel/jak/dist.html, 1996.

[8]Donald C. Laycock and Peter M\"uhlh\"ausler. Language Engineering: Special Languages. In: An Encyclopaedia of Language. London: Routledge, pp. 843-875, 1994, p. 871.

[9]Andr\'e Martinet, 1991. Cited in Eco, p. 332.

[10]Margaret Mead and Rudolf Modley, 1968. Cited in DeFrancis, p. 164.

[11]Geoffrey Nunberg. Usage in the American Heritage Dictionary: the Place of Criticism. In: The American Heritage Dictionary of the English Language, Third Edition. Boston: Houghton Mifflin Company, pp. xxvi-xxx, 1992.

[12]Steven Pinker. The Language Instinct. New York: Harper Collins, 1994, p. 36.

[13]Diann Rusch-Feja. Dublin Core Version 1.0 in German. http://www.mpib-berlin.mpg.de/DOK/metatagd.htm, 1996.

[14]Praditta Siripan. Dublin Core in Thai. National Science and Technology Development Agency, Bangkok, Thailand, 1997.

[15]J.L. Subbiondo. Universal Language Schemes in Seventeenth-Century Britain. In: Encyclopedia of Language and Linguistics, Vol. 9, pp. 4841-4845, Oxford: Pergamon, 1994.

[16]Piek Vossen, Pedro Diez-Orzas, Wim Peters. Multilingual design of EuroWordNet. http://www.let.uva.nl/~ewn/Vossen.ps, 1997.

[17]Stuart Weibel, Renato Iannella, Warwick Cathro. The 4th Dublin Core Metadata Workshop Report. D-Lib Magazine, June 1997, http://www.dlib.org/dlib/june97/metadata/06weibel.html, 1997.