Technical Aspect of Next Generation Digital Library Project

Hiroshi Mukaiyama
Japan Information Processing Development Center (JIPDEC)
Tokyo, Japan
hiro@jipdec.or.jp

Abstract

Digital libraries are one of the central and most compelling applications for the highly information-based societies of the 21st century. Developing such a system requires three kinds of technology. The first is a system architecture that defines the overall system structure and provides common services and interfaces. The second is the set of individual technologies, including search, retrieval, and contents entry technologies. The third is an integration technology that combines the individual technologies into a system on top of the system architecture. The system architecture plays the central role and should be designed to interoperate with international standards and de facto standards, because digital libraries have to be open and interconnectable.

Based on the submissions to our request for proposal, we are now developing the following technologies: a system architecture that consists of a messaging architecture, an agent architecture, a multimedia database architecture, and an application architecture; individual technologies such as digital conversion of documents, an intellectual information retrieval agent, a selective information distribution agent, concept-based text retrieval, hypermedia retrieval using 3D visualization, and content-based retrieval for video; and integration technologies such as a contents entry framework.

This paper first describes the system architecture and then the individual technologies built on it.

Keywords

Messaging architecture, agent architecture, multimedia database architecture, application architecture, agent-based retrieval, concept-based retrieval, hypermedia retrieval, video data retrieval, contents entry, CORBA, WWW, DMA, SQL3.

1 Introduction

The development of the digital library (DL) requires the following technologies:

Contents processing technology
Technology that provides effective creation, storage, and retrieval of primary information and secondary information: including digital conversion from conventional, non-digital media.

Information access technology
Technology that enables efficient access to myriad types of information without time or location limitations.

Human-friendly, intelligent interface
User interface that brings, to diverse users, increased intellectual productivity and an improvement to the active cultural environment.

Interoperability
Technology that makes interoperation possible in heterogeneous environments.

Scalability
Technology that enables DL systems to handle increases in information and users.

Open system development
Development using international and de facto standards, without loss of performance.

Highly flexible system development
Technology that can adjust quickly to new information and related changes to social systems.

After a preliminary study of these technologies, we publicly issued a request for proposal (RFP). The RFP covered development of a system architecture based on the three-layer model, development of individual technologies, and integration technology. The three-layer model separates the system's functions into the Presentation layer, the Function layer, and the Data layer, which brings high flexibility and extensibility. Each technology development was required to use the three-layer model and object-oriented technology.

Through evaluation of the proposals, we selected the following developments: a system architecture that consists of a messaging architecture, an agent architecture, a multimedia database architecture, and an application system architecture; a contents entry framework that incorporates the entry functions as components; advanced contents entry technologies; advanced search and retrieval technologies; and integration technologies that combine the individual technologies into a DL system.

We are now developing these technologies by forming a project that consists of Hitachi, Fujitsu, NEC, IBM Japan, Toshiba, Mitsubishi Electric, Oki Electric Industry, Nihon UNISYS, and Ricoh.

This project is funded by the Ministry of International Trade and Industry (MITI) from fiscal year 1996 through fiscal year 1999.

2 System Architecture

2.1 Architecture and the Three Layer Model

In developing an architectural model, we planned a common architecture that can serve as the basis of all systems in a future information processing society, because DL systems will coexist with EC (electronic commerce) systems, CALS (Commerce At Light Speed) systems, education systems, and so on. The architecture must be flexible and extensible enough to combine multi-vendor applications into one system. As a result, we developed an architecture model based on the three-layer model and the distributed object-oriented model.

The system architecture consists of a basic architecture and an application architecture. The basic architecture has three sub-architectures (the Messaging Architecture, Agent Architecture, and Database Architecture) and a reference model that maps the sub-architectures onto the basic architecture.

2.2 Messaging Architecture

Messaging Architecture provides many services for communicating within and between each layer of the 3-layer model. These services include the synchronous messaging service and the asynchronous messaging service. Our messaging services are based on the CORBA (Common Object Request Broker Architecture) specification from OMG (Object Management Group), and use HTTP (HyperText Transfer Protocol) for communications across the Internet.

Within the Presentation layer, the synchronous messaging service provides an HTTP-based communication service with an optional CORBA-based service. A service within the Function layer, and between the Function and Data layer, provides a CORBA-based communication service. If a CORBA message needs to go through another messaging platform, or if you need interoperability with a system constructed on another messaging platform, a gateway (also called a proxy object) will convert the protocol.

Within the Presentation layer, the asynchronous messaging service provides the mail service if no CORBA-based service is provided within the layer. When a CORBA-based service is provided, the asynchronous service provides an asynchronous extension service of CORBA. The service within the Function layer and between the Function layer and the Data layer also provides an asynchronous extension service of CORBA.

When the synchronous messaging service within the Presentation layer provides a CORBA-based service, a CORBA message can be sent from the Presentation layer to the Function layer. The WWW CORBA gateway in the Function layer makes it possible to send a pseudo CORBA message from the Presentation layer to the Function layer by translating an HTTP message from the Presentation layer into a CORBA message.
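The translation performed by such a gateway can be illustrated with a minimal sketch. It uses no real ORB: the URL shape, the REGISTRY dictionary, and the CatalogService class are all hypothetical stand-ins, and the "CORBA message" is reduced to an ordinary method call on a registered object.

```python
from urllib.parse import urlparse, parse_qs

class CatalogService:
    """Hypothetical Function-layer object (stands in for a CORBA servant)."""
    def search(self, title):
        return f"results for {title}"

# Hypothetical name-to-object table, playing the role of a naming service.
REGISTRY = {"Catalog": CatalogService()}

def gateway(http_url):
    """Translate an HTTP request URL into an operation call on a named object.

    Assumed URL shape: /<object name>/<operation>?<argument>=<value>&...
    """
    parsed = urlparse(http_url)
    object_name, operation = parsed.path.strip("/").split("/")
    # parse_qs returns a list of values per key; keep the first value only.
    args = {key: values[0] for key, values in parse_qs(parsed.query).items()}
    target = REGISTRY[object_name]
    return getattr(target, operation)(**args)

reply = gateway("http://dl.example/Catalog/search?title=Genji")
```

In this sketch the Presentation layer needs only an HTTP client, while the object behind the gateway is addressed by name and operation, which is the essence of the pseudo CORBA message described above.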

Messaging architecture provides the following services as primitive services for DL systems: Event Notification, Lifecycle, Naming, Transaction, and Security. These services provide common functions and common interfaces to all the functional objects which make up a DL system.

2.3 Agent Architecture

Agent architecture was introduced for the following purposes:

To provide interoperability in global distributed environments and flexibility of system configuration.
To enable rapid application development by improving the reusability of software components.
To automate routine work, enable user customization, and lower maintenance costs.

An agent, in our definition, is an autonomous software module that cooperates with other agents through mutual message-based communication. The Agent Architecture defines the concept of a place, a virtual space that provides basic facilities for agent activation. Several places can be created on a network. An agent can move to any other place by using the migration facility in its current place.

A place provides the following services:

Lifecycle Manager, which manages control of status and execution of all the agents in the place.
Agent Communication Bus (ACB), which provides message routing and queuing facilities.
Security Service, which provides user authentication and controls access to any resource in the place.
Directory Service, which provides information about the facilities of each place.
Migration Service, which executes agent migration to another place.

ACB supports synchronous and asynchronous one-to-one communications. Messages are coded in KQML (Knowledge Query and Manipulation Language).
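To give a feel for what a KQML-coded message looks like, the following minimal sketch formats a performative and its parameters as a KQML-style expression. The helper name kqml, the agent names, and the parameter values are illustrative assumptions, not part of the ACB interface.

```python
def kqml(performative, **params):
    """Format a KQML-style message: (performative :key value ...).

    Underscores in Python keyword names become hyphens, so reply_with
    is written as :reply-with.
    """
    fields = " ".join(
        f":{name.replace('_', '-')} {value}" for name, value in params.items()
    )
    return f"({performative} {fields})"

msg = kqml(
    "ask-one",
    sender="reader-agent",        # hypothetical agent names
    receiver="library-agent",
    language="SQL",
    ontology="bibliography",
    reply_with="q1",
    content='"SELECT title FROM books"',
)
```

The ontology and content fields are the ones a routing component can inspect when a message carries no explicit receiver.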

When a message has no address, the Facilitator finds a suitable receiver agent in accordance with the content and ontology specified in the message. The Facilitator also provides a function for multicasting information to interested agents. The main features of the Facilitator are:

Publishing and subscribing
A client agent requests information monitored by the Facilitator. A server agent subsequently notifies the Facilitator of the information corresponding to the request. The Facilitator can then notify the client agent.

Recommending
The Facilitator can find a suitable agent and notify a message sender of the location and identifier of the agent.

Brokering
The Facilitator can find an agent which can provide a requested service, and can then send a message to the agent. A reply message corresponding to the request is also transferred to the original agent.

The Facilitator uses a match-making algorithm to provide these features. Users can customize this algorithm, which increases the flexibility of the message routing function.
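The three features and the customizable match-making can be sketched as follows. This is a minimal illustration under simplifying assumptions: agents are plain callables, matching is done on an ontology or topic string only, and the class and method names are hypothetical rather than the project's interface.

```python
class Facilitator:
    def __init__(self, match=None):
        self.adverts = []        # (agent id, advertised ontology, handler)
        self.subscriptions = []  # (client id, topic, callback)
        # Customizable match-making rule; the default is exact equality.
        self.match = match or (lambda offered, wanted: offered == wanted)

    def advertise(self, agent_id, ontology, handler):
        self.adverts.append((agent_id, ontology, handler))

    def _find(self, ontology):
        for agent_id, offered, handler in self.adverts:
            if self.match(offered, ontology):
                return agent_id, handler
        return None, None

    def recommend(self, ontology):
        """Recommending: return a suitable agent's id; the sender contacts it."""
        return self._find(ontology)[0]

    def broker(self, ontology, content):
        """Brokering: forward the message and relay the reply to the sender."""
        _agent_id, handler = self._find(ontology)
        return handler(content) if handler else None

    def subscribe(self, client_id, topic, callback):
        self.subscriptions.append((client_id, topic, callback))

    def publish(self, topic, info):
        """Publishing: a server agent reports info; notify matching clients."""
        for _client_id, wanted, callback in self.subscriptions:
            if self.match(topic, wanted):
                callback(info)

facilitator = Facilitator()
facilitator.advertise("opac-1", "bibliography", lambda q: f"opac-1 answers {q}")
inbox = []
facilitator.subscribe("reader", "new-arrivals", inbox.append)
facilitator.publish("new-arrivals", "Vol. 3 added")
```

Passing a different match function, for example one that compares prefixes of ontology names, changes the routing behavior without touching the three features themselves.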

2.4 Mobile Agent Facility

The Mobile Agent Facility enables an agent to move from one logical (or physical) node to another logical (or physical) node. The Mobile Agent Facility is built on the Agent Architecture. It was designed to provide system flexibility, reduce network traffic, and improve system convenience.

An agent can move among nodes by using the Migration Service in the Agent Architecture. The Migration Service provides agent mobility by using the Lifecycle Service, Persistent Memory Service, Security Service, and the Directory Service in the Agent Architecture.

The Migration Service is implemented as a CORBA object in the Messaging Architecture and has the following two methods:

migrate method, which transfers the agent to another logical node.
move method, which transfers the agent to another physical node.

The migration models are defined as follows:

One-way model
In the one-way model, an agent can move from an original node to a destination node; however, the destination node must be different from the original node. The original node is the node where the agent was first created.

Synchronous model
In the synchronous model, a child of the original agent is created when the original agent requests migration. The child agent then moves to the destination node to execute its own task, and then moves back to the original node. The child agent then reports the execution results to the original agent and terminates. Until the original agent obtains execution results from the child agent, the original agent is blocked.

Asynchronous model
The asynchronous model is basically the same as the synchronous model; however, in the asynchronous model the original agent is not blocked. After execution of its own task, the child agent moves back to the original node. The child agent then writes the execution results in persistent memory and terminates. The original agent or another agent can access that persistent memory and obtain the execution results.
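The difference between the synchronous and asynchronous models can be illustrated with a minimal sketch. No real migration takes place here: a "node" is just a string, the child agent is a function call or a thread, and PERSISTENT_MEMORY is a dictionary standing in for the Persistent Memory Service. All names are hypothetical.

```python
import threading

PERSISTENT_MEMORY = {}  # stands in for the Persistent Memory Service

def run_child(task, node):
    """Child agent: 'moves' to node, executes the task, returns the result."""
    return task(node)

def migrate_sync(task, node):
    # Synchronous model: the original agent is blocked until the child reports.
    return run_child(task, node)

def migrate_async(task, node, key):
    # Asynchronous model: the original agent continues immediately; the child
    # writes its result to persistent memory and then terminates.
    def child():
        PERSISTENT_MEMORY[key] = run_child(task, node)
    worker = threading.Thread(target=child)
    worker.start()
    return worker

def count_documents(node):
    return f"42 documents on {node}"

sync_result = migrate_sync(count_documents, "node-B")
worker = migrate_async(count_documents, "node-C", key="job-1")
worker.join()  # joined here only to make the example deterministic
async_result = PERSISTENT_MEMORY["job-1"]
```

In the asynchronous case any agent that knows the key can read the result later, which matches the statement that the original agent or another agent can obtain it from persistent memory.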

The Mobile Agent Facility consists of the following layers:

Interface Layer, which supports acceptance of mobile requests from an agent
Agent Control Layer, which supports administration of agent migration
Agent Transfer Layer, which supports the actual movement of an agent.

The Mobile Agent Facility provides a migration plan interface to control the migration plan of an agent. An agent can move along a scheduled path defined by a migration plan object.

2.5 Multimedia Database Architecture

Looking toward future DLs, the MMDB (Multimedia Database system) should be highly extensible so that it can support new user-defined types. This is achieved by plug-in extensions.

This extension enables users to integrate their own data types, which can have methods such as new index searches that can handle new media content. To implement the plug-in extension to the DB, we are developing the following interfaces and plug-ins.

The SQL3 Abstract Data Type (ADT) capability that we are adding to the MMDB provides definition, construction, inheritance, implied observer and mutator functions, encapsulation, polymorphism, and reference. We believe that these features are necessary for managing SGML structured documents, image data, voice data, and so on.

Because conventional SQL procedures lack the necessary functions, it is difficult to implement user-defined indexes (access methods) with high performance and high reliability. A plug-in function solves this problem and lets DBMS users incorporate their own methods, such as an index, into the database kernel.

To implement new indexes, we also provide new entry points to maintain, roll back, and recover an index. Since these entry points are not convenient to define in SQL3 ADT, they are defined using a new IDL (Interface Definition Language) that we developed. The IDL can also define the details of the interfaces used for calls from the database kernel to plug-in modules, and can be used to generate stub modules and C language headers, as in CORBA IDL. To access internal database resources such as records, pages, BLOBs, and database journals without overhead and security risks, we plan to provide low-level access interfaces from the plug-in modules to the database kernel.

The SGML plug-in converts SGML texts into internal tree-structured data using their DTD (Document Type Definition), stores the data in a BLOB, and calls an n-gram plug-in to maintain the index. The new Contains ADT function can execute a full text search using the n-gram plug-in.

Using structural queries, the n-gram Japanese text search index plug-in can quickly search not only plain Japanese text but also SGML text.
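The idea behind an n-gram index for Japanese text, which has no spaces between words, can be shown with a minimal in-memory sketch. It assumes n = 2 (bigrams) and a Python dictionary instead of a database index; the NGramIndex class and its contains method are illustrative names and do not reproduce the plug-in or the Contains ADT function.

```python
from collections import defaultdict

def bigrams(text):
    """Return the set of all overlapping two-character substrings of text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

class NGramIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # bigram -> ids of documents with it
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for gram in bigrams(text):
            self.postings[gram].add(doc_id)

    def contains(self, query):
        """Return sorted ids of documents that contain query as a substring."""
        grams = bigrams(query)
        if not grams:
            # Query shorter than two characters: fall back to a full scan.
            candidates = set(self.docs)
        else:
            candidates = set.intersection(
                *(self.postings.get(gram, set()) for gram in grams)
            )
        # Sharing every bigram is necessary but not sufficient, so verify.
        return sorted(d for d in candidates if query in self.docs[d])

index = NGramIndex()
index.add(1, "電子図書館の検索技術")
index.add(2, "図書館の歴史")
index.add(3, "検索エンジン")
hits = index.contains("図書館")
```

Because the index is built from character sequences rather than words, no morphological analysis is needed at indexing time, which is one reason n-gram indexes suit Japanese full text search.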

To address the scalability problem, our MMDB can be run as shared-nothing parallel database servers.

2.6 Application System Architecture

The application system architecture is intended to provide a unified document management platform for many kinds of DL application programs. The application system architecture is often called DL middleware because it offers a uniform access interface to DL application programs.

DL middleware has facilities for APIs, document registration, document retrieval, document version control, and compound document management.

The functional specifications of the middleware are based on the proposed DMA (Document Management Alliance) model. The major part of our specifications comes from the DMA specifications; however, a structured document management facility and extended document retrieval facilities are added. Through this extension, DL middleware will be able to handle SGML documents.

The DMA specification does not address sophisticated retrieval facilities such as ranking, proximity, Z39.50 queries, or structure-specific retrieval. Also, practical usage will require the merging of ranked retrieval results from heterogeneous information sources. STARTS (Stanford Protocol Proposal for Internet Retrieval and Search) is one solution to these problems.
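Why merging ranked results from heterogeneous sources is hard can be seen in a minimal sketch: each source scores on its own scale, so raw scores cannot be compared directly. The sketch below rescales each source's scores to [0, 1] before merging. This min-max normalization is an assumed stand-in chosen for brevity; it is not the mechanism defined by STARTS or DMA, and the document identifiers and scores are invented.

```python
def normalize(results):
    """Min-max scale one source's (doc id, raw score) list to [0, 1]."""
    scores = [score for _doc, score in results]
    low, high = min(scores), max(scores)
    span = (high - low) or 1.0  # avoid division by zero when all scores tie
    return [(doc, (score - low) / span) for doc, score in results]

def merge(*sources):
    """Merge normalized results; a document keeps its best score."""
    best = {}
    for results in sources:
        for doc, score in normalize(results):
            best[doc] = max(score, best.get(doc, 0.0))
    # Sort by descending score, then by id so that ties are deterministic.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))

source_a = [("doc1", 12.0), ("doc2", 7.0), ("doc3", 2.0)]    # one scoring scale
source_b = [("doc2", 90.0), ("doc4", 50.0), ("doc5", 10.0)]  # another scale
merged = merge(source_a, source_b)
```

A protocol-level solution improves on this by having each source export metadata about its scoring, so that the merger does not have to guess from the result list alone.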

In addition to the basic SGML handling function, these features will be incorporated into the application system architecture.

3 Retrieval and Dissemination

3.1 Intellectual Information Retrieval Agent

As the number of information sources increases, it becomes more and more difficult to know the location of each source and to know which sources are appropriate for each query.

Some additional problems relating to access of information sources are:

Various protocols are used to communicate between information servers and clients: for example, telnet, http, Z39.50.
Different information search systems use different user interfaces.
Various formats are used to display search results.

Intellectual information retrieval agents will solve these problems by using competitive agent technology. A user can use such agents in a network to obtain desired information from heterogeneous information sources, for example OPACs (Online Public Access Catalogs), DLs, museums, art galleries, and WWW directory services. The system is composed of the following three kinds of agents:

Server Agent, which uses a bidding method among library agents to select the library agents that will access the information sources.
Library Agents, which are generated for each library and which bid for the input query by referring to their knowledge database about information sources.
Access Agents, which transform a query into an access script for each information source and access the source, while obeying instructions from the Library Agent.
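The bidding among library agents can be sketched as follows. The bid formula (number of matching specialties divided by one plus the expected response time) is purely an assumption made for illustration, as are the class name, the agent names, and the figures.

```python
class LibraryAgent:
    def __init__(self, name, specialties, expected_seconds):
        self.name = name
        self.specialties = set(specialties)       # knowledge about the source
        self.expected_seconds = expected_seconds  # expected response time

    def bid(self, query_terms):
        """Return a bid; higher is better and 0 means the agent declines."""
        overlap = len(self.specialties & set(query_terms))
        return overlap / (1.0 + self.expected_seconds)

def select_agents(agents, query_terms, limit=2):
    """Server agent: collect bids and keep the best non-zero bidders."""
    bids = [(agent.bid(query_terms), agent.name) for agent in agents]
    winners = sorted((b for b in bids if b[0] > 0), reverse=True)[:limit]
    return [name for _bid, name in winners]

agents = [
    LibraryAgent("art-museum", ["painting", "sculpture"], expected_seconds=1.0),
    LibraryAgent("national-opac", ["history", "painting", "law"], expected_seconds=4.0),
    LibraryAgent("patent-db", ["patent"], expected_seconds=0.5),
]
chosen = select_agents(agents, ["painting", "history"])
```

Here the museum agent bids 0.5, the OPAC agent bids 0.4, and the patent agent declines, so only the first two would go on to instruct their access agents.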

This system will provide the following intellectual features to users:

Uniform interface
A uniform interface absorbs the differences among access methods and provides unified operation on various kinds of information sources. Templates for access command scripts for each information source are held in the system. Because a conditional command sequence can be coded in the script, even an interactive retrieval can be executed autonomously. For an entered query, a template appropriate for the query is selected and keywords in the query are substituted for variables in the template.

Automatic selection of source to be accessed
Appropriate information sources are selected automatically according to the user's query and network conditions. A knowledge database is formed, which includes the specialty, service hours, and expected response time of each source.

Merging and unifying search results
Individual search results from various kinds of servers are modified to a uniform format, merged, and made less redundant.
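Template-based generation of an access script can be sketched with the standard string.Template class. The source names, the script syntax, and the host name below are all hypothetical; real templates would follow each information source's own command language.

```python
from string import Template

# Hypothetical access-script templates, one per information source.
TEMPLATES = {
    "opac-telnet": Template("open $host\nfind title $keyword\ndisplay 1-10\nquit"),
    "web-search": Template("GET /search?q=$keyword HTTP/1.0\nHost: $host"),
}

def build_script(source, host, keyword):
    """Select the template for a source and substitute the query keyword."""
    return TEMPLATES[source].substitute(host=host, keyword=keyword)

script = build_script("opac-telnet", "opac.example.org", "ukiyo-e")
```

The same keyword yields a different script for each source, which is how a single user query can be issued through a uniform interface.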

3.2 Concept-Based Text Retrieval

To help end users efficiently retrieve documents relevant to their information needs, this system provides a thesaurus generator, a thesaurus browser, and a document clustering facility. The system consists of an offline subsystem, which generates thesauruses, and an online retrieval subsystem.

A thesaurus, which plays an essential role in coping with the vocabulary problem in text retrieval, is automatically generated from a text corpus. This automation will drastically reduce the cost of creating and maintaining a corpus-specific thesaurus.

The thesaurus browser is used as the front end for a text retrieval engine. It helps users navigate in the concept space of the subject field of the corpus. Therefore, they can easily articulate their information needs, and choose appropriate terms for retrieval.

The document clustering module post-processes the results of text retrieval. It extracts clusters of similar documents from the set of retrieved documents, and shows a digest of each cluster to users. The users can thereby efficiently judge the relevance of the retrieved documents.

Automatic thesaurus generation is the most essential issue of this research. We are integrating technologies for the following: extracting terms (including compound words), performing constituent analysis of compound words, acquiring co-occurring data, and analyzing term correlations.

In order to extract terms precisely from a corpus, statistical processing is combined with morphological analysis. The co-occurrence of data is the most important clue for extracting relations between terms. We extract several kinds of co-occurring data including co-occurrence in sentences, co-occurrence in windows, and syntactic co-occurrence. Correlations between terms are analyzed based on the co-occurrence of data. We calculate not only first order correlations (like mutual information) but also second order correlations (like contextual similarity). Thus we can extract various types of relations between terms: for example, synonym relations, broader term-narrower term relations, and predicate-argument relations.
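The distinction between first order and second order correlations can be made concrete with a minimal sketch on a toy corpus. It assumes sentence-level co-occurrence only, uses pointwise mutual information as the first order measure and the cosine of co-occurrence vectors as the second order measure, and skips term extraction and morphological analysis entirely. The corpus and the function names are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations

sentences = [
    ["digital", "library", "retrieval"],
    ["digital", "archive", "retrieval"],
    ["library", "catalog"],
    ["archive", "catalog"],
]

# Number of sentences containing each term and each unordered term pair.
term_count = Counter(t for s in sentences for t in set(s))
pair_count = Counter(
    frozenset(p) for s in sentences for p in combinations(sorted(set(s)), 2)
)
n = len(sentences)

def pmi(a, b):
    """First order correlation: pointwise mutual information of a and b."""
    joint = pair_count[frozenset((a, b))]
    if joint == 0:
        return float("-inf")
    return math.log2(joint * n / (term_count[a] * term_count[b]))

def context_vector(term):
    """Co-occurrence counts of term with every other term in the corpus."""
    return {o: pair_count[frozenset((term, o))] for o in term_count if o != term}

def contextual_similarity(a, b):
    """Second order correlation: cosine of the two co-occurrence vectors."""
    va, vb = context_vector(a), context_vector(b)
    dot = sum(va[t] * vb.get(t, 0) for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

first_order = pmi("library", "archive")
second_order = contextual_similarity("library", "archive")
```

In this toy corpus "library" and "archive" never appear in the same sentence, so their first order correlation is minus infinity, yet they share all of their contexts and their second order similarity is close to 1. This is the kind of evidence that can suggest a synonym-like relation.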

Having developed a prototype thesaurus generator and browser, we are evaluating it with a large newspaper corpus. Both the quality of the thesauruses and the computing efficiency are to be improved. The document clustering module has just been designed.

3.3 Hypermedia Retrieval by 3D Visualization

VR (virtual reality) interfaces using walk-through have been developed recently. However, such interfaces have some problems. For example, it takes time to make 3D models for a VR space, and visualization is very slow if complicated polygon models are used. To improve speed and operation, a new type of 3D walk-through technology that uses 2D data such as video or images is being developed.

Experience with 3D hypermedia technologies, which relate media to one another in a 3D VR space constructed from CG, video, images, and text, shows that a user can use visual information to search documents effectively. On the other hand, it has been reported that it is hard to select a target exactly when complicated 3D models overlap each other. We are developing a technology that makes it possible to define the relevance of meaning for anchoring and linking in 3D VR space. If the user roughly indicates the relevant area, the system can analyze the user's search goals, the relevance of meaning, and the relevance among links in 3D space, and then access the information most relevant to the user's goals.

We are also developing a technology that enables a user to make any 3D shape anywhere in the 3D VR space. Even persons who do not know how to construct a simple 3D model will be able to make such a shape. This technology allows a 3D anchor to be constructed not only for a 3D object itself, but also for a 3D area that contains several overlapping objects or an area that has no relation to any 3D object. The user will be able to search for information easily by using these flexible 3D anchors.

3.4 Content-Based Retrieval System for Video

Media data such as images, video, and audio do not contain information that can be used directly for retrieval. Someone usually has to add keywords to aid retrieval of such data. It is time consuming and labor intensive to add keywords to a huge amount of data. We need a content-based retrieval system that automatically extracts information about features from the raw data.

Many conventional content-based retrieval methods such as QBIC and Jacob use features of the overall image or frame. The information about features used for retrieval is a mixture of object features and background features. Thus, with conventional methods, it is difficult to retrieve video shots that contain desired objects because the background affects the information about features.

Our content-based retrieval method for video data uses the features of each object in the video (MPEG-2). We focused our attention on moving objects in the video to simplify the object region detection. Our method uses colors, color location, and the motion direction of the object as the information about features. The information is automatically extracted from the raw data.

The information about features is calculated as follows:

From MPEG video data, extract the motion vectors and DC coefficients of color data (the average color of the macroblock) for each macroblock.
Detect the cut points in the video data.
Detect camera work, such as tilt and zoom, and compensate for it.
Detect moving object regions.
Extract the color distribution.

We have implemented two methods for detecting the moving object regions: these methods require relatively less computing time. One method uses motion vectors (which are assumed to be approximations of the optical flow) in MPEG video data, and the other uses color changes in each macroblock.
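The motion-vector approach can be illustrated with a minimal sketch. It assumes that the macroblock motion vectors have already been decoded into a grid, approximates camera motion by the median vector over the frame, and marks a macroblock as moving when its residual motion exceeds a threshold. The median-based compensation and the threshold value are assumptions made for this example and are not taken from our implementation.

```python
from statistics import median

def moving_blocks(vectors, threshold=2.0):
    """Mark macroblocks whose motion differs from the dominant frame motion.

    vectors: 2D grid (list of rows) of (dx, dy) motion vectors.
    Returns a grid of booleans with the same shape.
    """
    flat = [v for row in vectors for v in row]
    # Assumed camera-work compensation: subtract the median motion vector.
    cam_dx = median(dx for dx, _dy in flat)
    cam_dy = median(dy for _dx, dy in flat)
    return [
        [(dx - cam_dx) ** 2 + (dy - cam_dy) ** 2 > threshold ** 2 for dx, dy in row]
        for row in vectors
    ]

# A slow pan to the right (1, 0) with one object moving faster (6, 3).
grid = [
    [(1, 0), (1, 0), (1, 0), (1, 0)],
    [(1, 0), (6, 3), (6, 3), (1, 0)],
    [(1, 0), (1, 0), (1, 0), (1, 0)],
]
mask = moving_blocks(grid)
```

Working on motion vectors that are already present in the compressed stream avoids decoding full frames, which is consistent with the low computing time noted above.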

The moving object region consists of some regions that have a unique color. The object region is segmented and the values of the representative color, area, and the location of the centroid are used as the information about the object features.

The user submits a query through a GUI. The user selects a color from the color palette, and then places colored rectangular boxes on a color pattern input subwindow. The motion direction input subwindow is used to specify the direction of the object. The search program uses the color values, locations, and the motion direction to calculate the similarity between the specified color pattern and the data in the database. Video data with higher similarity is displayed as the result.
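One way such a similarity could be computed is sketched below. The feature layout (an RGB color, a centroid in normalized frame coordinates, and a motion angle in degrees), the equal weighting of the three terms, and the sample data are all assumptions made for illustration; the sketch does not reproduce the actual similarity measure. It relies on math.dist, which requires Python 3.8 or later.

```python
import math

MAX_COLOR = math.dist((0, 0, 0), (255, 255, 255))  # largest possible RGB distance
MAX_PLACE = math.sqrt(2.0)                          # diagonal of the unit frame

def similarity(query, region):
    """Score in [0, 1] combining color, location, and motion direction."""
    color = 1.0 - math.dist(query["color"], region["color"]) / MAX_COLOR
    place = 1.0 - math.dist(query["center"], region["center"]) / MAX_PLACE
    turn = math.radians(query["angle"] - region["angle"])
    direction = (1.0 + math.cos(turn)) / 2.0  # 1 for same direction, 0 for opposite
    return (color + place + direction) / 3.0  # equal weights are an assumption

# Hypothetical object features stored for two video shots.
database = {
    "shot-12": {"color": (240, 20, 10), "center": (0.4, 0.5), "angle": 10},
    "shot-40": {"color": (0, 0, 255), "center": (0.5, 0.5), "angle": 180},
}
# Query: a red object near the center of the frame, moving to the right.
query = {"color": (255, 0, 0), "center": (0.5, 0.5), "angle": 0}
ranked = sorted(database, key=lambda shot: similarity(query, database[shot]), reverse=True)
```

Because the features describe a segmented object region rather than the whole frame, the background does not enter the score.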

3.5 Selective Information Distribution Agent

The aim of this system is to enable content-based information distribution (dissemination) to persons and groups whom the author does not know. The system enables information to be delivered only to those people who are interested in it.

This system is based on social information recommendation technology. In social information recommendation, we recommend information that other, similar users have liked. The social approach has the merit that no analysis of the information is needed. Its drawback is that new information that nobody has rated cannot be recommended. To avoid this drawback, we devised a new method that combines content-based recommendation with the social recommendation approach.
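A simple way to combine the two approaches is sketched below: use the social score when similar users have rated the item, and fall back to a content-based score when nobody has. This fallback rule is an assumption chosen for brevity and is not necessarily how our method combines them; the user names, ratings, similarity values, and term sets are invented.

```python
def social_score(user, item, ratings, similarity):
    """Similarity-weighted mean of other users' ratings, or None if unrated."""
    votes = [
        (similarity(user, other), rated[item])
        for other, rated in ratings.items()
        if other != user and item in rated
    ]
    weight = sum(w for w, _rating in votes)
    if weight == 0:
        return None
    return sum(w * rating for w, rating in votes) / weight

def content_score(profile, terms):
    """Fraction of the item's terms found in the user's interest profile."""
    return len(profile & terms) / len(terms)

def recommend_score(user, item, ratings, similarity, profiles, item_terms):
    social = social_score(user, item, ratings, similarity)
    if social is not None:
        return social
    return content_score(profiles[user], item_terms[item])  # unrated item

ratings = {"ann": {"paper-1": 1.0}, "bob": {"paper-1": 0.0}, "cy": {}}
SIMILARITY = {("cy", "ann"): 0.75, ("cy", "bob"): 0.25}  # hypothetical values
profiles = {"cy": {"agents", "retrieval"}}
item_terms = {"paper-1": {"agents"}, "paper-2": {"agents", "corba"}}

def sim(a, b):
    return SIMILARITY.get((a, b), 0.0)

rated = recommend_score("cy", "paper-1", ratings, sim, profiles, item_terms)
fresh = recommend_score("cy", "paper-2", ratings, sim, profiles, item_terms)
```

For the rated item the score comes entirely from similar users (0.75 here), while the newly released item, which nobody has rated, still receives a score (0.5) from its content.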

This system includes two kinds of agents: distribution agents and information server agents. Distribution agents can select appropriate information servers and registers based on the content of the information that an author released. An information server agent can select readers who are likely to be interested in the information. The distribution agent can automatically adapt to changes in users' interests.

This system has the following merits:

Authors do not need to search for appropriate readers to distribute their contents.
Readers can receive only useful information.
Network traffic is reduced by selective distribution.

4 Contents Entry

One of the most important services expected of a DL is to capture and store the contents of existing material (for example, books, magazines, and journals), in addition to storing digitally created contents. Without such capture and storage, the user cannot retrieve or reference the information. However, even if OCR (ICR) technology is fully utilized for capturing contents, manual intervention is still required, for example for scanner control and for verifying and correcting recognition results. Entering contents therefore becomes labor-intensive, time-consuming, and costly, and a smooth and quick migration from a huge volume of existing documents to digitized contents is very difficult.

A document can be represented as a set of various types of data whose structural level might vary as digitization progresses. The most fundamental level would be a combination of bibliographic information and information about the physical location of a document, perhaps with page images scanned in by a scanner. As the table of contents, the abstract, the full text, and images for figures and pictures are added incrementally, the contents of the digital document become richer and richer. The final structural level will be decided document by document.

Retroactive contents entry will have to be made for a great variety of documents that are printed with too many different layout styles and too many font styles. Some of these documents might be written in old languages or might be printed with old characters such as old Japanese characters (Kanji). In addition, fields of digitization interest might vary from literature and history, to science and technology, or even to law and legislation.

To improve the performance of such retroactive contents entry, it is necessary to cope with many different layouts and many font styles, recognize both current and old Japanese characters, maintain many post-processing dictionaries, etc. Also, although the basic functions are common, the user interface might have to be tuned on a case-by-case basis, depending on the types of target documents or the operational environment.

To satisfy the above requirements, we must have a framework that can integrate various component technologies as open, highly flexible parts. Therefore, a new OCR (ICR) framework has been designed to achieve prioritized, step-by-step content entry.

The framework will provide the flexibility to combine various processing objects depending on needs, to define and adopt standardized common protocols that are used to record the exchange of interim data among processing objects, and so on. The framework research is based on the design of an OCR (ICR) system which implements hierarchical object management and common message protocols among processing objects. Our research is intended to define common specifications such as for inter-object messages in the Function layer, the interim data models in the Data layer, the verification and correction interface in the Presentation layer, and batch script definition. The specifications are to be implemented in the DL system architecture. The final goal is to establish a highly-flexible framework called the "Contents Entry Framework" which will cover all tasks concerning the contents entry.
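The idea of combining processing objects through a common interim data model and a batch script can be sketched as follows. The three processing objects, the dictionary used as interim data, and the script format (a list of step names) are hypothetical simplifications; real processing objects would perform scanning, layout analysis, and character recognition.

```python
def scan(data):
    data["image"] = f"image-of-{data['source']}"
    return data

def layout_analysis(data):
    data["blocks"] = ["title", "body"]
    return data

def recognize(data):
    data["text"] = {block: f"<text of {block}>" for block in data["blocks"]}
    return data

# Hypothetical registry of processing objects, addressed by name.
PROCESSORS = {"scan": scan, "layout": layout_analysis, "ocr": recognize}

def run_batch(script, source):
    """Run the named processing objects in order over shared interim data."""
    data = {"source": source, "history": []}
    for step in script:
        data = PROCESSORS[step](data)
        data["history"].append(step)  # record which steps have been applied
    return data

shallow = run_batch(["scan"], "book-001/page-1")
full = run_batch(["scan", "layout", "ocr"], "book-001/page-1")
```

Because every processing object reads and writes the same interim data, a document can first be entered at a shallow level (page image only) and later be deepened with additional steps, and a component can be replaced without changing the others.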

5 Summary

This paper described the technologies that we are developing in the Next Generation Digital Library Project.

When we started the project, our policy was to adopt international standards and de facto standards as far as possible, and to extend the standards where their functions are insufficient.

Following this policy, we have added the CORBA/WWW gateway function to the CORBA standard and extended DMA and SQL to handle SGML documents.

In fields where no standards exist, we are planning to present our technologies to the standards bodies. For example, we have already submitted our agent architecture to FIPA (the Foundation for Intelligent Physical Agents).

Our technology development does not cover all the areas that future digital libraries need. We expect the missing areas to be covered by other projects. For example, we could use the payment mechanism that is being developed in the Electronic Commerce project.

The integration technology that we are developing will make such combinations possible.

We are also putting significant research effort into topics not described here: for example, property right protection mechanisms, SGML automatic conversion methods, filtering systems, component integration technology, and so on. A subsequent paper will cover these activities.

Acknowledgments

The author wishes to acknowledge the following persons for their contributions to writing this paper: Tetsuya Hashimoto, Shunichi Torii, Kanji Kato of Hitachi, Tatsuro Kitagawa of Nihon UNISYS, Shunji Ichiyama of NEC Kansai C & C Laboratory, Hiroyuki Kaji of Hitachi Central Research Laboratory, Satoshi Tanaka of Mitsubishi Electric, Koki Kato of Fujitsu Laboratories, Yoshi Kato of IBM Tokyo Research Laboratory.

References

Amano, T. et al.: DRS: A Workstation-Based Document Recognition System for Text Entry, IEEE Computer, Vol. 25, No. 7, pp. 67-71, 1992.
Z39.50 Maintenance Agency: ANSI/NISO Z39.50-1995(version 3), Information Retrieval Application Service Definition and Protocol Specification, 1995, http://lcweb.loc.gov/z3950/agency/1995doce.html.
Balabanovic, Marko et al.: Learning Information Retrieval Agents: Experiments with Automated Web Browsing, Proceedings of the AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Resources, March 1995.
Balabanovic, Marko: An Adaptive Web Page Recommendation Service, Proceedings of the First International Conference on Autonomous Agents, February, 1997.
Baldonado, M. et al.: Addressing heterogeneity in the networked information environment, The New Review of Information Networking, Vol. 2, pp. 83-102, 1996.
Chamberlin, D.: Using the New DB2: IBM's Object-Relational Database System, Morgan Kaufmann Publishers, 1996.
Document Management Alliance Technical Committee: Proposed DMA 0.9 Trial Use Specification, 1997, http://www.aiim.org/dma/vote1.0/.
Flickner, M. et al.: Query by Image and Video Content: The QBIC System, IEEE Computer, Sep., 1995.
Goldfarb, C.: The SGML Handbook, Oxford University Press, 1990.
Gravano, L. et al.: STARTS (Stanford Protocol Proposal for Internet Retrieval and Search), 1997, http://www-db.stanford.edu/~gravano/starts.html.
Jain, R. ed.: NSF Workshop on Visual Information Management Systems, SIGMOD Record, Vol. 22, No. 3, pp. 57-75, 1993.
Labrou, Yannis et al.: A Proposal for a new KQML Specification, TR CS-97-03, February 1997, Computer Science and Electrical Engineering Department, University of Maryland Baltimore County, Baltimore, MD 21250.
Melton, Jim. ed.: 1995 ANSI SQL3 papers SC21 N9463 through SC21 N9467, ANSI SC21.
Narasimhalu, A. D.: Multimedia databases, Multimedia Systems, Vol. 4, pp. 226-249, 1996.
Object Management Group: CORBA and related materials, http://www.omg.org/.
Rao, R. et al.: System components for embedded information retrieval from multiple disparate information sources, Proc. of ACM Symposium on User Interface Software and Technology, Nov., 1993.
Stonebraker, M.: Object-Relational DBMSs : The Next Wave, Morgan Kaufmann Publishers, 1996.
Tsuchida, T. et al.: HyperFrame: A hypermedia framework for integration of engineering applications, ACM SIGDOC 93, Oct., 1993.
Uchiyama, T. et al.: Color Image Segmentation Using Competitive Learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 16, No. 12, pp. 1197-1206, 1994.
Wellman, M. et al.: The Digital Library as Community of Information Agents, IEEE Expert, June, 1996.
Yamashita, T. et al.: A document recognition system and its applications, IBM Journal of Research and Development, Vol. 40, No. 3, pp. 341-352, 1996.