Roberta Y. Rand
USDA Global Change Data and Information Management, Coordinator
National Agricultural Library
Beltsville, MD 20705
D-Lib Magazine, August 1995
The United States federal agencies of the Global Change Data Management Working Group (GCDMWG) are implementing several pilot projects intended to broaden the scope of the Global Change Data and Information System (GCDIS) (http://www.gcdis.usgcrp.gov/). One of these is the Global Change Assisted Search for Knowledge (GC-ASK) prototype system (http://ask.gcdis.usgcrp.gov:8080/). GC-ASK is the second in a series of incremental developments to provide the global change community with a comprehensive set of tools to access and exploit diverse, distributed data systems over the Internet. It will enable users with different skills, needs, and access methods to obtain relevant global change information from these data systems by using natural language queries and a common user interface.
The United States Global Change Research Program (USGCRP) was established to observe and record what is happening to the Earth's environment; to understand why changes are occurring; to improve predictions of what will happen; to understand the consequences of change; and to develop the capabilities for assessing change. In the Global Change Data and Information Management Program Plan, published in 1992, the participating U.S. federal agencies:
The Global Change Data and Information System Implementation Plan builds on the Program Plan to define the implementation of a Global Change Data and Information System (GCDIS). The Implementation Plan states that for a better understanding of the environment, the U.S. federal agencies will:
The GC-ASK prototype system is a one-year project (November 1994 - November 1995) that will demonstrate the GCDIS mission by using currently available, off-the-shelf client-server software that adheres to common standards. The pilot project will demonstrate a symbiotic linking of existing heterogeneous data and information resources in a distributed, evolutionary environment. It will accommodate the individual needs of all users by providing access to these services at multiple skill levels and via multiple paths, so that users can find the information they want and need. The natural language principles that underlie the software permit the text-retrieval engine to retrieve not only documents that match the precise terms used in the query, but also documents that contain terms and concepts related to those used in the query.
Issues. A fairly complex system architecture is required for the GC-ASK system to handle the wide variety of issues associated with a virtual system approach. For example, the data and information provided by the many participating agencies are diverse in content and format; the audiences needing access to these data and information vary widely in their ability to effectively utilize search tools and to access data and information.
The GC-ASK system will contain three modules: client module, assisted search module, and data collection module.
The client or user portion of the system is cross-platform and will consist of a personal computer (PC) or workstation, running under a graphical user interface (GUI) that allows users of all levels (researchers, educators, policy makers, and K-12 students) easy and accurate access to global change data and information. These users will be able to access a GUI using a natural language query and will be able to:
Users will have several search options. They will be able to search local data (present on their PC or workstation) as well as remote databases connected via dial-up or the Internet. Expert searchers who know where the data they want resides can query the appropriate server directly, without assistance. Finally, the GUIs will allow access to general reference information available on a local CD-ROM.
A Query Scheduler in the client portion of the system will enable searches to be performed in real time or on a delayed basis and routed to the user. User interfaces will be made available to all clients (in order to accommodate all Internet users) and will offer consistent functionality across platforms. Additionally, a character-based interface is available in the assisted search module for use over the Internet with any VT100-style terminal or terminal emulator.
The Assisted Search Module (ASK) will be made up of multiple components. The ASK server or servers will be the main system (or systems) for assisting and executing searches and collecting results. Multiple assisted search servers may be required based on user load, throughput considerations, network size and number of data systems. Assisted search systems will handle a hierarchically-structured guide with multiple points of access for identifying relevant data and information sets and including inventory information and library and information center resources. These modules will be dedicated to the assisted search process and will process any user query in the assisted mode. The multiple components of the assisted search module are discussed below.
The NL Search Engine. The NL Search Engine is the heart of the system. To leverage the latest developments in computer technology, the NL Search Engine combines the best of both statistical and concept-based searching. Statistical searching measures criteria such as the number of occurrences of search terms or the proximity of those terms to one another within a document.
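As an illustration of the statistical criteria just described, the following is a minimal sketch, not the scoring used by the NL Search Engine. The function names and the scoring rule (one point per occurrence of a query term, plus a bonus of 1/gap for the closest pair of distinct query terms) are assumptions made for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def statistical_score(query, document):
    """Score a document by term frequency plus a proximity bonus.

    Hypothetical rule, for illustration only: one point per occurrence
    of a query term, plus 1/gap for the closest pair of distinct query
    terms found in the document.
    """
    terms = set(tokenize(query))
    tokens = tokenize(document)
    # Positions of every query-term occurrence, in document order.
    positions = [(i, t) for i, t in enumerate(tokens) if t in terms]
    frequency = len(positions)
    best_gap = None
    # The closest pair of distinct terms is always adjacent in this list.
    for (i, a), (j, b) in zip(positions, positions[1:]):
        if a != b and (best_gap is None or j - i < best_gap):
            best_gap = j - i
    proximity = 1.0 / best_gap if best_gap else 0.0
    return frequency + proximity
```

Under this rule, a document in which "sea" and "level" are adjacent scores higher than one that contains both words several tokens apart, and a document containing neither scores zero.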
Concept-Based Searching. Concept-based searching enhances this process by adding user knowledge and expertise; users can choose concept definitions from a knowledge base, improving accuracy through additional contextual evidence. Users will be able to submit search requests in the form of Boolean queries, plain English queries, or geographical parameters. The search engine will be able to enhance the search request with related terms and concepts, which may be refined by the user prior to search execution.
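The expansion and refinement steps might look roughly like the sketch below. The knowledge base contents, the function name, and the `rejected` parameter are hypothetical; a real knowledge base would hold dictionaries and thesauri rather than a two-entry table.

```python
# Hypothetical knowledge base: each concept maps to related terms.
KNOWLEDGE_BASE = {
    "drought": ["aridity", "water shortage"],
    "warming": ["temperature rise", "greenhouse effect"],
}

def expand_query(query, knowledge_base=KNOWLEDGE_BASE, rejected=()):
    """Return the query terms followed by related terms from the knowledge base.

    `rejected` holds related terms the user has chosen to drop before the
    search runs, standing in for the user refinement step.
    """
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + knowledge_base.get(term, []):
            if candidate not in rejected and candidate not in expanded:
                expanded.append(candidate)
    return expanded
```

A query for "drought" would be widened to include "aridity" and "water shortage" unless the user removes one of them first.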
Extendable Knowledge Bases. Although the ASK search engine will include built-in knowledge bases consisting of dictionaries/thesauri, it will also allow at the local level for extensions, with the capability of easy modification and loading of new terms. The diversity of vocabularies among data sources also makes it mandatory that users be able to identify and utilize terminology unique to a particular discipline or agency, thus enabling concept-based searching from a particular user perspective.
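One way to picture a local extension is as a merge of agency-specific terms into a copy of the built-in knowledge base. The sketch below assumes a simple concept-to-terms mapping; the function name and data layout are illustrative and are not taken from the GC-ASK design.

```python
def extend_knowledge_base(built_in, local_terms):
    """Return a new knowledge base with local terms merged into the built-in one.

    The built-in mapping is copied, never modified, so each site can keep
    its own extensions without affecting the shared vocabulary.
    """
    merged = {concept: list(terms) for concept, terms in built_in.items()}
    for concept, terms in local_terms.items():
        related = merged.setdefault(concept, [])
        for term in terms:
            if term not in related:
                related.append(term)
    return merged
```

An agricultural site could, for example, add "crop failure" under "drought" and introduce a new "rangeland" concept while leaving the built-in entries intact.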
Internally Consistent Data Model. This function provides a consistent data model for displaying information across disparate documents from multiple data systems. The normalization function will be designed to take input from a specified source, parse that source and, based on a set of rules, map the source to an HTML representation. The normalization function will then concatenate the HTML representations into a single HTML instance for the GUI to display.
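The parse, map, and concatenate steps can be sketched as follows. The source names, field names, and rule table are hypothetical, and the records are assumed to be already parsed into dictionaries; the actual rules and HTML layout are not specified in this article.

```python
from html import escape

# Hypothetical per-source rules: map each source's field names to display labels.
RULES = {
    "agency_a": {"ttl": "Title", "abs": "Summary"},
    "agency_b": {"name": "Title", "description": "Summary"},
}

def normalize(source, record):
    """Map one parsed source record to an HTML fragment using that source's rules."""
    rows = []
    for field, label in RULES[source].items():
        if field in record:
            rows.append(f"<dt>{label}</dt><dd>{escape(record[field])}</dd>")
    return "<dl>" + "".join(rows) + "</dl>"

def build_page(records):
    """Concatenate normalized fragments into a single HTML instance."""
    body = "".join(normalize(source, record) for source, record in records)
    return f"<html><body>{body}</body></html>"
```

Because both sources map their own field names to the same labels, records from either one are displayed with identical "Title" and "Summary" headings.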
Smart Query Scheduler and Smart Query Servers. The nature of the distributed environment to be handled by the assisted search module requires that several processes be controlled by this module. Because clients can query the data systems in real time or on a delayed basis and can query over multiple data systems in multiple locations simultaneously, queries must be scheduled and routed for maximum efficiency and speed. This requires a client/server architecture in which several processes (transparent to users) are controlled by the assisted search module and its interfaces to other remote servers. One or more Smart Query Servers will exist for each data system to be searched; the server will actually execute the search (through an interface to the search engine) and return the results to a Client Handler. When a client (user) makes a search request, the Client Handler first requests an available Query Server from the Smart Query Scheduler. The Scheduler will wait (if necessary) for an available Query Server and then allocate the server to the Client Handler.
Client Handler. The Client Handler will connect to the Query Server and initiate the search over the requested data system. When the search is complete, the Query Server will notify the Scheduler that it is available for additional searches. Each Query Server must be able to handle multiple simultaneous requests. The Smart Query Scheduler must also monitor all Query Servers so that no one server is overutilized.
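The allocate, search, and release cycle between the Client Handler, the Smart Query Scheduler, and the Query Servers can be sketched with a blocking queue. This is a simplified illustration under stated assumptions: the class and method names are invented, each Query Server is modeled as a plain callable that serves one search at a time (the design above calls for multiple simultaneous requests per server), and load balancing is reduced to first-in, first-out rotation.

```python
import queue

class SmartQueryScheduler:
    """Hands out idle Query Servers and takes them back when a search ends."""

    def __init__(self, servers):
        self._idle = queue.Queue()
        for server in servers:
            self._idle.put(server)

    def allocate(self, timeout=None):
        # Blocks until a Query Server is idle; raises queue.Empty on timeout.
        return self._idle.get(timeout=timeout)

    def release(self, server):
        # Called once the Query Server's search is complete.
        self._idle.put(server)

class ClientHandler:
    """Requests a Query Server, runs the search, then frees the server."""

    def __init__(self, scheduler):
        self.scheduler = scheduler

    def search(self, query):
        server = self.scheduler.allocate()
        try:
            return server(query)
        finally:
            # Release even if the search fails, so the server is not lost.
            self.scheduler.release(server)
```

Because released servers rejoin the back of the queue, consecutive searches rotate across the available servers rather than repeatedly using the same one.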
Expanded Query Generator and Results Synthesizer. The assisted search module will enable users to query on data and information cataloged using various search engines and their associated loading and formatting mechanisms. To handle the multiplicity of formats and still take advantage of the built-in knowledge base and relevancy ranking provided by the NL search engine, an Expanded Query Generator will translate a user query into one or more queries recognized by the data systems being searched. After the data systems are searched and results are returned, a Results Synthesizer will combine the returned documents into a single, ranked list for the user. The Synthesizer will allow the NL Search Engine to apply relevancy ranking uniformly, even though the returned documents may have been retrieved from diverse data systems and engines. As part of this component, interfaces to a variety of search and storage standards (such as WAIS, World Wide Web, and others) will be provided.
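A rough sketch of the two components follows. The system names and query syntaxes are hypothetical stand-ins rather than the real query languages of WAIS or any other data system, and the `score` argument stands in for the NL Search Engine's relevancy ranking.

```python
def generate_queries(terms):
    """Translate one list of terms into per-system query strings.

    The target syntaxes here are hypothetical, chosen only to show that
    each data system may need the same request in a different form.
    """
    return {
        "boolean_system": " OR ".join(f'"{t}"' for t in terms),
        "keyword_system": " ".join(terms),
    }

def synthesize(results_by_system, score):
    """Merge per-system results into one list ranked by a single scorer."""
    merged = {}
    for system, documents in results_by_system.items():
        for doc_id, text in documents:
            # Keep the first copy of a document returned by several systems.
            merged.setdefault(doc_id, (system, text))
    ranked = sorted(merged.items(), key=lambda item: score(item[1][1]), reverse=True)
    return [(doc_id, system) for doc_id, (system, _text) in ranked]
```

Applying one scoring function to every returned document is what lets the ranking stay uniform, regardless of which system or engine retrieved the document.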
Assisted Search Metadata. A future addition is metadata (or data about data) that will be maintained on the assisted search module. This MetaGuide will consist of information such as participating Internet addresses and IDs, server names and locations, search engines, available knowledge bases, user profile and routing data and the subject matter contained in various databases. The MetaGuide directory will serve as a global navigation tool and will be provided in the form of a main directory, as well as in an alphabetical index. Data for the MetaGuide directory will be gathered from all participating providers of global change data and information.
GIS Option. A geographical information system (GIS) is required to facilitate searching for documents that meet geographic parameters. As demonstrated in a previous GCDIS pilot project, the semantic network structure can already identify hierarchical references (such as "a part of") to political boundaries (as in Arizona is part of the United States, which is part of North America) as well as specific latitude and longitude references. An existing GIS will be available in the GC-ASK module.
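The "part of" relationship can be pictured as a chain of child-to-parent links. The sketch below uses only the Arizona example from the text; the table layout and function names are assumptions, and the chain is assumed to contain no cycles.

```python
# Hypothetical fragment of a "part of" network: child region -> parent region.
PART_OF = {
    "Arizona": "United States",
    "United States": "North America",
}

def enclosing_regions(place, part_of=PART_OF):
    """Return the place followed by every region that contains it."""
    chain = [place]
    # Follow parent links until a region with no recorded parent is reached.
    while chain[-1] in part_of:
        chain.append(part_of[chain[-1]])
    return chain

def matches_region(document_place, query_region, part_of=PART_OF):
    """True if a document about `document_place` falls within `query_region`."""
    return query_region in enclosing_regions(document_place, part_of)
```

With this structure, a search restricted to North America would match a document that mentions only Arizona, while a search restricted to Arizona would not match a document about the United States as a whole.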
Help Desk. Once the final deliverable has been delivered, a Help Desk will be available to assist users who have difficulty accessing the system, locating Internet address information or databases to be searched, formulating queries, etc. Users can reach the Help Desk via electronic messages over the Internet, via telephone, etc.
Data Collection Module. Multiple government agencies and industry participants will be providing access to global change data and information. This data and information will typically reside on a local server at the participating agency and will be accessible via the assisted search module (or directly by expert users). Individual providers will determine the format of their data and information and how it will be maintained and updated.
The Project Plan for the GC-ASK prototype system consists of a four-phased design and implementation, with maintenance scheduled from November 1995 to November 1996. The primary objective of each phase is described below:
Design Phase - Identify the level of effort required to build the GC-ASK prototype system. Specific major objectives include evaluating off-the-shelf vs. customized interfaces, researching optimum system configurations, developing a design specification, and modifying the proposed implementation plan as necessary to reflect that specification.
Implementation Phase - Build and test an operational version of the GC-ASK system, which can be used as a prototype by users to search specific sources.
Maintenance Phase - Provide for the staffing of the Help Desk and enable the expansion of the system to include more users and data sources.
An essential element of the program is maximum involvement of users throughout the development process. To this end, a Users Group was formed early in the program, which consists of experts representing each of the user classes. The Users Group evaluates the user aspects of each prototype, reports on their findings and results, and provides direct input to the development process.
Prototype four will be followed by comprehensive performance testing and evaluation. The architecture is designed to enable the project to continue beyond the current prototype development. Participating government agencies and private organizations can license the GC-ASK server technology and implement their own information systems, which then become connected to the GC-ASK network.
The industry team implementing GC-ASK includes ConQuest Software, Inc., the prime contractor and implementor of its dictionary-based natural language search engine (ConQuest has recently merged with Excalibur Technologies Corp.); E-Systems, Inc., Garland Division of Dallas, TX, as system integrator and provider of imagery and GIS technology; Infrastructures for Information, Toronto, Canada, for its flexible cross-document viewing technology; WAIS Inc., Menlo Park, CA, for its Internet publishing protocols; and Genasys II, Inc., Ft. Collins, CO. This combination of contractors and technologies will provide a cutting-edge prototype of the next generation of on-line information systems.
Roberta Y. Rand, Department of Agriculture, National Agricultural Library (http://www.nalusda.gov) - Internet: email@example.com
Kevin Schaefer, NASA HQ - Internet: kschaefer@mtpe.nasa.hq.gov
1. Some of the material in this paper is from "Our Changing Planet: The FY 1991 Research Plan of the U.S. Global Change Research Program" (Washington, DC: Committee on Earth and Environmental Services, October 1990); and "The U.S. Global Change Data and Information System Implementation Plan" (Washington, DC: Committee on Environment and Natural Resources Research, 1994).
2. Library Hi Tech, "Special Double Issue-Global Change and the Role of Libraries," Volume 13 Number 1-2, Consecutive Issue #49-50, 1995.
3. U.S. Global Change Research Program, Policy Statement on Data Management for Global Change Research, July 1991. DOE/EP-0001P.