Journal of the American Society for Information Science and Technology (JASIST) -- Table of Contents

Contributed by
Richard Hill
American Society for Information Science and Technology
Silver Spring, Maryland, USA
Fax: (301) 495-0810
Phone: (301) 495-0900
rhill@asis.org

VOLUME 52, NUMBER 5

[Note: below, the contents of Bert Boyce's "In This Issue" have been cut into the Table of Contents.]

CONTENTS

Editorial

In this issue
Bert R. Boyce
Page 369

Research

A Noninformetric Analysis of the Relationship between Citation Age and Journal Productivity
L. Egghe
Page 371, Published online 1 February 2001

In this issue Egghe provides an explanation, based upon the central limit theorem, for a regularity observed by Wallace between citation age and journal productivity. This implies that there is no informetric explanation, but rather the observation of a statistical effect. He then examines the Leimkuhler curve, showing the arcs at the tail to be a mathematical rather than an informetric artifact. The relationship between the fraction of multinational publications of a country and the country's fractional score is also shown to be probabilistic in nature. However, the relationship between the Price index and median age requires both probabilistic and informetric explanation, and the cumulative first citation distribution seems best explained with a curve incorporating Lotka's exponent and thus has high informetric value.

Automatic Cataloguing and Searching for Retrospective Data by Use of OCR Text
Yuen-Hsien Tseng
Page 378, Published online 5 February 2001

We also include four papers concerning the automatic characterization of documents and queries. First, using a test collection of 7990 OCR-scanned book pages from 500 books in four languages, and 30 queries (15 content and 15 known-item), Tseng applies variable-length n-gram indexing and byte-size normalization. Document terms are weighted at 1 plus the log of term frequency, except for those on the first two pages of a book, which are incremented by eight rather than one. Each occurrence of a query term increments its weight by 1 plus the cube of the n-gram length minus 1. Known-item searches are limited to the first two pages of each book. The precision and recall results achieved second place in a contest entered. A similar approach has promise with Chinese text.
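
The weighting rules in this summary can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Tseng's implementation: the function names are hypothetical, the logarithm is assumed to be natural, "incremented by eight" is read as each first-two-page occurrence adding 8 to the term frequency, and the query increment is read as 1 + (n - 1)^3 for an n-gram of length n.

```python
import math
from collections import Counter

FRONT_PAGE_INCREMENT = 8  # per-occurrence count on the first two pages (from the summary)
BODY_INCREMENT = 1        # per-occurrence count on all other pages


def document_term_weights(pages):
    """pages: one list of terms (n-grams) per page, in book order.

    Assumed reading: occurrences on pages 1-2 add 8 to the term frequency,
    and the final weight is 1 + ln(term frequency).
    """
    counts = Counter()
    for page_number, terms in enumerate(pages, start=1):
        step = FRONT_PAGE_INCREMENT if page_number <= 2 else BODY_INCREMENT
        for term in terms:
            counts[term] += step
    return {term: 1.0 + math.log(count) for term, count in counts.items()}


def query_term_weights(query_terms):
    """Assumed reading: each occurrence adds 1 + (n - 1)**3, n = n-gram length."""
    weights = Counter()
    for term in query_terms:
        weights[term] += 1 + (len(term) - 1) ** 3
    return dict(weights)


pages = [["ab", "ab"], ["cd"], ["ab", "ef"]]
doc_weights = document_term_weights(pages)
query_weights = query_term_weights(["ab", "abc", "ab"])
```

Under these assumptions a term seen once on an ordinary page gets weight 1, while longer query n-grams dominate shorter ones because of the cubic increment.
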
An Experimental Study in Automatically Categorizing Medical Documents
Berthier Ribeiro-Neto, Alberto H.F. Laender, and Luciano R.S. de Lima
Page 391, Published online 5 February 2001

In another automatic characterization paper, Ribeiro-Neto et alia test their coding algorithm, which assigns International Code of Diseases category codes to medical documents, against a file of 20,569 patient records. The ICD codes are represented as a directed acyclic graph, and supplemented with acronym and synonym dictionaries for the codes. For each section of each document, the acronyms and synonyms are converted to code strings and root node codes are identified. A window of document terms around each root node term is created, and the longest path from the graph including these terms is extracted. These codes are assigned to the document in a ranked order by relative path length for that root. Using documents with specialist-assigned ICD codes as an ideal set, 19,651 were categorized at between 70 and 80% for all recall levels, while 918 were not. However, specialists made incorrect assignments in 589 documents, and in 391 made assignments not supported by the text, though these may have been the result of additional information. In only 158 cases was the algorithm clearly incorrect.

Automatic Query Expansion via Lexical-Semantic Relationships
Jane Greenberg
Page 402, Published online 9 February 2001

Next, using 42 queries, in the form of Boolean statements with free text terminology, collected from MBA students and the ABI/Inform database, Greenberg maps against the ProQuest Controlled Vocabulary, selecting those queries that contained at least one ProQuest term. These were searched in initial form, in a form mapped from ProQuest, and using expansions that took all synonyms, all narrower terms, all broader terms, and all related terms. Greenberg conducted all searches on Dialog and subtracted the initial and mapped results from the other returns to gauge the expansions' effectiveness. Relevance judgements were made on the basis of topical matching (aboutness) by the contributors of the queries, who reviewed the union set of the responses to the query forms, where each retrieved list was limited to 15 or fewer citations. If the retrieved set was under 16, all were presented; if between 16 and 100, the top 15 ranked by similarity to the query (Dice Coefficient) were used; if above 100, a random sample of size one hundred was used for the similarity ranking. Broader terms and Related terms each improved recall nearly 100%, while Narrower terms increased the baseline from .266 to .473. Synonyms improved from the .226 base to .369. The baseline precision of .794 was reduced to .766 by the use of synonyms, to .733 by the use of narrower terms, to .544 by the use of related terms, and to .595 by the use of Broader terms.
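
The rule for cutting each retrieved list down to 15 citations can be sketched as follows. This is a minimal illustration, not Greenberg's procedure as run on Dialog: the function names are hypothetical, and treating the query and each citation as a set of terms for the Dice Coefficient is an assumption, since the summary does not say how the texts were represented.

```python
import random


def dice_coefficient(a_terms, b_terms):
    """Dice Coefficient of two term collections, treated as sets (assumption)."""
    a, b = set(a_terms), set(b_terms)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def citations_to_judge(query_terms, retrieved, limit=15, sample_cap=100, rng=None):
    """retrieved: list of (citation_id, citation_terms) pairs.

    Under 16 results: present all. 16 to 100: top 15 by Dice similarity.
    Over 100: draw a random sample of 100 first, then rank that sample.
    """
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch repeatable
    if len(retrieved) <= limit:
        return [cid for cid, _ in retrieved]
    if len(retrieved) > sample_cap:
        pool = rng.sample(retrieved, sample_cap)
    else:
        pool = retrieved
    ranked = sorted(
        pool,
        key=lambda item: dice_coefficient(query_terms, item[1]),
        reverse=True,
    )
    return [cid for cid, _ in ranked[:limit]]
```

The lists returned for the different query forms would then be merged into the union set that each query contributor judged.
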
Modeling User Interest Shift Using a Bayesian Approach
Wai Lam and Javed Mostafa
Page 416, Published online 1 February 2001

In a different approach to query modification, Lam and Mostafa address information filter modification in response to changing user needs. Such filters assume a stability of user need when, in fact, information needs evolve at unpredictable speeds and in unpredictable ways, adding to the normal relevance assessment problems. Their passive filter stores and ranks material received for later review, building its profile from a subset of MeSH headings based on user relevance feedback assessments of documents presented. Documents are classed using a cosine similarity measure, the user provides binary interest weights for each class, and their running average is maintained as the relevance probability of the class, which is used to rank all classes after the first. Positive feedback also modifies a second vector used to select the initial class, providing a means of relearning changing interests. Since this relearning requires considerable iterations, with degraded interim results, a means of quick shift detection is needed. Using the sequence of feedback data and Bayes' theorem, with associated costs of a wrong decision, the posterior probability that a shift has occurred can be computed. An upward shift will result in the new class and the old most probable class each being assigned half of the sum of their probabilities. A downward shift of the most probable class will use the user profile vector to identify the class weights to sum and distribute, since the class vector values of the other classes will be near zero. Simulation studies indicate that the system is able to recognize and correct for interest shifts.
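
Three pieces of this description lend themselves to a short sketch: the running average of binary feedback, a two-hypothesis application of Bayes' theorem, and the redistribution after an upward shift. The code below is an illustration only, not Lam and Mostafa's model: all names are hypothetical, the likelihood values passed to `shift_posterior` are assumed inputs that the summary does not specify, and the cost-based decision step is left out.

```python
class ClassInterest:
    """Running average of binary interest feedback for one document class."""

    def __init__(self):
        self.total = 0
        self.count = 0

    def update(self, interested):
        self.total += 1 if interested else 0
        self.count += 1

    @property
    def probability(self):
        # The running average serves as the relevance probability of the class.
        return self.total / self.count if self.count else 0.0


def shift_posterior(prior_shift, likelihood_if_shift, likelihood_if_stable):
    """Bayes' theorem for two hypotheses: a shift occurred vs. interests are stable.

    The likelihoods of the observed feedback sequence are assumed to be given.
    """
    numerator = prior_shift * likelihood_if_shift
    evidence = numerator + (1 - prior_shift) * likelihood_if_stable
    return numerator / evidence if evidence else 0.0


def apply_upward_shift(probabilities, new_class, old_top_class):
    """Give the new class and the old top class half of their summed probability."""
    shared = (probabilities[new_class] + probabilities[old_top_class]) / 2
    updated = dict(probabilities)
    updated[new_class] = shared
    updated[old_top_class] = shared
    return updated


interest = ClassInterest()
for feedback in (True, False, True, True):
    interest.update(feedback)
```

A filter built this way would call `apply_upward_shift` only once the posterior, weighed against the cost of a wrong decision, justifies treating the feedback as a real change of interest.
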
General-Purpose Compression for Efficient Retrieval
Adam Cannane and Hugh E. Williams
Page 430, Published online 5 February 2001

The final paper in this issue is concerned with compression techniques that can speed up retrieval since disc seek and transfer cost savings can exceed decompression costs. Cannane and Williams describe an algorithm that identifies unique character strings occurring at least twice by way of multiple passes, replaces them with a reference number, and continues to form a hierarchy of longer strings that may contain references to shorter ones. The process terminates when no further duplicate strings are to be found. The representation created and an associated string dictionary allow decompression at any random access point. Using the Canterbury collection for compression experiments, and the TREC Wall Street Journal and WEBDOC files, and databases of genomic records, weather data, and geographic data, compression is found to be superior to GZIP, COMPRESS, and the Huffman coding scheme, but not as effective as BZIP2, although decompression is faster than BZIP2.
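
The idea of replacing repeated strings with references, and letting longer entries refer to shorter ones, can be illustrated with a much simpler scheme. The sketch below is not Cannane and Williams's algorithm: it swaps in a greedy repeated-pair replacement (in the style of grammar-based compressors), and the function names are hypothetical. It does share the two properties the summary highlights: the dictionary is hierarchical, and any symbol in the output can be expanded on its own.

```python
from collections import Counter


def compress(text):
    """Return (sequence, dictionary).

    Sequence symbols are single characters (str) or references (int).
    dictionary[i] is the pair of symbols that reference i stands for.
    """
    seq = list(text)
    dictionary = []
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = max(pairs.items(), key=lambda item: item[1])
        if count < 2:
            break  # no adjacent pair occurs twice any more
        ref = len(dictionary)
        out, i, replaced = [], 0, 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(ref)
                i += 2
                replaced += 1
            else:
                out.append(seq[i])
                i += 1
        if replaced < 2:
            break  # occurrences overlapped (e.g. "aaa"); stop without replacing
        dictionary.append(pair)
        seq = out
    return seq, dictionary


def expand(symbol, dictionary):
    """Expand one symbol; entries may refer to earlier, shorter entries."""
    if isinstance(symbol, str):
        return symbol
    left, right = dictionary[symbol]
    return expand(left, dictionary) + expand(right, dictionary)


def decompress(seq, dictionary):
    return "".join(expand(symbol, dictionary) for symbol in seq)


text = "abracadabra abracadabra"
seq, dictionary = compress(text)
```

Because `expand` needs only the dictionary and one symbol, decoding can start at any position in the compressed sequence, which is the random-access property described above.
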

Book Reviews

Digital Capital: Harnessing the Power of Business Webs, by Don Tapscott, David Ticoll, & Alex Lowy
Shana R. Ponelis
Page 438, Published online 1 February 2001

A Place at the Table: Participating in Community Building, by Kathleen de la Peña McCook
Marianne Orme
Page 439, Published online 1 February 2001
