Journal of the American Society for Information Science and Technology
(JASIST) -- Table of Contents
Contributed by Richard Hill
 American Society for Information Science and Technology
 Silver Spring, Maryland, USA
 Fax: (301) 495-0810
 Phone: (301) 495-0900
 rhill@asis.org
   
 VOLUME 52, NUMBER 5
 [Note: below, the contents of Bert Boyce's "In This Issue" have been cut into the
 Table of Contents.]
 CONTENTS
 Editorial
In this issue
Bert R. Boyce
 Page  369
 Research    
A Noninformetric Analysis of the Relationship between Citation Age and Journal Productivity
L. Egghe
 Page  371,  Published online 1 February 2001
   In this issue Egghe provides an explanation, based upon the central limit
theorem, for a regularity observed by Wallace between citation age and journal
productivity. This implies that there is no informetric explanation, but
rather the observation of a statistical effect. He then examines the Leimkuhler
curve, showing the arcs at the tail to be a mathematical rather than an
informetric artifact. The relationship between the fraction of multinational
publications of a country and the country's fractional score is also shown to be
probabilistic in nature. However, the relationship between the Price index and
median age requires both probabilistic and informetric explanation, and the
cumulative first citation distribution seems best explained with a curve
incorporating Lotka's exponent and thus has high informetric value.

Automatic Cataloguing and Searching for Retrospective Data by Use of OCR Text
Yuen-Hsien Tseng
 Page 378,  Published online 5 February 2001
   We also include four papers concerning the automatic characterization of documents
and queries. First, using a test collection of 7990 OCR-scanned book pages from
500 books in four languages, and 30 queries (15 content and 15 known-item), Tseng
applies variable-length n-gram indexing and byte-size normalization. Document
terms are weighted at 1 plus the log of term frequency, except for those on the
first two pages of a book, which are incremented by eight rather than one. Each
occurrence of a query term increments its weight by 1 plus the cube of the n-gram
length minus 1. Known-item searches are limited to the first two pages of each book.
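The weighting described above might be sketched as follows. This is one reading of the summary, not Tseng's implementation: the natural logarithm is assumed, "incremented by eight not one" is read as 8 plus the log of term frequency for terms on the first two pages, and "the cube of the n-gram length minus 1" is read as (n - 1) cubed. The function names are hypothetical.

```python
import math


def document_term_weight(term_frequency: int, on_first_two_pages: bool) -> float:
    """Weight of an n-gram within a document.

    Assumed reading: 1 + log(tf) in general, 8 + log(tf) when the term
    occurs on the first two pages of a book. The log base is an assumption.
    """
    if term_frequency < 1:
        return 0.0
    base = 8.0 if on_first_two_pages else 1.0
    return base + math.log(term_frequency)


def query_term_weight(ngram: str, occurrences: int) -> float:
    """Weight of an n-gram within a query.

    Assumed reading: each occurrence adds 1 + (n - 1)**3, where n is the
    n-gram length, so longer n-grams count far more than short ones.
    """
    n = len(ngram)
    return float(occurrences * (1 + (n - 1) ** 3))
```

Under these assumptions a term seen once on a title page scores 8.0 against 1.0 elsewhere, which fits the known-item searches being restricted to the first two pages.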
The precision and recall results placed second in a contest in which the system
was entered. A similar approach has promise with Chinese text.

An Experimental Study in Automatically Categorizing Medical Documents
Berthier Ribeiro-Neto, Alberto H.F. Laender, and Luciano R.S. de Lima
 Page  391,     Published online 5 February 2001
   In another automatic characterization paper, Ribeiro-Neto et al. test their
coding algorithm, which assigns International Code of Diseases (ICD) category
codes to medical documents, against a file of 20,569 patient records. The ICD
codes are represented as a directed acyclic graph, supplemented with acronym and
synonym dictionaries for the codes. For each section of each document, the acronyms
and synonyms are converted to code strings and root node codes are identified.
A window of document terms around each root node term is created, and the longest
path from the graph that includes these terms is extracted. These codes are
assigned to the document in ranked order by relative path length for that root.
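The path extraction step can be illustrated with a minimal sketch. It is not the authors' algorithm: the summary does not say how terms are matched to graph nodes or how ties are broken, so the sketch assumes each code carries a single term and keeps the first longest path found. The codes and terms below are invented placeholders, not real ICD entries, and the function name is hypothetical.

```python
from typing import Dict, List, Set

# Each code maps to its more specific child codes (assumed acyclic).
Graph = Dict[str, List[str]]


def longest_matching_path(graph: Graph, terms: Dict[str, str],
                          root: str, window: Set[str]) -> List[str]:
    """Return the longest path starting at `root` in which every code's
    term occurs in the window of document terms."""
    if terms[root] not in window:
        return []
    best: List[str] = []
    for child in graph.get(root, []):
        candidate = longest_matching_path(graph, terms, child, window)
        if len(candidate) > len(best):
            best = candidate
    return [root] + best


# Invented toy data: placeholder codes, one term per code.
toy_graph: Graph = {"R": ["R.1", "R.2"], "R.1": ["R.1.a"]}
toy_terms = {"R": "fracture", "R.1": "femur", "R.2": "tibia", "R.1.a": "neck"}
toy_window = {"fracture", "femur", "neck", "left"}

path = longest_matching_path(toy_graph, toy_terms, "R", toy_window)
```

Here `path` follows the branch whose terms all appear in the window; the ranking by relative path length mentioned in the summary is left out of the sketch.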
   Using documents with specialist-assigned ICD codes as an ideal set, 19,651
were categorized at between 70 and 80% for all recall levels, while 918 were
not. However, specialists made incorrect assignments in 589 documents, and in
391 made assignments not supported by the text, although these may have been the
result of additional information. In only 158 cases was the algorithm clearly
incorrect.

Automatic Query Expansion via Lexical-Semantic Relationships
Jane Greenberg
 Page 402,   Published online 9 February 2001
   Next, using 42 queries in the form of Boolean statements with free-text
terminology, collected from MBA students and the ABI/Inform database, Greenberg
maps against the ProQuest Controlled Vocabulary, selecting those queries that
contained at least one ProQuest term. These were searched in initial form, in a
form mapped from ProQuest, and using expansions that took all synonyms, all
narrower terms, all broader terms, and all related terms. Greenberg conducted all
searches on Dialog and subtracted the initial and mapped results from the other
returns to gauge the expansions' effectiveness. Relevance judgements were made on
the basis of topical matching (aboutness) by the contributors of the queries, who
reviewed the union set of the responses to the query forms, where each retrieved
list was limited to 15 or fewer citations. If the retrieved set was under 16, all
were presented; if between 16 and 100, the top 15 ranked by similarity to the query
(Dice coefficient) were used; and if above 100, a random sample of size one
hundred was used for the similarity ranking. Broader terms and related terms
each improved recall nearly 100%, while narrower terms increased the baseline
from .266 to .473. Synonyms improved from the .226 base to .369. The baseline
precision of .794 was reduced to .766 by the use of synonyms, to .733 by the
use of narrower terms, to .544 by the use of related terms, and to .595 by the use
of broader terms.

Modeling User Interest Shift Using a Bayesian Approach
Wai Lam and Javed Mostafa
 Page 416,  Published online 1 February 2001
   In a different approach to query modification, Lam and Mostafa address information
filter modification in response to changing user needs. Such filters assume a
stability of user need when, in fact, information needs evolve at unpredictable
speeds and in unpredictable ways, adding to the normal relevance assessment problems.
Their passive filter stores and ranks material received for later review, building
its profile from a subset of MeSH headings based on user relevance feedback
assessments of documents presented. Documents are classed using a cosine similarity
measure, the user provides binary interest weights for each class, and their running
average is maintained as the relevance probability of the class, which is used to
rank all classes after the first. Positive feedback also modifies a second vector
used to select the initial class, providing a means of relearning changing
interests. Since this relearning requires many iterations, with degraded
interim results, a means of quick shift detection is needed. Using the sequence
of feedback data and Bayes' theorem, with associated costs of a wrong decision,
the posterior probability that a shift has occurred can be computed. An upward
shift will result in the new class and the old most probable class each being
assigned half of the sum of their probabilities. A downward shift of the most
probable class will use the user profile vector to identify the class weights
to sum and distribute, since the class vector values of the other classes will
be near zero. Simulation studies indicate that the system is able to recognize
and correct for interest shifts.

General-Purpose Compression for Efficient Retrieval
Adam Cannane and Hugh E. Williams
 Page 430,  Published online 5 February 2001
   The final paper in this issue is concerned with compression techniques that
can speed up retrieval, since disc seek and transfer cost savings can exceed
decompression costs. Cannane and Williams describe an algorithm that, by way of
multiple passes, identifies unique character strings occurring at least twice,
replaces them with a reference number, and continues to form a hierarchy of longer
strings that may contain references to shorter ones. The process terminates when
no further duplicate strings are found. The representation created and an
associated string dictionary allow decompression at any random access point. Using
the Canterbury collection for compression experiments, along with the TREC Wall
Street Journal and WEBDOC files and databases of genomic records, weather data,
and geographic data, compression is found to be superior to GZIP, COMPRESS, and
the Huffman coding scheme, but not as effective as BZIP2, although decompression
is faster than BZIP2.

 Book Reviews
Digital Capital: Harnessing the Power of Business Webs, by Don Tapscott, David Ticoll, & Alex Lowy
Shana R. Ponelis
 Page  438,    Published online 1 February 2001

A Place at the Table: Participating in Community Building, by Kathleen de la Peña McCook
Marianne Orme
 Page 439,  Published online 1 February 2001