A Computing Research Repository
Joseph Y. Halpern
Computer Science Department
Computing research relies heavily on the rapid dissemination of results. As a result, the formal process of submitting papers to journals has been augmented by other, more rapid, dissemination methods. Originally these involved printed documents, such as technical reports and conference papers. Then researchers started taking advantage of the Internet, putting papers on ftp sites and later on various web sites. But these resources were fragmented. There was no single repository to which researchers from the whole field of computing could submit reports, no single place to search for research results, and no guarantee that information would be archived at the end of a research project.
This changed in September 1998. Through a partnership of ACM, the LANL (Los Alamos National Laboratory) e-Print archive, and NCSTRL (Networked Computer Science Technical Reference Library), an online Computing Research Repository (CoRR) was established. The Repository is available to all members of the community at no charge. They can submit papers, browse and search papers currently on the Repository, and subscribe to get notification of new submissions.
In the rest of this article, I briefly describe how the Repository was set up and discuss some policy issues.
ACM, like many other publishers, has started to provide its journals in both electronic and print versions. Feedback from ACM members has welcomed this transition, but the members have made it clear that they anticipate much more fundamental change in scientific communication than simply converting traditional journals to a different medium. In May 1997, a committee was formed under the auspices of the ACM Publications Board to consider one such change: setting up an online repository for computing research. (Appendix A gives the membership of the committee, which consisted mainly of people active in digital libraries and electronic publishing.)
The committee discussed three options for the design of the Repository.
- The first option was to become part of the LANL repository. LANL started as a repository for high-energy physics eprints in 1991, several years before the introduction of the Web. It pioneered the concept of an open-access repository for fast publication of scientific research. By eliminating the time-consuming and expensive process of peer review, it has transformed the dissemination of research in several disciplines. It now covers most of physics and has expanded to include repositories for nonlinear sciences, mathematics, and computation and language. The LANL archives are sometimes called a "pre-print" service, and indeed many of the eprints are subsequently published in conventional journals, but they are intended as long-term archives, with much greater permanence than typical Web sites.
As a base for a computing repository, LANL has many attractive features. Perhaps the most important is that it clearly works and works well. It now has over 75,000 eprints, is growing at the rate of about 25,000/year, handles over 70,000 transactions/day, and has over 35,000 users. Thanks to funding from the Department of Energy and the National Science Foundation, it also has a full-time staff. It is mirrored in 16 countries, has reasonable search facilities, and offers services such as email notification of new submissions of interest.
The ACM committee decided against this option primarily because the LANL interface was not open, in the sense that it did not provide an interface to which other repositories could link.
- The second option considered was to become a node in NCSTRL. NCSTRL is essentially a common interface to the technical report collections of its (currently over 100) member institutions. It has been funded by DARPA and the National Science Foundation, with most of the technical work recently being carried out at Cornell University. The most important features of NCSTRL, from our point of view, were that it was explicitly designed with an open interface and it was a computer science effort. On the other hand, NCSTRL did not have all the software necessary for running a repository.
- The third option was to build a new system from scratch. This had the obvious advantage that we could design our own system, which hopefully would have exactly the attributes we required, but had the equally obvious disadvantage that to do so would take time, money, and expertise.
We settled on a hybrid approach that combines the best features of LANL and NCSTRL, and secured the cooperation of the two groups. This allows us to use the well-tested LANL software for submission, notification, and searching, while still taking advantage of the NCSTRL architecture. The NCSTRL architecture will make it easy to build new gateways from which to access the files, with a more user-friendly interface and new features. From the point of view of the NCSTRL interface, LANL is now just a node on NCSTRL.
We anticipate that our use of an open protocol will encourage other scholarly archives to join in this framework. The result could be a global multi-disciplinary research collection that could have substantial impact on the nature of scholarly publishing.
- How do authors submit documents?
Authors send their documents to the LANL repository by email, by ftp, or by using a Web interface provided by LANL. The LANL philosophy is to automate the process as much as possible. Authors are expected to provide their papers in specified formats. They are required to provide abstracts and to classify their papers by subject area (see below). Programs at LANL automatically check submissions for completeness and send email to authors if there are any omissions.
These procedures have been refined over many years. Fine tuning may be needed for the slightly different needs of computing, but no fundamental changes are anticipated.
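The automated completeness check described above can be pictured with a minimal sketch. It is not the LANL software: the `Submission` fields, the `find_omissions` function, and the `ACCEPTED_FORMATS` set are hypothetical names chosen for illustration, and the format list simply mirrors the formats this article mentions.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: a simplified format list based on the formats named in this article.
ACCEPTED_FORMATS = {"tex", "html", "pdf", "postscript"}


@dataclass
class Submission:
    """Hypothetical record of what an author supplies with a paper."""
    title: str = ""
    authors: List[str] = field(default_factory=list)
    abstract: str = ""
    subject_area: str = ""
    file_format: str = ""


def find_omissions(sub: Submission) -> List[str]:
    """Return a list of problems; an empty list means the submission is complete."""
    problems = []
    if not sub.title.strip():
        problems.append("missing title")
    if not sub.authors:
        problems.append("missing authors")
    if not sub.abstract.strip():
        problems.append("missing abstract")
    if not sub.subject_area.strip():
        problems.append("missing subject area")
    if sub.file_format.lower() not in ACCEPTED_FORMATS:
        problems.append(f"unsupported format: {sub.file_format!r}")
    return problems
```

In a real system, a non-empty result would presumably trigger the email to the author; that step is omitted here.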
- How is the Repository organized?
Authors classify their submissions in two ways: by choosing a subject area from a list, and by choosing a primary classification from among the roughly 100 third-level headings in the 1998 ACM Computing Classification System. The ACM classification scheme provides us with a relatively stable scheme that covers all research in computing.
The subject areas are not mutually exclusive, nor do they (yet) provide complete coverage of the field. On the other hand, we hope that they better reflect the active areas of research in Computer Science. The initial list of subject areas is given in Appendix B.
We expect to add more subject areas and subdivide current subject areas depending on demand. Note that Computation and Language is one of the subject areas. The Computation and Language archive, which has been run from LANL since 1994, has been merged with CoRR; papers from that archive have been merged into the Computation and Language subject area.
- Are submissions refereed?
Submissions are not refereed. They are checked (by moderators -- see Appendix B) only for relevance to the topic area, not for quality or novelty. Papers passing this simple check will appear in the Repository within 24 hours.
- What facilities are provided?
Authors can submit papers through the LANL interface. They can view recent submissions (or all submissions) and search through both LANL and NCSTRL.
Because CoRR provides an open interface, the repository is both a collection at LANL and a node on NCSTRL. The LANL and NCSTRL user interfaces are quite different; users can choose either. Currently, searching covers only the metadata provided by authors, which includes abstracts, titles, and authors. We are looking into providing full-text search. Readers can also request to be notified of new submissions that fall within their search profile.
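To make the idea of metadata-only search concrete, here is a minimal sketch under stated assumptions. The `Metadata` record, `matches`, and `search` are hypothetical names, and the all-terms substring match is an illustrative choice, not a description of how LANL or NCSTRL actually search.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metadata:
    """Hypothetical author-supplied metadata for one paper."""
    title: str
    authors: List[str]
    abstract: str


def matches(record: Metadata, query: str) -> bool:
    """True if every query term occurs in the title, abstract, or author names."""
    haystack = " ".join([record.title, record.abstract, *record.authors]).lower()
    return all(term in haystack for term in query.lower().split())


def search(records: List[Metadata], query: str) -> List[Metadata]:
    """Return the records whose metadata contains all query terms."""
    return [r for r in records if matches(r, query)]
```

A notification profile could reuse the same `matches` test: each new submission is compared against a reader's stored query, and a match triggers a notice. Note that the full text of the paper is never consulted, which is the limitation described above.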
- What about copyright?
Submission to the Repository does not require a transfer of copyright. Authors will continue to retain copyright when they submit (although they may have to transfer rights if they wish to publish in certain journals).
- How long will papers stay on the Repository?
The intention is to have the Repository be permanent (but see also the next question). Authors have 24 hours to withdraw or revise a paper. Updated versions of a paper can be posted at any time, but versions not removed or changed within 24 hours of submission will remain on the Repository. The version of the paper accessed will be the most recent version, but there will be pointers to the earlier versions.
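The version policy above can be sketched as a small data structure. This is an illustration only: `PaperRecord` and its methods are hypothetical, and the sketch assumes that a revision made inside the 24-hour window replaces the pending version in place and keeps its original timestamp, which the policy does not spell out.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

GRACE_PERIOD = timedelta(hours=24)  # window for withdrawing or revising


class PaperRecord:
    """Hypothetical version history for one paper, oldest version first."""

    def __init__(self) -> None:
        self._versions: List[Tuple[datetime, str]] = []

    def post(self, content: str, now: datetime) -> None:
        """Revise the latest version if still in its grace period, else add a new one."""
        if self._versions and now - self._versions[-1][0] < GRACE_PERIOD:
            submitted_at, _ = self._versions[-1]
            self._versions[-1] = (submitted_at, content)  # assumption: timestamp kept
        else:
            self._versions.append((now, content))

    def withdraw_latest(self, now: datetime) -> bool:
        """Remove the latest version only if it is still inside the grace period."""
        if self._versions and now - self._versions[-1][0] < GRACE_PERIOD:
            self._versions.pop()
            return True
        return False

    def current(self) -> Optional[str]:
        """The version readers see by default: the most recent one."""
        return self._versions[-1][1] if self._versions else None

    def earlier_versions(self) -> List[str]:
        """Older versions that remain reachable through pointers."""
        return [content for _, content in self._versions[:-1]]
```

Once the grace period has passed, a version can no longer be removed by `withdraw_latest`; later updates accumulate as new versions while the old ones stay reachable.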
- How will this affect journal publication?
In the long term, well-managed repositories, such as CoRR, will clearly change the role of conventional journals. There are fields (such as medicine and chemistry) for which publishers will not publish papers that have appeared on the Web (even on an author's personal web site). This has not been the case in computer science, and is unlikely to become so. Researchers have come to expect that they will be able to make their papers available rapidly at online sites, such as CoRR, while still submitting their papers to conventional journals. ACM's interim copyright policy explicitly encourages this model. The Association has committed to allowing documents to remain on the Repository after publication in ACM journals; it will also be possible to link to the definitive version on the ACM Digital Library.
The ACM Publications Board will work to convince other publishers to adopt a similar policy. Submitting a paper to the Repository should not affect an author's ability to publish it in a journal. Some journals may require authors to remove a paper from the Repository once it appears. If this is the case, we will remove a paper at the author's request, but we hope that this will be the exception rather than the rule.
- What submission formats are acceptable?
For many years, physicists have used TeX as the standard format for research papers, because of the control that it provides for representing mathematics. Therefore, the LANL archives provide excellent support for several versions of TeX. Theoretical computer scientists also use TeX, while PostScript has been a favorite format for computing technical reports. Currently, authors can submit documents to CoRR using TeX/LaTeX/AMSTeX, HTML+GIF, PDF, or PostScript. However, if TeX (or one of its variants) is available, it is strongly preferred to PostScript or PDF. (See http://xxx.lanl.gov/help/faq/whytex for the reasons.) If an author submits PostScript or PDF that was generated from some variant of TeX, the submission will be rejected in favor of the TeX source.
- What happens when electronic formats change?
Long-term preservation of documents in CoRR is clearly a serious concern. We expect that software will be written to automatically convert the files in the Repository to whatever format is current, just as there is now software to convert PostScript to PDF.
- How does ACM view the Repository?
ACM strongly supports the development of the Repository as part of a broad strategy of providing the services that its members require. However, ACM sees CoRR as a service to the whole computing community. Already there have been informal discussions with other societies about co-sponsoring the Repository or developing interlinked services. This interest comes from both the United States and Europe. Clearly, the community would be best served by the various societies collaborating, rather than having them develop competing strategies.
For its own part, ACM is planning to link CoRR with other ACM activities. For example, users of the ACM Digital Library will be able to include CoRR in their searches. ACM expects to develop policies and procedures to make it easy to submit papers from CoRR to an ACM journal. However, there is certainly no expectation that all CoRR papers will be submitted to ACM journals. Authors are free to submit their papers to any journal of their choice. And some papers on CoRR may never be published at all in the conventional research literature.
- Who is funding the Repository?
Currently, CoRR is riding on the coattails of funding provided to LANL and NCSTRL, and this should suffice for the foreseeable future. The long-run funding situation is not yet clear. The LANL archives have become so important to physics researchers that it is inconceivable that they would let the service be lost. If CoRR is equally successful, surely funds will be found to maintain it.
Providing the basic repository services does not seem to be an expensive proposition. Of course, new development can be expensive, but we should be able to take advantage of work done by other projects, so it may not be necessary to do too much development in-house. At least for now, funding does not appear to be a critical issue.
- How will this affect journals?
This is a period of turmoil in scientific publishing; nobody can predict the changes that will happen over the next few years. Librarians and researchers are concerned over the high prices of journals. As a result, much attention has been focused on alternative methods of disseminating research. The impact of CoRR and similar repositories on conventional journals is, no doubt, a question that many journal publishers are asking. If CoRR is successful, then it could have an adverse impact on journal sales. Yet, LANL has been providing eprint archives in physics since 1991 without apparent impact on conventional journals. Publishers, such as ACM, provide value-added services to authors and readers. The world will have to change dramatically before their services cease to be in demand.
- What is planned for the future?
Some plans currently being discussed include:
- constructing a gateway on the ACM web site and improving the interface at NCSTRL,
- adding a comment facility,
- adding facilities to allow automatic forward pointers to relevant papers that were not referenced in the original submission (such as papers that appeared after the submission),
- expanding the scope of the Repository so that it includes conference proceedings and, perhaps, more ephemeral information like job postings and conference listings,
- using the Repository as a testbed for building better information retrieval facilities.
CoRR is meant to be a service to the community. If it is successful -- as the Physics archive at Los Alamos has been -- then it could have a major impact on research. Indeed, the combination of Mathematics, Physics, and Computer Science already at Los Alamos could provide the nucleus for a major interdisciplinary digital library. Whether it succeeds depends on community response. Feedback and suggestions are more than welcome; send them to the author of this paper or email@example.com.
Appendix A: Committee Composition
Ronald Boisvert, NIST
James Cohoon, Virginia
Peter Denning (former Chair of the ACM Publications Board)
Jon Doyle, MIT
Edward Fox, Virginia Tech
James Gray, Microsoft
Joseph Halpern, Cornell (Chair)
Carl Lagoze, Cornell
Bernard Lang, INRIA
Michael Lesk, Bellcore
Steve Minton, ISI
Hermann Maurer, Graz, Austria
Andrew Odlyzko, AT&T
Michael O'Donnell, U. Chicago
Bernard Rous, ACM
Jerome Saltzer, MIT
Erik Sandewall, Linköping, Sweden
Stuart Shieber, Harvard
Jeffrey Ullman, Stanford
Rebecca Wesley, Stanford
Ian Witten, Waikato, New Zealand
Appendix B: Subject Areas and Moderators
Architecture - William Waite
Artificial Intelligence - Erik Sandewall
Computation and Language - Stuart Shieber
Computational Science, Engineering, and Finance - Ronald Boisvert
Computational Complexity - Lane Hemaspaandra
Computational Geometry - Joseph O'Rourke
Computers and Society - Lorrie Cranor
Computer Vision and Pattern Recognition - Gio Wiederhold and Oscar Firschein
Cryptography and Security - Mihir Bellare
Databases - James Gray
Data Structures and Algorithms - David Karger
Digital Libraries - Michael Lesk
Discrete Mathematics - Joseph O'Rourke
Distributed Computing - Tushar Chandra
General Literature - Joseph Halpern
Graphics - Alain Fournier
Human-Computer Interaction - Terry Winograd
Information Retrieval - Bruce Croft
Learning - Thomas Dietterich
Logic in Computer Science - Gopalan Nadathur
Mathematical Software - Ronald Boisvert
Multiagent Systems - Michael Huhns and José Vidal
Multimedia - Richard Muntz
Networking and Internet Architecture - Scott Shenker
Neural and Evolutionary Computing - Jordan Pollack
Numerical Analysis - Ronald Boisvert
Operating Systems - William Waite
Performance - Richard Muntz
Programming Languages - Gopalan Nadathur
Robotics - Bruce Donald
Software Engineering - Peter Wegner
Symbolic Computation - Richard Zippel
Other - Joseph Halpern
I would like to thank Bill Arms and the members of the CoRR committee for all their efforts in setting up the repository, and their feedback on this article.
Copyright © 1998 Joseph Y. Halpern
D-Lib Magazine Access Terms and Conditions