UKWAC: Building the UK's First Public Web Archive


D-Lib Magazine
January 2006

Volume 12 Number 1

ISSN 1082-9873

UKWAC

Building the UK's First Public Web Archive

Steve Bailey
JISC
<steve.bailey@bristol.ac.uk>

Dave Thompson
Wellcome Library
<d.thompson@wellcome.ac.uk>

Abstract

This article discusses the UK Web Archiving Consortium project, outlining the project to date as well as sharing initial lessons learned by the Consortium Partners.

Introduction

For many, the web has become the information source of first resort. However, despite our apparent dependence on this medium, very little attention has been paid to the long-term preservation of websites. There is a danger that invaluable scholarly, cultural and scientific resources are being lost to future generations.

To address this problem, a consortium of six leading UK institutions is working collaboratively on a project to develop a test-bed for selective archiving of UK websites. The organisations are: The National Archives, the British Library, the Joint Information Systems Committee (JISC), the national libraries of Scotland and Wales, and the Wellcome Library. The six organisations ('Partners') founding the UK Web Archiving Consortium (UKWAC) came together because of a shared interest in web archiving and a common need to address its challenges on behalf of their stakeholders.

Using a modified version of the PANDAS (Pandora Digital Archiving System) software, developed by the National Library of Australia [1], Consortium Partners are archiving sites relevant to their interests and institutional expertise. This article outlines the project to date, discusses some of the experiences of web archiving in the UK and some of the initial lessons that are beginning to emerge.

UK Web Archiving Consortium project background

The UK Web Archiving Consortium project began with a belief that web-based material is being lost to future generations. The means to address this was outlined in two studies commissioned by two of the Consortium Partners prior to the project itself [2].

The Consortium chose to use a modified and localised version of the PANDAS software, developed by the National Library of Australia.
It offered a number of useful advantages: firstly, it was the only off-the-shelf application that offered web archiving in a managed environment, and secondly, it was already developed for use in a distributed environment. The basic model of a centralised PANDAS application using distributed contributors was retained. This allowed UKWAC to begin archiving quickly with minimal effort and to concentrate on localisations that made PANDAS more suitable for use in the UK.

As a consortium, the Partners agreed to meet regularly to review progress, discuss issues and plan for the future of the project. Partners meet approximately four times a year; with a geographically dispersed group like this, it is not possible to meet more frequently.

The UK Web Archiving Consortium defined four key aims at the project's inception that reflect the findings of these studies:

1. To procure a licence from the National Library of Australia to use the PANDAS software in the UK
2. To award a contract to an external contractor to provide the common infrastructure for the pilot project
3. To work collaboratively in the achievement of a common searchable archive of selected web sites, investigating solutions to issues such as selection, rights management and digital preservation
4. To evaluate the development of the collaborative infrastructure for web archiving with regard to assessing the permanence and long-term feasibility of such a collaborative enterprise

By the launch of the archive in May 2005, all of the above aims had been achieved by the project, bar the evaluation exercise stated in point four, planned for the winter of 2005.
The key project deliverables, as set out in the UKWAC Aims and Objectives document at the commencement of the project, were as follows:

1. A common permissions form for archiving web sites
2. A common framework of web site selection policies
3. A fully searchable/browsable online archive of sites collected and catalogued by UKWAC members
4. A UKWAC website and discussion list for Partners
5. An evaluation report providing a set of recommendations on how best to proceed with the web archiving project

Once again, by May 2005 all of the above deliverables had been achieved by the project, bar the evaluation report listed in point five, planned for the winter of 2005.

UKWAC project methodology

The process of selecting, gathering, archiving and presenting websites is carried out by each Consortium member using the PANDAS system in a clear and consistent manner, adhering to agreed common standards and policies. The process of archiving web sites follows the basic archival principles of Selection, Acquisition, Description and Access.

Whilst individual Partners select web sites to be archived, building a consortium-based collection adds some steps to this process. Partners check that someone else has not already selected a particular web site for archiving by searching the PANDAS management application. If a site has not been archived or is not already in the process of being archived, its basic metadata is entered into the central database and 'owned' by a Consortium member, who then becomes responsible for that site's life cycle management. In this way efficiency is maintained, in that site owners are contacted by only a single Partner, who then becomes responsible for the relationship with that site provider and acts as a single point of contact on behalf of the Consortium.

Partners 'trade' sites of potential interest with each other to ensure that sites are archived by the most appropriate agency and that valuable material is always considered for archiving.
In this way each Partner has input into the content of the whole archive as well as their part of it.

Individual Partners seek explicit written permission to archive sites directly from site owners before making an archival copy of a site and taking 'ownership' of the metadata record and of the archived site. The process uses a common form, letter and FAQ that ensure that every site owner contacted receives the same information about the archive and the process, and has an individual contact at a UKWAC Partner with whom they communicate. This helps to maintain good working relations with site owners.

Whilst the PANDAS application holds a central repository of metadata about gathered web sites, individual Partners are beginning to catalogue sites for which they take archiving responsibility as part of their existing library catalogue systems. This provides two benefits: firstly, the central repository of archived sites can be both searched and browsed, and secondly, the persistent identifier URLs used to identify each archived website in the central repository can be added to local catalogues, allowing for discovery by users in local collections. This exposes the archive to a potentially wider audience and places archived websites in the appropriate context alongside other non-web resources relating to the same subject area.

Although the appropriate foundations for future preservation activity have been established by the project at this comparatively early stage, there has yet to be any preservation intervention applied to material in the archive. Whilst the principles of individual responsibility work well for collection development and management, the centralised repository allows for future collective preservation intervention to be taken. Partners can work together to share responsibility for the preservation of all archived material by sharing the cost of that work and the risk of intervention to the material.
It is in this regard that the rationale and benefits of the unique and pioneering collaboration of leading UK institutions that comprise the Consortium can best be observed.

Defining characteristics of the project are the clarity and consistency of purpose and methodology adopted within the Consortium, and the way in which UKWAC communicates these aims to content creators and other interested parties. As part of our rights clearance process (see above), each content owner receives a personal approach by the Consortium and clear information about the aims and objectives of the project and the part their website may play in it. Our dialogue with content creators not only helps communicate the specific aims of the project, it also educates those responsible for the creation of future digital content about the importance and challenges of digital preservation and the role they can play in it.

Exemplary and innovative digital preservation/research development

The text of our main project description makes clear the breadth of digital preservation-related activity that the project is addressing. The PANDAS system has already proved itself to be an effective mechanism for harvesting websites under the direction of its creators at the National Library of Australia. The UKWAC Partners have taken advantage of this functionality and refined it to maximise the chances of successfully capturing and rendering selected UK websites.

By adopting the PANDAS system, UKWAC was able to begin archiving selected UK web sites with minimal development time, cost or effort. However, the system has proved sufficiently flexible that lessons learned in the first few months of archiving have been translated into adaptations to the application that make it more suitable for use by UKWAC in the UK. Key modifications have included changing the gather behaviour of HTTrack [3], the web crawl engine underlying PANDAS, to minimise any adverse effect on host servers.
The number of concurrent open connections the application can create has been reduced, as has the maximum file download transfer speed. As UKWAC archives with the express permission of rights holders, we identify our crawler activity with an ownership statement in our user agent string. This illustrates UKWAC's willingness to work with web service providers.

UKWAC has undertaken some innovative developments. PANDAS can be prone to failure when overloaded, and in a distributed environment it can be hard to establish what the system load is at any time and whether or not there is 'free' capacity to process a gathered website. UKWAC implemented a simple 'traffic light' system to indicate when the system has free capacity. A green light indicates available capacity and that gathers can be processed; a red light indicates a high system load and that gathers should not be processed. This simple system has reduced system down time by spreading the processing load.

UKWAC has also chosen to exclude search engines from the archive by using robots.txt. This is done to satisfy web site owners that there will be minimal confusion between live and archived versions of their sites. Users viewing archived sites are made aware that they are in an archive and viewing 'old' material, whilst live sites can be surfaced by commonly used search engines. Policies like this help to ensure that UKWAC maintains an excellent working relationship with site owners.

As this is both an innovative and a pilot project, the team constantly evaluates the way in which the project operates and reacts accordingly in the light of experience. Comparatively minor changes, such as modifications to the aggressiveness of the PANDAS web crawler HTTrack or changes to the permissions form in the light of feedback, are quickly agreed by all Partners and then implemented.
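
As an illustration of the gather settings described above, the sketch below builds an HTTrack command line that limits concurrent connections, caps the transfer rate and sends an identifying user agent. It is a minimal sketch only: the option letters (-c, -A, -F, -O) are HTTrack's documented command-line switches, but the numeric values, the user agent string, the URLs and the names GatherSettings and httrack_args are hypothetical and do not represent UKWAC's actual PANDAS configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatherSettings:
    """Politeness settings for one gather. All defaults are illustrative assumptions."""
    max_connections: int = 2               # concurrent open connections to the host
    max_rate_bytes_per_sec: int = 25_000   # cap on download transfer speed
    # Hypothetical ownership statement; not the string UKWAC actually sends.
    user_agent: str = "ExampleArchiveBot/0.1 (+http://archive.example/contact)"


def httrack_args(url: str, output_dir: str, settings: GatherSettings) -> list[str]:
    """Return an HTTrack argument list that applies the given settings."""
    return [
        "httrack", url,
        "-O", output_dir,                        # where the mirrored site is written
        f"-c{settings.max_connections}",         # number of simultaneous connections
        f"-A{settings.max_rate_bytes_per_sec}",  # maximum transfer rate in bytes/second
        "-F", settings.user_agent,               # user agent field sent in HTTP headers
    ]


if __name__ == "__main__":
    # Only prints the command; it does not start a crawl.
    print(" ".join(httrack_args("http://www.example.org/", "/tmp/gather", GatherSettings())))
```

Lower values make a gather slower but place less load on the host server; the right trade-off depends on the site being gathered.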
More fundamental issues, such as identified limitations with the PANDAS software or improvements to our web interface, are recorded and will be fed into the formal evaluation process to be conducted during the winter of 2005.

The project has no qualms about acknowledging that it does not have all the answers, nor does it yet necessarily know all the questions. What the project has achieved, however, are the solid foundations of a system able to rise to these challenges, and the aggregation of people with the necessary skills and enthusiasm to embrace them.

Web archiving is not without its difficulties

It should be noted, however, that the success in realising the project's aims and objectives hides some of the very real difficulties and challenges presented by web archiving.

There were few applications from which UKWAC could choose when looking for a suitable web archiving application. PANDAS was the only application available at the time that offered an end-to-end archiving workflow in a managed environment in which material could be gathered and managed. Yet PANDAS is not an ideal system; the current version used by UKWAC is not standards based, i.e., it uses no descriptive/cataloguing standards or authoritative subject control. Its code base is difficult to modify and maintain, making enhancements and new features difficult to integrate. The distributed architecture implemented by UKWAC restricts Partners' access to key system log files and to the code base, access that would make the diagnosis of problems and other application issues an easier task. The result is a dependence upon the external service provider hosting PANDAS for what could be considered routine system management.

The Internet has proved to be a publishing medium that exhibits rapid development and change. The 'static' HTML pages of just a decade ago have been replaced by highly dynamic database driven environments.
PANDAS requires a skilled operator to successfully archive sites of this type. Whilst experience has a role to play here, PANDAS does not offer features that would make archiving these types of sites simpler and more automated.

Web archiving is a new business for the UK, and those who attempt it need new skills. This project has already highlighted the need for in-house technically skilled staff to work closely with the archiving team. Technical skills are required in the repair of 'broken' websites or failed gathers, further application development, development of new/additional system functionality, and the integration of the PANDAS system and data with existing library systems. The engagement of key UK institutions with the UKWAC project ensures that these lessons are not lost and can be applied to shape the future of UK web archiving.

Geographically dispersed Partners mean that there are fewer opportunities for shared knowledge and shared learning. Tips and tricks are shared using the UKWAC Wiki and mailing list, but it can be hard to write down what can easily be demonstrated in a few key clicks. For Partners who have a single staff member working on web archiving, a sense of isolation can be hard to overcome.

UKWAC builds on international work

The achievements of the project lie within the context of both past and current web archiving initiatives being conducted internationally. As already stated, the project infrastructure owes much to the pioneering work of the National Library of Australia and their PANDAS software. The selective and qualitative approach to web archiving at the heart of the PANDAS design fitted extremely well with the ethos behind the project, which itself was the product of two studies commissioned by two of the Consortium Partners prior to the project itself [2].

Some UKWAC staff came to the project with prior experience of web archiving gained in the UK, Australia and New Zealand.
This experience and the use of PANDAS allowed UKWAC to begin its work very quickly. The original developers of PANDAS had faced a number of obstacles and had overcome them in successive iterations of their application. UKWAC has been able to focus on how web archiving should be applied in the UK rather than on the time-consuming task of application specification and development.

Both the Consortium as a whole and individual Partners continue to foster close links with other web archiving initiatives and to exchange knowledge and experiences with them. UKWAC benefits from experience gained from The National Archives' UK Government Web Archive¹ project, done in association with the Internet Archive. Representatives from the British Library and the Wellcome Library have participated with the International Internet Preservation Consortium² in the development of specifications and requirements for the next generation of web archiving tools. UKWAC's contribution to the specifications and requirements has been drawn from Partners' past work experience.

The UKWAC archive provides clear practical benefits

The archive's web interface offers an easy-to-use means by which users can access both information about the project and the contents of the archive itself. Archived material can be retrieved by use of a search facility or by browsing a hierarchical classification scheme of topic headings. Both methods allow users to quickly and easily locate those sites of interest to them using methods familiar to them from popular sites such as Google and Yahoo. The archive uses the open source Lucene [4] search engine and is able to search the content of individual pages within the archive. Whilst only 'simple' search is currently available, plans are underway to develop more 'advanced' search features.

Access to the site and to the archive is free and open to all, and there is no charge for organisations who wish to have their material included.
The archive contains a growing collection of sites that have been selected for their cultural and social importance as well as their intellectual content. As such, the Consortium believes the archive is of value to a wide spectrum of users, from academics and subject specialists to 'lay users' and the simply curious. This 'broad church' approach is important not only for serving as wide a constituency as possible, but also for helping to increase popular awareness of the relevance of digital preservation to all members of society.

Co-operation between Partners has already achieved unexpected broad benefits. As a result of participating in the project, all Partner agencies support greater engagement with web-based materials at all levels within their organisations. UKWAC has drawn senior and professional staff together and exposed them to issues surrounding the complexity of web archiving. It has also joined technical IT staff with curatorial and archival staff, allowing each to apply their specialist skills whilst opening them to new opportunities for greater collaboration afforded by digital material and creating a new arena for those skills. This has highlighted the organisational and resourcing issues around web archiving, as well as the broader and more complex issues of digital curation. This raising of consciousness across the breadth and depth of Partner agencies can only be for the good of digital curation generally.

The different aims of Partners in building a web archive have not proved an issue. National institutions can collect material of national social and cultural importance whilst more focussed, smaller Partners can concentrate on highly selective and more specific material. Each Partner's collective activity contributes to a meaningful and coherent whole.

A key benefit of this project, and a difference from some others, is that UK institutions with relevant subject expertise undertake selective archiving of web sites within the UK.
Selective archiving ensures maximum functionality and an accurate and faithful reproduction of look-and-feel. The result is a high quality archive, robust and authoritative, whose content is selected along clear and agreed lines.

UKWAC provides long-lasting benefits to digital preservation

Clearly this project, as with all digital preservation initiatives, is still at a very early stage, and the true test of its success will only come in the years and decades that lie ahead.

UKWAC builds on international co-operation in its work and shares its own expertise back into the web archiving community. Key skills are being developed within the UK around the acquisition and management of web-based material, skills that will be tested when preservation intervention becomes necessary. By building international co-operation into UKWAC from the start, the complexities of digital preservation can be shared. This minimises the effort and resources required by any single institution and maximises the chances that web-based material will continue to be available into the future.

There are a number of projects currently underway that are developing 'next generation' web archiving tools. Leveraging the British Library's membership of the International Internet Preservation Consortium (IIPC) [5], members of UKWAC have worked with the IIPC on the specifications and requirements for the upcoming 'Curator Tool'. The ability to participate at this level has been a direct result of the lessons learned using PANDAS and of the practical experience of archiving material from the UK web space.

What UKWAC can say with assurance is that the project has laid down excellent foundations for ensuring the long-term survival of the archive and its contents. This is particularly evident in the deliberate separation between the existence of the Consortium and the technology currently used to gather, manage and present websites.
The project is not reliant on the PANDAS system, but rather provides the Consortium with the level of technical independence required for long-term digital preservation. If the decision should be taken in the future to move to another web archiving solution, this could be achieved relatively easily and with minimal risk to the existing archived objects. As such, it offers the best possible chance for the long-term survival of the websites it has decided to capture and maintain.

Conclusion

Web archiving is not an exact science. At the present time it remains something of a dark art. Despite the difficulties, the UKWAC project has been an important project for digital curation and for digital preservation in the UK. It has shown that selective web archiving can be done in the UK through a consortium. It has highlighted the fragility of web-based materials whilst offering a practicable and workable solution to address the issue of their preservation.

The project has also served to develop and increase the skills base within key UK institutions and has highlighted issues and complexities of web archiving. It has further served to illustrate those issues to key decision makers in all Partner agencies.

The archive is publicly available, continues to grow and is widely recognised both in the UK and overseas for its achievements and value. International co-operation and participation have ensured that UKWAC has a high profile in the international web archiving community and a high degree of credibility. By actively engaging with web archiving, international co-operation has taken very practical forms that result in both action and learning.

The archive is at <http://www.webarchive.org.uk/>.

Notes

1. The National Archives, UK Government Web Archive, <http://www.nationalarchives.gov.uk/preservation/webarchive/>. (Accessed 13 January 2006).

2. International Internet Preservation Consortium, <http://www.netpreserve.org>. (Accessed 13 January 2006).
References

[1] PANDAS (Pandora Digital Archiving System), <http://pandora.nla.gov.au/index.html>. (Accessed 11 January 2006).

[2] M. Day, Collecting & preserving the world wide web: A feasibility study undertaken for the JISC and The Wellcome Trust, February 2003, <http://www.jisc.ac.uk/uploaded_documents/archiving_feasibility.pdf>, and A. Charlesworth, Legal issues relating to the archiving of internet resources in the UK, EU, USA and Australia: A study undertaken for the JISC and The Wellcome Trust, February 2003, <http://www.jisc.ac.uk/uploaded_documents/archiving_legal.pdf>. (Accessed 11 January 2006).

[3] HTTrack, <http://www.httrack.com>. (Accessed 11 January 2006).

[4] Lucene, <http://lucene.apache.org/java/docs/>. (Accessed 11 January 2006).

[5] International Internet Preservation Consortium (IIPC), <http://netpreserve.org/>. (Accessed 11 January 2006).

Copyright © 2006 Steve Bailey and Dave Thompson


doi:10.1045/january2006-thompson