D-Lib Magazine

Colin Webb, David Pearson, Paul Koerbin
National Library of Australia

Point of contact for this article: David Pearson, dapearso@nla.gov.au

doi:10.1045/january2013-webb

Abstract

Clarifying preservation intentions is likely to be a good starting point for preservation planning for diverse digital collections. This applies both in terms of identifying what needs to be kept and what does not warrant the use of limited preservation resources, and in terms of opening up conversations about what is required in order to achieve preservation intentions. This paper describes an approach being explored by the National Library of Australia to negotiate formal and reviewable statements of 'preservation intent' for each of the digital collections in its care with those responsible for those collections. The paper looks at the relationship with the widely discussed concept of 'significant properties', as well as the other benefits that the approach is delivering. The paper also looks at the preservation intent statements for archived web collections at the NLA as an illustrative case.

Introduction

The other day, a young person asked a pointed question, 'Do you believe in significant properties, Grandpa?' The question was asked with a rather patronising smile commonly accompanying questions like, 'Do you believe in Santa Claus ... or fairies?'

I had to think for a few moments about how best to answer. 'Yes, I do, if by "believe in" you mean "think it is a useful concept likely to have practical relevance to digital preservation planning". But not if you mean "think it is the magic that will deliver all your wishes when you open your eyes". In fact, if you would like to sit down and listen, I will tell you what I think about it, and how the concept could be pragmatically applied—because that's what the National Library of Australia is exploring right now!'

Unsurprisingly, my young person decided he didn't have the time, and went off to laugh about silly old codgers who are stuck with impractical ideas that lead nowhere. I suspect he is probably still blogging on this fruitful subject.

Like most things to do with managing digital collections, effective ways of making preservation decisions are evolving. Let's be quite clear about what we mean by that statement: we (the digital preservation community) have no settled, agreed procedures for the full range of digital preservation challenges, nor even tentative plans we confidently expect to ensure adequate preservation for the next 100, 200 or 500 years. We don't know with certainty what we have to do now, let alone what will have to be done 20, 50 or 100 years in the future. So we are necessarily speculating and making up proposals for action.

Obviously there are some contexts in which this is less true. Some managers of digital information have several decades' experience of successfully looking after data collections and maintaining satisfactory access. However, we are talking about the library field and, to minimise the likelihood of offending anyone else, these comments will focus on the National Library of Australia (NLA). When it comes to preserving access to the digital collections that the NLA has corporately decided to collect, we don't know precisely what we will need to do. In fact, for some collections we barely even know the extent or detail of what we have and are collecting.

That's not quite the same as saying 'we don't know what we are doing'. We are looking for practical approaches that appear most likely to work, while recognising we are unlikely to foresee and forestall every problem. We are experienced experimenters familiar with operating speculatively.

We take the view that methods and solutions do not exist in isolation: consideration of them is meaningless without the context of what has to be achieved.

The latest manifestation of this approach to preservation planning is the development of what we at the NLA have labelled 'preservation intent statements', which we have been talking about since 2009 ¹. These statements record the organisation's intentions for specific classes of digital content expressed as a result of consultation between curators (or collection managers) and digital preservation specialists.

This concept is closely related to the idea of 'significant properties' that has been widely discussed in the preservation community for well over a decade. One source described significant properties as:

'... those aspects of the digital object which must be preserved over time in order for the digital object to remain accessible and meaningful. An institution with curatorial responsibility for digital objects cannot assert or demonstrate the continued authenticity of those objects over time, or across transformation processes, unless it can identify, measure, and declare the specific properties on which that authenticity depends. Nor can it undertake the preservation actions required to maintain access to those objects, unless it can characterise their current technical representations with sufficient detail'. ²

At least in part, the approach described in this paper is a response to the difficulties we and others have encountered in trying to apply the significant properties concept as a starting point for preservation planning ³.

We have come to a tentative conclusion that recognising and taking action to maintain significant properties will be critical, but that the concept can be more of a stumbling block than a starting block, at least in the context of our own institution. We believe reference to significant properties in preservation planning requires some prior consideration of both the purposes for which digital content has been collected and the purposes of providing preservation attention. In effect, we are asking how can we know what attributes of digital materials we need to preserve if we haven't articulated why we are preserving them?

The purpose of this paper is to explain the NLA's approach to developing agreed statements of preservation intent, as well as the apparent benefits, both generally and with specific reference to managing web archives. In doing so, the NLA is inviting dialogue with others with an interest in progressing practical preservation planning and exploring how such policy-oriented planning can be effectively systematised.

Experience suggests that stating the obvious opens a door which we can all see but to which none of us has yet been able to find the handle. Our preservation intent methodology looks like a very modest step because that's exactly what it is. But we believe it is both necessary and helpful for getting started in practical preservation planning.

Preservation Intentions and Preservation Intent Statements

The NLA's digital collections are made up of digital objects collected by all curatorial areas of the Library. These collections also include copies created by the NLA for various access or management purposes.

Like many other public institutions but perhaps unlike some other collectors of digital content, the NLA collects specifically for the purpose of providing access. While access to some content may be restricted for legal or other reasons, the Library's default collecting purpose is to support ongoing access to the nation's documentary heritage of information resources likely to have enduring value. Therefore, preservation issues and judgments usually include questions about how adequate access can be defined and how long access needs to be maintained.

The Library's preservation intent methodology is simply to engage collection curators in making explicit statements about which collection materials, and which copies of collection materials, need to remain accessible for an extended period, and which ones can be discarded when no longer in use or when access to them becomes troublesome. Curators are also asked to make broad statements clarifying what 'accessible' means by stating the priority elements that need to be re-presented in any future access for each kind of digital object type in their collections.

This is basically a call to take responsibility for deciding what happens to the collections. In the first instance, it is largely an invitation to curators to analyse the objectives of their collections and to identify why different parts of their collections were acquired. In practice, curators are being asked to consider the implications of their collection development policies and collecting decisions in terms of the realities of providing and maintaining access.

Importantly, the approach aims to accommodate the specific needs and characteristics of each collection, while looking for common ways of describing things so that patterns can be efficiently and effectively recognised and planned for. This approach accepts that users and values will almost certainly change over time. However, it assumes that useful action is more likely to arise from making the best estimates we can for the foreseeable future than it is from refusing to make any decisions about priority uses and values.

One difficulty of the approach is that those responsible for collecting often see things in terms of genres, workflows and intellectual entities, whereas collection management and preservation decisions typically deal with types of file formats and individual files. Consequently, the same file types may be perceived as having different roles and importance in different collections. While this is a challenge, an explicit aim of the approach is to bridge the distance between these different, and equally legitimate, conceptualisations of the collections.

All preservation programs, whether dealing with digital or non-digital collections, need to understand the objectives they are set up to serve. All preservation programs are confronted with the challenge of articulating how long collection objects need to be kept, and what properties are significant to their value and function and therefore need to be maintained. Digital preservation programs have highlighted the importance of these issues but they are just as real in traditional library preservation programs concerned with storing, protecting, repairing and copying other kinds of materials such as books, manuscript papers and pictures ⁴.

However, this is an especially pressing issue for digital collections because of their scale, the uncertain timeframes for taking action, the risk of deterioration not being apparent until it is too late, and the likely costs of recurring preservation treatment.

Significant Properties and Preservation Intent Statements

The NLA's Digital Preservation Program was an early and convinced advocate of a significant properties approach, recognising that preservation planning decisions about methods, quality assurance (QA) and accountability hinge on clear statements of what has to be achieved (even though there are some who argue that this is impossible) ⁵.

There are no evident orthodoxies or assumed outcomes to build upon. Curators who assume that a service provider—external or even internal—knows what 'preservation' means for diverse digital collection objects can expect to be disappointed. Responsible officers in the chain of planning, decision, action and evaluation need to educate themselves on the intentions, desired outcomes, challenges and possibilities, and then set about broadening and deepening the understanding of everyone else in the planning and management chain. In this context, we looked—and still look—to some concept of significant properties as a key to preservation planning.

However, the NLA experienced great difficulties in defining the significant properties of various kinds of digital objects when approached as an abstract, 'once and for all time' exercise, and eventually put the concept to one side until a more promising way of applying it could be found.

Initially, the approach of working with curators to identify their preservation intentions was intended to include discussion of the significant properties needing to be maintained. However, it quickly became clear that these two issues should be dealt with separately, for two main reasons:

Current Activity and Methodology

The NLA's Digital Preservation Program has recognised a need to develop a more comprehensive, sophisticated, accurate and useful understanding of what it has to manage. This began with a collection profiling exercise, which gathered broad information about general rates of collecting, what the digital collections contain, how they are organised and managed, and key issues for different collection areas ⁶. The exercise led naturally to the question of preservation intentions—how can they be teased out and how could they be expressed as formally agreed preservation intent statements?

The next step involved refining the types of questions which would be useful for the management and preservation of digital objects (see above). The various collection curators and a digital preservation specialist discussed and came to an agreement on how their collection could be described and characterised. This dialogue resulted in the drafting of 'plain language' statements in consultation with the NLA curators and collection managers. These were recorded on a wiki. Although it has taken some time to engage across the organisation, the NLA now has statements of preservation intent at some level of detail for all of its digital collections.

This activity is a key to preservation planning ⁷. Without it, we are left floundering between assumptions that every characteristic of every digital item has to be maintained forever (almost certainly an impossible expectation), and assumptions that it is good enough to store data safely and let future users worry about how to access it (almost certainly an inadequate response). The NLA believes that its responsibilities lie between these extremes, seeking as good an understanding as we can currently develop about what needs to be achieved, so we can start identifying priorities and plans for achieving preservation objectives.

In summary, this approach doesn't try to solve the deep, dark problems of digital preservation but it does give us some more information and some vital organisational engagement and recognition of shared responsibility to work with.

Benefits

The apparently modest nature of the preservation intent statements methodology belies some very significant potential benefits. In the NLA's case, a number of benefits are already emerging, including the following.

Preservation Intent Statements for Archived Web Collections

The NLA has developed preservation intent statements for all of its digital collections. However, web collections present some interesting challenges that warrant special examination because of the volume and complexity of content over the creation of which the preserving institution has no control. Moreover, preservation planning for web collections may be less well addressed globally than it is for many other kinds of digital collections. At the NLA, web collections are considered to be a high priority for preservation action, but are also likely to be among the most challenging.

Of these three collection types, for legal reasons only the PANDORA content is currently available for public access, although work is underway to make the '.gov.au' collection accessible.

Content in the NLA's web collections is obtained by harvesting copies from the web. However, the NLA's processes frequently generate archived copies which are not completely accurate representations of what was presented on the live web. In the case of our PANDORA collections, the software we use rewrites the code, so we have archived copies that are intrinsically different from the originating live web versions ⁹. On the other hand, in the NLA's experience of collections that were acquired using the Heritrix web crawler the results are more direct copies but the software often produces incomplete or inadequately rendered copies ¹⁰. Either way, we have had to face the fact that the archived copy is different in some degree from the entity on the live web.
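The kind of change introduced by link rewriting can be illustrated with a small sketch. This is not the code used by HTTrack or by the PANDORA post-processing program; the host name, the notice page name and the function names are all hypothetical, and only double-quoted `href` attributes are handled. It simply shows why a copy made functional for offline use cannot be byte-identical to the live page.

```python
import re
from urllib.parse import quote, urlsplit

# Hypothetical values, chosen only for illustration.
HARVESTED_HOST = "www.example.gov.au"
NOTICE_PAGE = "not-archived.html"

HREF_PATTERN = re.compile(r'(href=")([^"]*)(")')


def rewrite_link(url: str) -> str:
    """Rewrite one link the way an offline copy might need it."""
    parts = urlsplit(url)
    if parts.netloc == HARVESTED_HOST:
        # Same-site absolute link: make it relative to the archive root.
        return parts.path.lstrip("/") or "index.html"
    if parts.netloc:
        # External link: point at a notice page that carries the live address.
        return f"{NOTICE_PAGE}?live={quote(url, safe='')}"
    # Already relative: leave untouched.
    return url


def rewrite_page(html: str) -> str:
    """Apply rewrite_link to every double-quoted href attribute."""
    return HREF_PATTERN.sub(
        lambda m: m.group(1) + rewrite_link(m.group(2)) + m.group(3), html
    )


live_page = (
    '<a href="http://www.example.gov.au/reports/2004.html">Report</a> '
    '<a href="http://other.example.org/page">Elsewhere</a>'
)
archived_page = rewrite_page(live_page)
```

Comparing `archived_page` with `live_page` shows that both links have changed, even though the intellectual content of the page has not.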

This produces some obvious questions regarding which aspects (properties, attributes, characteristics) should be given priority for preservation, and how much change may be acceptable in an already modified representation.

A second challenge in determining preservation intentions for web collections is the combination of scale and diversity. The Library holds around five billion files or 184 terabytes of data and currently collects around one billion files or more than 40 terabytes of data each year ¹¹, so there is a huge amount of material to be considered and great diversity to be accommodated. This diversity refers to the content and the purposes for which it was collected; it also extends to the file formats and structures used. Within the PANDORA archive, for example, more than 300 file formats have been identified, while many formats have not yet been identified because of the limitations of format identification tools and the idiosyncratic ways in which many objects were originally created ¹². The scale and diversity of the existing web collections make it much more difficult to establish classes of materials than for more homogeneous collections such as the NLA's Pictures or Oral History collections.
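The holdings figures can be checked against the breakdown given in the accompanying note (domain harvests plus PANDORA). The short calculation below uses only numbers quoted in this paper; the variable names are ours, and the growth ratio is a rough lower bound because the annual intake is stated as 'more than 40 terabytes'.

```python
# Figures as quoted in the text and its accompanying note.
domain_harvest_files = 4_800_000_000
pandora_files = 160_000_000
domain_harvest_tb = 177
pandora_tb = 7
annual_intake_tb = 40  # "more than 40 terabytes", so a lower bound

total_files = domain_harvest_files + pandora_files  # close to five billion
total_tb = domain_harvest_tb + pandora_tb           # matches the 184 TB quoted
annual_growth_ratio = annual_intake_tb / total_tb   # a little over one fifth
```

On these figures the web collections grow each year by at least roughly a fifth of their current size, which is part of why class-level rather than item-level statements of intent are needed.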

A third complicating factor for preservation planning is the complexity of the individual objects being considered. Most web pages are compound or complex objects made up of numerous digital entities: HTML, executable programs, style sheets and so on. Determining preservation intentions for each of these and accounting for their individual contributions to the overall functioning and value of the 'whole composite' (if such a thing exists) takes us back into the heart of the significant properties problem we have already referred to. We have had to look for ways of making useful and meaningful statements of intentions that account for this kind of complexity without being debilitated by it.

Even for relatively simple web objects, those responsible for articulating preservation intentions and planning action need to consider requirements for maintaining bit-level objects, content, and context such as links, possibly to a greater extent than for most other kinds of collected digital materials.

A fourth factor expected to complicate preservation action, although it is perhaps less challenging for identifying preservation intentions, is the wide variance in tolerance of tools for browsing, collecting and rendering web content. As mentioned above, many of the harvested and archived web files are idiosyncratic, leading to unpredictable changes when preservation action such as migration is applied to large numbers of files. This will seriously complicate the translation of intentions and significant properties into consistent, automated action.

Finally, the use of large-scale container file formats such as WARC ¹³, which can hold many millions of files, invites preservation planning at the container level, potentially leaving the diversity of formats, intentions and properties unaddressed within the container.
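A minimal sketch of what 'looking inside the container' involves is given below. It is an illustration under stated assumptions rather than a description of NLA tooling: a small in-memory tar archive stands in for a harvested site (reading WARC files would need a third-party library), the file names are invented, and formats are guessed from file extensions only, which is far weaker than the signature-based identification a real format analysis would use.

```python
import io
import mimetypes
import tarfile
from collections import Counter


def build_sample_container() -> bytes:
    """Create a tiny in-memory tar archive standing in for a harvested site."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in [
            ("site/index.html", b"<html></html>"),
            ("site/about.html", b"<html></html>"),
            ("site/css/style.css", b"body {}"),
            ("site/img/logo.png", b"\x89PNG"),
            ("site/files/legacy_object", b"\x00\x01"),  # no extension
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def tally_formats(container: bytes) -> Counter:
    """Count the files in the container by guessed media type."""
    counts = Counter()
    with tarfile.open(fileobj=io.BytesIO(container), mode="r") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            guessed, _ = mimetypes.guess_type(member.name)
            counts[guessed or "unidentified"] += 1
    return counts


format_counts = tally_formats(build_sample_container())
```

Even in this toy example one object ends up in the 'unidentified' bucket, so a plan that treats the container as a single object would say nothing about it.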

Facing these challenges, the NLA's approach to identifying and articulating its preservation intentions for its archived web collections is described in the following section. For the purposes of this paper, we are going to consider only the PANDORA archive, as we have more information about it than our other web collections and it contains our earliest content. The archive's cultural significance is reflected in its inclusion in 2004 on the UNESCO Australian Memory of the World Register ¹⁴.

Summary Preservation Intent Statement for the NLA's Selective Web Collection

What follows is the example of the Preservation Intent Statement for the NLA's Selective Web Collection.

"The NLA's selective web harvesting activity currently consists of the PANDORA Archive, which contains a selective collection of web publications and websites relating to Australia and Australians. PANDORA was established by the NLA in 1996 and contains historical online materials harvested from 1996 to the current period. Online materials, ranging from discrete publications to complete websites, are selected for inclusion in the collection with the purpose of providing long-term and persistent access to them.

The Web Archiving Section (the relevant collecting and curatorial agent) intends that:

All PANDORA digital preservation masters, including all associated metadata (currently known as the 'preservation master,' the 'display master' and the 'metadata master'), should be retained in perpetuity. All technical properties should be maintained to the fullest extent possible.
Content, connections and context are of primary importance. How it is ultimately presented to a user is a secondary consideration.
The original harvested copy, that is the 'preservation master' that represents the initial and untouched collection of files gathered by the harvest robot, is of less importance than the 'display master' which includes the results of quality assurance and curatorial work. However, as the impact of curatorial and quality assurance work upon harvested instances may not be known, the 'preservation master' (although a lesser version in terms of completeness) should be retained at least at the bit level ¹⁵.
The 'derivative copy' of the 'display master', which is created for display and access, should be maintained only for as long as it is useful. A new derivative version may be generated according to future access requirements.

The NLA understands that web archives pose several problems.

The NLA has no control over the creation of the original content and its format, standards or quality. The NLA can only currently harvest what is delivered in a single published form through a browser/server request (for example, the original data in the publisher's databases are not collected).
Current methods for collecting and rendering are also not ideal in ensuring the complete capture of all files or retaining full functionality in the version delivered from the archive.
We are only taking time specific and time limited snapshots of web content.

Therefore, the NLA accepts that what is to be preserved is not a mirror representation of the web nor of a website but, rather, a snapshot of content that was once arranged and published as a website, with only limited functionality of the original. The archived artefact is formed out of the collecting process which is inevitably lossy. Our aim is to define and control this loss. In addition, the way in which the content is collected and displayed places a significant limitation on the presentation of the archived artefact as an authentic record of the publisher's original data or of the version of that data originally published on the web.

The NLA's intention is long-term access for all users. However, over time access to certain content may be available only on-site due to technical constraints in supporting remote access to all possible operating environments.

The harvested web content, being composed of complex objects, is contained in either a compressed package (tarball) or a container file (WARC file). While the tarball retains the directory structure of the original harvested website, the WARC file may contain random collections of files plus metadata, which are managed and located by indexes.

The PANDORA collection, having begun in 1996, includes content collected through various methods with a growing legacy of inconsistency regarding Uniform Resource Identifiers (URIs), metadata and quality assurance interventions. Processes are underway to move the content to a consistent archival format (WARC), although the underlying legacy variations may not necessarily be removed in this process.

The PANDORA collection can be broadly categorised as consisting of about 75% text, 20% images (JPEG, GIF, PNG) and 5% multimedia and style elements (JavaScript, CSS files, Flash and so on), including linkages ¹⁶. Because of the variable nature of the collected entities (understood as the PANDORA 'title' and its archived 'instances') ranging from simple documents to complex multi-file objects, there are some parts of the collection where style elements are more important, and some parts where this is less so. Style elements are problematic from the outset since they are sometimes difficult to harvest and often remain impossible to render. Because content is harvested through a browser-type request on a server, in many cases only a subset of possible style element files is delivered (those required for the browser request). Moreover, harvesters are not able to thoroughly parse complex JavaScript, which may also result in the collecting process not identifying and missing many style elements (JS, XML and so on).

Contemporary browsers are fairly tolerant when accessing both current and legacy web content. However, the variability of this content (collected from 1996 to the present day) and factors such as its being poorly formed (no standards) can mean that viewing content as it was at the time of creation is problematic.

The status of the visual accuracy of the harvested copy of a site has not been systematically documented (although implied in QA workflows). Thus, the look of the original may only be surmised from the content collected, the context of embedded links, tags and file types and the context of technologies known to exist at the time of harvesting. The NLA's objective is not to misrepresent the material in any way that would compromise its legal warrant to collect, preserve and make accessible the archival content. Thus particular care in retaining the integrity of the intellectual content including embedded links and domain-related image material is a priority.

This collection is currently in a state of transition in how it is stored, described and understood via technical metadata."

Practical Application

Discussion of preservation intent is not a theoretical exercise. A number of file types in the NLA's web archive are already known to present access problems, given the way the content is collected and managed in the archive. There are several problematic formats including:

This means that the NLA is facing a number of practical preservation use cases where we need to decide whether to take action, what kind of action to attempt and how to evaluate the outputs of action. Our current Preservation Intent Statements tell us if we want to keep the content accessible for an extended time period; if we want to view the content; and whether to attach importance to other access modalities such as the ability to manipulate content.

Put simply, such a preservation intent classification may lead us to assess the following action options:

The NLA is currently engaged in a large-scale effort to build workflows and infrastructure that will support a range of preservation management requirements for all of its digital collections through a multi-year Digital Library Infrastructure Replacement project. Preservation intent commitments, along with ongoing risk monitoring and other factors to be recorded in knowledge-bases, will drive workflows in the delivered preservation management systems.

While there are potential challenges in coding intent information for programmatic interrogation and use, within the next few years the NLA expects to have operational systems in place using its preservation intent information.
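One possible way of coding intent information for programmatic use is sketched below. It is an assumption-laden illustration, not the design of the NLA's systems: the class names, field names and the `needs_access_action` rule are hypothetical, and the mapping is only our reading of the commitments in the Selective Web Collection statement above (masters retained in perpetuity, the 'preservation master' at least at the bit level, the 'derivative copy' kept only while useful).

```python
from dataclasses import dataclass
from enum import Enum


class Retention(Enum):
    PERPETUITY = "retain in perpetuity"
    WHILE_USEFUL = "retain only while useful"


class Level(Enum):
    FULL = "maintain technical properties as fully as possible"
    BIT_LEVEL = "maintain at least at the bit level"
    REGENERABLE = "may be regenerated from a master"


@dataclass(frozen=True)
class CopyIntent:
    """One line of a preservation intent statement, in coded form."""
    copy_role: str
    retention: Retention
    level: Level


# Hypothetical encoding of the PANDORA statement; an interpretation only.
PANDORA_INTENT = {
    intent.copy_role: intent
    for intent in [
        CopyIntent("preservation master", Retention.PERPETUITY, Level.BIT_LEVEL),
        CopyIntent("display master", Retention.PERPETUITY, Level.FULL),
        CopyIntent("metadata master", Retention.PERPETUITY, Level.FULL),
        CopyIntent("derivative copy", Retention.WHILE_USEFUL, Level.REGENERABLE),
    ]
}


def needs_access_action(copy_role: str, access_at_risk: bool) -> bool:
    """True if a threat to access calls for more than bit-level storage."""
    intent = PANDORA_INTENT[copy_role]
    return (
        access_at_risk
        and intent.retention is Retention.PERPETUITY
        and intent.level is Level.FULL
    )
```

Under this encoding, a risk to the 'display master' would be flagged for preservation action, whereas the same risk to the 'preservation master' or a 'derivative copy' would not; a workflow could query such records rather than re-reading the plain language statements.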

Conclusion

The approach described in this paper is based on a conviction that methods and solutions in digital preservation do not exist in a policy vacuum. Rather, they are only meaningfully discussed as solutions to problems that threaten or frustrate the organisation's explicit access intentions.

This approach still refers to significant properties as a critical part of the preservation planning process. We believe it puts the full identification and evaluation of significant properties into a more useful context: that is, being later in the decision-making process.

At the very least, we aim to be in a position to know whether preservation action will be needed and whether a bit-level preservation copy will be good enough when access to content is lost.

As with most preservation processes, evaluating and articulating preservation intentions is likely to be an ongoing process requiring proactive management and periodic review. Our current conception of it still requires further development but it has given the NLA a useful start in real preservation planning.

Without some kind of understanding between curators and preservers, we are doomed to recurring nasty surprises based on mismatched expectations.

Acknowledgement

The authors wish to thank Libor Coufal and Tina Mattei from the NLA for their suggestions and support. We also wish to thank Lynn Benson (NLNZ), James Doig (NAA) and Cal Lee (UNC) who provided helpful comments. We also appreciate the encouragement provided by Gina Jones (LoC), Andrea Goethals (Harvard Library), Clément Oury (BnF), Barbara Sierman (KB) and Tobias Steinke (DNB), all co-members of the IIPC PWG (International Internet Preservation Consortium, Preservation Working Group).

Notes

⁴ Virtually all preservation treatment decisions—such as deciding whether and how to rebind books, bleach stains on paper, deacidify paper artefacts and so on—involve or should involve some consideration of what gives the object value to users, so that this can be protected and preserved to the full extent possible, while minimising the extent to which the value of the object is jeopardised or compromised by preservation treatment. Similar principles apply to copying treatments such as microfilming and digitisation. Judgment about the adequacy of a particular copy as a preservation surrogate depends largely on the extent to which the valued components of the original have been captured and re-presented. Without any agreed sense of priorities regarding what is of value, preservation action decisions almost inevitably lead to conflict, as happened when library microfilming programs in some cases destroyed newspapers that had been—at least in the judgment of some users—inadequately represented by microfilm copies.

⁶ Elford, D. (2009) Collection Profiling Project. Internal National Library of Australia paper (unpublished).

⁷ See Becker, C., Kulovits, H. and Rauber, A. (2010), 'Trustworthy Preservation Planning with Plato'. ERCIM (European Research Consortium for Informatics and Mathematics) News, Vol.80, January 2010, pp. 24-25.

⁹ Content for the PANDORA Archive has been collected using three harvesting tools since the program was established in 1996. At first the Harvest indexing software was used and then for a brief period the offline browser WebZIP was also used. Most of the content since the late 1990s, however, has been collected using HTTrack. As an offline browser, HTTrack rewrites links as part of the harvesting process so that they function as relative links in the context of the Archive. The software also changes URLs that include dynamic database calls into a static HTML form that is functional in an 'offline' context. This means that archive URLs may have a different file name, URL structure or file extension than the original URI. Moreover, some quality assurance work can involve curators making some changes to code within the page in order for the archived instance to be as complete and functional as can be reasonably achieved. For example, to make FLV format videos work, some customised 'embed' code is added to replace the original script so that the video file is delivered within the context of the archived page. Finally, a post-processing program written for PANDORA rewrites external links so that they deliver a message indicating that an external linked page has not been archived, also providing a link to the live page address.

¹⁰ Heritrix does not permit the same degree of 'hand-crafted' quality assurance work as is undertaken for the PANDORA collection since the content is stored in WARC packages rather than as static HTML files in a file system structure. Consequently there is a far greater reliance on and acceptance of what Heritrix is able to collect and also on what the archival delivery system (Wayback) is able to render. While some quality assurance and 'patching work' (identifying and adding missing files) is undertaken on the Heritrix crawls for NLA collections (i.e. whole .au domain crawls and gov.au seed list crawls), it does not receive the same degree of scrutiny as under the PANDORA model, not least because Heritrix crawls are conducted for much larger-scale collections, whereas the discrete site harvesting for PANDORA lends itself to more quality assurance scrutiny and manual intervention.

¹¹ The figures quoted include the Australian domain harvests of 4.8 billion files or 177 terabytes of data and the PANDORA collection of around 160 million files or seven terabytes of data. It should also be noted that the figures for PANDORA are only indicative, being based on the derived display copy of the archival content; when the various master copies are also counted, the actual PANDORA figures may perhaps be multiplied threefold.

¹² Elford, D. (2011) PANDORA File Format Analysis. Internal National Library of Australia paper (unpublished). Elford's paper identifies idiosyncrasies such as duplicate filepaths and filenames, inconsistent mime types and extensions, formats apparently non-compliant with standards, etc.

¹⁵ The potential for confusion with the NLA's current descriptors for digital copies should be evident from this discussion.

¹⁶ An analysis of the first three Australian domain harvests (2005-2007) established these percentages of broadly aggregated mime types; see Koerbin, P. (2008), 'The Australian Web Domain Harvests: A Preliminary Quantitative Analysis of the Archive Data'. As PANDORA, although selective, may be understood as representing a sub-set of the larger domain harvest collections, the same broad percentage breakdown has been extrapolated to the PANDORA collection for the purpose of this paper.