D-Lib Magazine
September 1999

Volume 5 Number 9

ISSN 1082-9873

Canonicalization: A Fundamental Tool to Facilitate Preservation and Management of Digital Information

Clifford Lynch
Executive Director
Coalition for Networked Information
cliff@cni.org

Introduction

One major school of thought concerning the preservation of digital objects has moved away from preserving the physical artifacts that may temporarily house information to focus on the preservation of the actual objects in disembodied digital form. Such thinking assumes that with proper management, a collection of bits can be carried indefinitely into the future. However, the software necessary to interpret the set of sequences of bits that comprise a digital object in its original format may cease to be available as time goes on and standards change. Periodically, it will be necessary to translate the set of bits from an older format to a newer one. The imprecise, open-ended, and ad hoc nature of this format conversion may be a major barrier to long-term preservation of digital information because costs and practices are so poorly defined and unpredictable. In particular, there are continual, imprecisely framed questions about how the capabilities of representations (formats) interact with the meaning of an object, and how translations from one format to another may alter that meaning, and thus damage that object's integrity. [1]

Within an archive, metadata accompanies and makes reference to each digital object and provides associated descriptive, structural, administrative, rights management, and other kinds of information. This metadata will also be maintained and will be migrated from format to format and standard to standard, independently of the base object it describes. But some of the metadata associated with an object is computationally bound to the specific object representation (for example, through a digital signature), creating another problem when objects and their metadata migrate asynchronously.

Key processes involved in the management of digital objects over time include the tracking of authenticity as part of provenance; maintaining the integrity of the digital object and ensuring the referential integrity of links to that object (from other objects or from metadata records); and understanding how reformatting impacts the integrity of the object. These involve both the digital object and its metadata.

This paper suggests that canonical formats and canonicalization algorithms (that is, algorithms that compute canonical representations) for various types of digital objects will help support all of these processes. In particular, it provides a language or framework for understanding the effects of format translation. I begin by looking at how these processes work without canonicalization, and examining how they break down as reformatting takes place in the management of objects over time. I then show how canonical representations and canonicalization algorithms can address these problems. Finally, the paper discusses the specifics of canonicalization for image objects as a case study.

The Ongoing Life of an Archived Digital Object

A digital object enters a repository as a set of sequences of bits; it is accompanied by a variety of metadata related to that object. With proper storage management, replication, and refreshing, this set of sequences of bits can be maintained indefinitely.

In order to provide verification of the origin of the digital object, digital signature techniques may link it to a public key that is recorded in a public key infrastructure (PKI). Typically, at the point of deposit a hash of the object would be computed using an algorithm such as MD-5 or SHA-1, and that hash would then be signed using the private key of a public/private key pair. The public key of the pair would be bound to an identity recorded in a certificate issued by a trusted certificate authority. This signed hash (which would include an identification of the hash algorithm used, a timestamp, and the certificate of the signatory) would form part of the provenance metadata for the object. At any point, the provenance could be verified by checking the certificate against the PKI, and then checking the signed hash against a new hash computed from the object itself, using the public key from the certificate.
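
As a concrete illustration, the sketch below (in Python, using the third-party cryptography package; neither the language nor the package is specified in this article, and the key and object contents are hypothetical) walks through the deposit-time signing and later verification steps just described.

    import hashlib

    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding, rsa, utils

    # Hypothetical depositor key pair; in practice the public key would be bound
    # to an identity by a certificate issued by a trusted certificate authority.
    depositor_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

    digital_object = b"...the set of sequences of bits deposited in the repository..."

    # At deposit: compute a SHA-1 hash of the object and sign it with the private key.
    digest = hashlib.sha1(digital_object).digest()
    signature = depositor_key.sign(
        digest, padding.PKCS1v15(), utils.Prehashed(hashes.SHA1()))

    # At any later point: recompute the hash from the stored object and check the
    # signature using the public key taken from the depositor's certificate.
    depositor_key.public_key().verify(
        signature, hashlib.sha1(digital_object).digest(),
        padding.PKCS1v15(), utils.Prehashed(hashes.SHA1()))  # raises InvalidSignature on mismatch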

This is satisfactory until the object has to be reformatted. Once format translation occurs, the object is represented by a different set of sequences of bits, and the signature no longer verifies. There are several unattractive alternatives. The original author can re-sign the reformatted object, but that author may no longer be available to generate a new signature, or may be unwilling to judge the suitability of the reformatted version of the object.

If the original author (who originally signed the object and owns the certificate) is not involved, the repository must begin to establish a chain of provenance (and, potentially, a chain of liability) for the object. There are several possibilities. The repository can sign the reformatted object, and also create and sign a new piece of provenance metadata that links the original author’s certificate with the reformatted object (in effect, certifying that the original author attested to authorship of a predecessor to the reformatted document). We now have only the repository’s testimony that the original author once made a verifiable authorship claim. Or, we can maintain the original signed authorship testimony and the original, unreformatted object permanently, even though the original can no longer be interpreted for access purposes and serves only to verify the original author’s authorship claim. The original testimony can be supplemented with a metadata certification, signed by the repository, that the original object is a predecessor to the new reformatted object (a pair of hashes for the old and new versions of the document, signed by the repository’s certificate).
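
A minimal sketch of that last option, under assumptions the article does not specify (SHA-1 hashes, a JSON serialization of the record, and an RSA repository key; all names are illustrative): the repository produces and signs a record pairing the hashes of the old and new versions of the object.

    import hashlib
    import json

    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding, rsa

    # Hypothetical repository key; its public half would be certified separately.
    repository_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

    def predecessor_certification(original: bytes, reformatted: bytes) -> dict:
        """Provenance metadata linking a reformatted object to its predecessor."""
        record = {
            "original_sha1": hashlib.sha1(original).hexdigest(),
            "reformatted_sha1": hashlib.sha1(reformatted).hexdigest(),
        }
        payload = json.dumps(record, sort_keys=True).encode("utf-8")
        record["repository_signature"] = repository_key.sign(
            payload, padding.PKCS1v15(), hashes.SHA1()).hex()
        return record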

It will be common to use a hash in a metadata record or a link in order to ensure referential integrity. Once the object is reformatted, all of these hashes need to be recomputed and, where appropriate, links or metadata objects need to be re-signed since the hash has changed. Similar questions arise about who signs the new links.

Finally, there is always the question of how much information is lost or corrupted in the course of reformatting. We want to be able to guarantee that for a given object the reformatted version is equivalent to the original version with regard to some specific set of object characteristics. While it would be best to prove general properties of the reformatting process (i.e., reformatting from format X to Y always preserves a given set of characteristics), such a guarantee may be very complex or impossible to verify, particularly when the reformatting is done by proprietary software (for example, Microsoft Word Version N can read documents from version N-k and resave them in the format normally used by version N). In these cases, it would be useful to have some confidence about the effects of format translation on specific individual documents -- in other words, to be able to verify that for a given document the relevant set of characteristics remains invariant when it moves from the old to the new format.

The Role of Canonical Formats and Canonicalization

Assume that we can define a canonical form for a class of digital objects that, to some extent, captures the essential characteristics of that type of object in a highly determined fashion. This form may be quite bulky and not necessarily reasonable for storing, transmitting, or manipulating objects. It’s an idealized form of the object, without regard to efficiencies. In addition, specific representations of the object may be richer than the canonical form. There may be a hierarchy of canonical forms, some of which are capable of representing much more detail or richer semantics than others (for example, ASCII text, Rich Text Format, and Word 98 format might be one such hierarchy). All the actual formats used for storing an object of a given type must be translatable to the canonical form. It is also critical that, while there may be multiple ways to represent a specific object of a given type in a working storage format, all of them should translate to the identical bit stream in the canonical form. Popular working storage formats may include lists of parameters that are not order-dependent (for example, keyword-value pairs). A canonical form must enforce a specific ordering on these parameters. Similarly, a given storage format may incorporate default values for unspecified parameters. The canonical form should be identical whether the default values are implied by omitting those parameters or supplied by specifying the parameters in question explicitly.
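
A toy sketch of these two requirements (the parameter names, defaults, and discarded fields below are invented for illustration): whichever way the working format spells an object out, canonicalization makes the defaults explicit, drops format-only baggage, and emits the parameters in one fixed order.

    # Hypothetical defaults and fields; not drawn from any real storage format.
    DEFAULTS = {"byte_order": "big-endian", "compression": "none"}
    FORMAT_ONLY_FIELDS = {"creator_application", "last_saved"}

    def canonicalize(parameters: dict) -> bytes:
        merged = {**DEFAULTS, **parameters}                  # defaults made explicit
        merged = {k: v for k, v in merged.items()
                  if k not in FORMAT_ONLY_FIELDS}            # irrelevant fields dropped
        lines = [f"{key}={merged[key]}" for key in sorted(merged)]  # fixed ordering
        return "\n".join(lines).encode("utf-8")

    # Two different spellings of the same object yield the identical bit stream.
    assert canonicalize({"width": 640, "height": 480}) == canonicalize(
        {"height": 480, "compression": "none", "width": 640,
         "creator_application": "SomeEditor 2.1"})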

There is a long history of buggy software that includes irrelevant data with a stored version of a digital object. The most common cause of this is the inclusion of uninitialized fields in the stored form of the object, which just pick up whatever data happens to be in memory at the time. Recently, Microsoft issued a set of updates to Word 98 because this program was including random information as part of Word documents, creating a serious security exposure. But from the archival perspective, this problem meant that if you saved the identical document from Microsoft Word on two different occasions, you would produce two different sets of sequences of bits. Canonical formats can provide a method of compensating for these kinds of software problems. This irrelevant data isn’t part of the canonical format, and would be removed during canonicalization.

Because most storage formats are intended to be used stand-alone, they usually include some minimal level of fields for mandatory and/or optional metadata associated with the object. (In an archival environment, there is extensive, externally maintained metadata associated with each stored object -- external if for no other reason than it is more extensive than any existing or likely storage format can carry internally.) These fields are another source of problems and can be omitted from the canonical format, and thus discarded as part of canonicalization.

Many different representations of an object in a given storage format may produce the same canonical form. And translating back from the canonical form to a given storage format may not produce a bit-equivalent object.

It is worth noting that some current standards are creating a very complex environment for canonicalization. For example, UNICODE, which is the underlying character set for a growing number of current storage standards, allows multiple bit streams to represent the same stream of logical characters. Any canonicalization of objects that include data represented in UNICODE must enforce a standard choice of encoding on that data. [2, 3]
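
For instance, Python's standard unicodedata module implements the Unicode normalization forms of [3]; the sketch below shows two different bit streams for the same logical text collapsing to a single canonical encoding.

    import unicodedata

    composed = "\u00e9"        # the single precomposed character e-acute
    decomposed = "e\u0301"     # 'e' followed by a combining acute accent

    assert composed != decomposed                         # different bit streams
    assert (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))  # one canonical encoding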

Given a canonicalization function C and a digital object D, we can deal with authenticity by asking a depositor to sign C(D) rather than D itself. Then the signature remains valid for any reformatting that is invariant under the canonicalization C. Hashes (checksums) used for referential integrity can similarly be computed over C(D) rather than over D itself, and they will remain valid across reformatting. Finally, we can verify the effects of a reformatting program R by testing that HASH(C(D)) is equivalent to HASH(C(R(D))) for an appropriate hash function such as SHA-1. This verification step allows testing for any corruption or loss of information introduced by a reformatting algorithm or program on an object-by-object basis, while filtering out irrelevant reformatting effects (such as simply rearranging fields, introducing new lossless compression algorithms, or changing the ordering of order-insensitive parameters).
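
In outline, the object-by-object verification step might look like the following sketch, where the reformat and canonicalize functions stand in for whatever R and C actually are.

    import hashlib
    from typing import Callable

    def reformatting_preserves_integrity(obj: bytes,
                                         reformat: Callable[[bytes], bytes],
                                         canonicalize: Callable[[bytes], bytes]) -> bool:
        """True if the reformatting is invariant with respect to the canonicalization."""
        before = hashlib.sha1(canonicalize(obj)).digest()            # HASH(C(D))
        after = hashlib.sha1(canonicalize(reformat(obj))).digest()   # HASH(C(R(D)))
        return before == after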

Thus, as objects in digital archives are reformatted, it is possible to algorithmically verify, object by object, that the reformatting process preserves object integrity with respect to a given canonicalization process. Conversely, such a test can be used to identify problem objects for manual review and evaluation as part of a reformatting effort, or simply to note algorithmically that integrity damage occurred during the reformatting.

Canonicalization also helps to make precise what is important about a class of objects. For example, comments are ignored in most programming languages and in most markup schemes. Yet a document with comments is a different set of bits than a document with the comments deleted. The definition of the canonical form will make it very clear whether the comments matter, and this will be represented in authenticity and referential integrity metadata for the object, since that metadata will include the name of the canonicalization algorithm. By signing a canonical form (including an identification of the canonicalization algorithm being used), authors can make implied statements about what they view as the minimal, essential version of the work in question. As a byproduct, canonicalization helps define the space available for steganographic constructs such as watermarks that can be used within a given storage or interchange format without damaging the integrity of the object. This is precisely the set of bits that are irrelevant in translation to the canonical form.

Finally, canonicalization is valuable in notary or registry relationships. In order to register an object with a notary or registry (for example, a copyright office registry or a digital notarization service), the typical transaction is to pass a hash of the object to the service, which then signs the hash and returns the signed hash to the registrant as a receipt. The service usually also saves the hash in its own database, and may use techniques such as a published hash tree to provide more public evidence of registration of the object. By taking the hash over the canonical form of the object, the registrant can maintain some degree of insulation from the format changes that will inevitably be necessary over time in maintaining the object. Because the registry relationship persists over very long time periods (for example, copyright endures for the life of the author plus 70 years), this insulation is important.

Image Data as a Case Study for Canonicalization

Image data is logically fairly simple, and thus has straightforward canonical forms. One might think of defining three of these: monochrome images (which are an array of X-by-Y bits); grayscale images (which are an array of X-by-Y-by-Z bits, assuming that the levels of grayscale are linear across the Z bits); and color images (which are again an array of X-by-Y-by-Z bits, plus some information about the way that color is represented in the Z bits that make up each pixel). The canonical forms will need to include the array dimensions and specify a specific order in which the bits of the array are arranged. There are many ways in which an image can be stored: by rows or by columns; in planes; compressed or uncompressed. Different formats in use today make different choices. All of them can be mapped to the relevant canonical format. As long as the canonical form is used, integrity and authenticity can be managed independently of the idiosyncrasies of specific representations and their choices about how to store the image. As long as the compression algorithms are lossless, it makes no difference which one is used, or whether such an algorithm is used at all. The canonical format is independent of any compression algorithm. However, the introduction of lossy compression into a particular storage format will almost certainly produce a different canonical representation of an object.
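
As a sketch of what a canonical grayscale form might look like (the header layout and eight-bit pixel depth here are invented for illustration): the array dimensions and bit depth go into a fixed header, and the pixel values follow in one prescribed row-major order, regardless of how the source format stored or compressed them.

    import struct

    def canonical_grayscale(pixels: list[list[int]], bits_per_pixel: int = 8) -> bytes:
        """Hypothetical canonical form: dimensions and bit depth, then row-major pixels."""
        height = len(pixels)
        width = len(pixels[0]) if height else 0
        header = struct.pack(">III", width, height, bits_per_pixel)  # fixed header layout
        body = bytes(value for row in pixels for value in row)       # one prescribed pixel order
        return header + body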

Some image storage formats such as TIFF include header fields that can contain metadata (which will typically be represented and maintained externally in order to help with format independence and metadata management in an archival setting). These fields would not be part of the canonical form, and are discarded when the storage format is translated to canonical representation.

Defining canonical formats for images should be relatively straightforward. The only major open issues are how to represent the color space for color images and the questions involved in lossless mapping (isomorphisms) from one color space to another (particularly in the presence of digital quantization). These issues will be critical in the design of canonicalization algorithms and in the making and documentation of a set of arbitrary choices for actual representations. Making these choices should be considerably less controversial here than in standardization settings, because processing efficiency and the ability of the existing base of software and tools to understand the canonical format aren’t important.

Conclusions and Open Questions

It seems clear that canonical forms for image data would not be difficult to standardize, and that they are likely to be quite useful. The question of how to integrate color management into a canonical format for color images still needs to be explored by experts in that area.

Canonicalizations for other types of digital objects that have less clear formal models would seem to be a likely near term research area. For example, is it reasonable to think about an RTF-based or ASCII-based canonicalization for certain types of documents, or even about a hierarchy of such canonicalizations, with algorithms higher up in the hierarchy capturing more of the document’s intrinsic meaning? This is likely to be difficult. The canonical form of a document has been a subject of extensive debate and is highly contextual to the type of document in question. Are typographic characteristics (e.g., layout, fonts, typeface, and white space) significant? As documents incorporate multiple character sets and perhaps mathematical expressions, matters become ever more complex. Is it useful to introduce the notion of impermissible canonicalization as a way of rigorously identifying certain subsets of documents (for example, only those that do not contain mathematical formulae can be canonicalized in ASCII or RTF)? Canonicalization algorithms at least offer a way of talking about the "essence" of a document.

Is it reasonable to think about canonicalization for sound files at some specific level of sampling resolution, and to manage them relative to that level of resolution? This seems promising, and a similar approach may be relevant for representations of video.

Considerable work has already been done on the canonicalization of structured XML objects [4], and there is an ongoing effort involving the use of canonicalization in support of digital signatures for XML documents under the joint auspices of the World Wide Web Consortium and the Internet Engineering Task Force [5], though the motivations underlying this work are different from those under discussion here. Other structured formats for information interchange are also likely to be amenable to canonicalization.

Finally, the use of a canonicalization approach offers insights that may be useful in efforts to standardize, or at least to develop requirements as a preliminary to standardization, for provenance, authenticity, and integrity metadata and the practices that archives might use to manage such metadata over time.

Acknowledgements

I first started thinking about canonicalization in the context of an IETF BOF and a World Wide Web Consortium workshop I attended in Spring 1999 on digital signatures for XML objects [6]. The approach proposed at these meetings for signing XML documents included canonicalization algorithms as part of the signature process, and the presentations at that meeting (and subsequent mailing list discussions) were extremely illuminating. Discussions at a Spring 1999 Image Metadata Workshop hosted by the National Information Standards Organization and the Digital Library Federation [7] helped me to clarify the implications for images and image metadata. I would like to thank Bill Arms, Don Waters, John Kunze, and Cecilia Preston for their comments on an earlier draft of this paper.

References

[1] Preserving Digital Information: Final Report and Recommendations, May 20, 1996. See <http://www.rlg.org/ArchTF/index.html>.

[2] Character Models for the World Wide Web, <http://www.w3.org/TR/WD-charmod>.

[3] Unicode Technical Report #15: Unicode Normalization Forms, <http://www.unicode.org/unicode/reports/tr15/>.

[4] See the work of the World Wide Web Consortium on XML Canonicalization, for example <http://www.w3.org/TR/NOTE-xml-canonical-req>.

[5] See the work of the joint World Wide Web Consortium and Internet Engineering Task Force working group on XML Signatures, <http://www.w3.org/Signature>.

[6] See <http://www.w3.org/Dsig/signed-MXL99/Overview.html>.

[7] See <http://www.niso.org/image.html>.

Copyright © 1999 Clifford Lynch


DOI: 10.1045/september99-lynch