I (Leonard Rosenthol) started a discussion with KaiHendry on Twitter about the merits of PDF/A vs. HTML for the purposes of document archiving. The first thing I did was put forth a set of criteria, as used by the PDF/A committee when it first began meeting, to determine whether a format is usable for "long term archival storage". Here is that list:
I would also add to this list one additional item that I believe is a given, but I want to put it down anyway:
Before I continue, I should state up front that I am somewhat biased in this discussion being the PDF Standards Architect for Adobe Systems as well as the ISO Project Leader for PDF/A. I believe, however, that what follows is complete and accurate and I welcome any/all feedback!
Clearly, PDF/A addresses all seven issues - or it wouldn't have been developed as an "Electronic document file format for long-term preservation" to solve the problem in question. But let's look at each item to see how well PDF/A succeeds...
Starting with #7, PDF/A is also known formally as ISO 19005 and thus is an open standard from a highly respected international standards body. In addition, it should be noted that PDF itself is also an ISO standard (ISO 32000, to be specific) so that all uses of PDF are open and non-proprietary.
OK - with that out of the way, let's look at the other items in the list. Items #1 and #2 go together to address the need for a document format to consistently, predictably (and accurately, of course) reproduce the visual representation of the document at any point in the future, the same way it was seen by the author. I don't think anyone would dispute that this is something PDF does quite well. It is also well spelled out as the primary goal in the Introduction of ISO 19005:
"The primary purpose of this part of ISO 19005 is to define a file format ... which provides a mechanism for representing electronic documents in a manner that preserves their visual appearance over time, independent of the tools and systems used for creating, storing or rendering the files."
In addition, the scope statement for PDF/A says that "It is applicable to documents containing combinations of character, raster and vector data." That's an important statement in that it stipulates that the format is there to preserve any/all static 2D content be it text, raster images or vector data. Clearly, PDF and PDF/A provide for all of those content types.
One other aspect of #1 and #2 is the self-contained nature of the format: there are no external references, which may not be present at a future date (or may have changed over time), to interfere with the ability of a conforming reader (#6) to reproduce the visual representation. In fact, one of the areas where PDF/A subsets the full PDF standard is in disallowing any feature that relies on external references (such as non-embedded fonts, external streams, etc.).
Moving on to #3, support for rich metadata, including standards (such as the Dublin Core) as well as custom schemas: PDF/A has full support for the XMP specification. Through XMP, authors can use not only standard schemas but also custom schemas expressed in RDF syntax, at both the document level and on individual objects/assets (such as images) in the document.
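To make the XMP point concrete, here is a minimal sketch of a packet that mixes Dublin Core properties with one custom property, and of reading it back with Python's standard library. The "ex" namespace, its retentionYears property, the sample values and the read_xmp helper are all assumptions made up for illustration; PDF/A also places further requirements on how custom schemas are declared, which this sketch does not show.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the lookups below.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "ex": "http://example.com/ns/archive/1.0/",  # hypothetical custom schema
}

# A small XMP packet: two Dublin Core properties plus one custom property.
XMP_PACKET = """\
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:ex="http://example.com/ns/archive/1.0/">
      <dc:title>
        <rdf:Alt><rdf:li xml:lang="x-default">Annual Report</rdf:li></rdf:Alt>
      </dc:title>
      <dc:creator>
        <rdf:Seq><rdf:li>Jane Doe</rdf:li></rdf:Seq>
      </dc:creator>
      <ex:retentionYears>25</ex:retentionYears>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
"""


def read_xmp(packet: str) -> dict:
    """Pull the standard and custom properties out of the packet."""
    root = ET.fromstring(packet)
    desc = root.find("rdf:RDF/rdf:Description", NS)
    return {
        "title": desc.findtext("dc:title/rdf:Alt/rdf:li", namespaces=NS),
        "creators": [li.text for li in desc.findall("dc:creator/rdf:Seq/rdf:li", NS)],
        "retention_years": desc.findtext("ex:retentionYears", namespaces=NS),
    }


if __name__ == "__main__":
    print(read_xmp(XMP_PACKET))
```

The standard and the custom properties sit side by side in the same rdf:Description, which is what lets one metadata block carry both.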
Marginalia, as I frequently tell people when I speak on the subject, is something that many archivists actually find more useful than the document content itself. Consider a speech by a prominent figure - chances are good that they didn't write the speech, but they've certainly "marked up" the document with notes about how they intend to change it for the final delivery. PDF supports a rich set of annotation types that can be applied on top of the PDF content without disturbing/modifying it in any way, and PDF/A allows all of them except for movies and sounds.
Digital Signatures have been a part of PDF for many years, and are a key feature of PDF/A to enable the ability to determine if a document has been tampered with since it was signed. Unlike the traditional detached signature model, where the signature & certificate exist in a separate file, PDF & PDF/A incorporate an enveloping solution so that the document and signature remain together. Also, the PDF and PDF/A standards clearly define not only how to sign the document but also how to verify the document.
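As a rough illustration of keeping the document and its signature in one file, the toy sketch below reserves a slot inside the byte stream, computes a digest over everything except that slot, and writes the digest into it. This mirrors the general idea of a PDF signature covering a byte range that excludes the signature value itself. Note the assumptions: this is not real PDF syntax, a plain SHA-256 hash stands in for a certificate-based signature, and PLACEHOLDER, seal and verify are hypothetical names.

```python
import hashlib

# Reserved slot for the hex digest (SHA-256 gives 64 hex characters).
PLACEHOLDER = b"<" + b"0" * 64 + b">"


def seal(before: bytes, after: bytes) -> bytes:
    """Embed a digest of everything outside the slot into the slot."""
    unsigned = before + PLACEHOLDER + after
    start = len(before)
    end = start + len(PLACEHOLDER)
    covered = unsigned[:start] + unsigned[end:]
    digest = hashlib.sha256(covered).hexdigest().encode("ascii")
    return unsigned[:start] + b"<" + digest + b">" + unsigned[end:]


def verify(sealed: bytes, start: int) -> bool:
    """Recompute the digest over the covered bytes and compare to the slot."""
    end = start + len(PLACEHOLDER)
    stored = sealed[start + 1:end - 1]
    covered = sealed[:start] + sealed[end:]
    return stored == hashlib.sha256(covered).hexdigest().encode("ascii")


if __name__ == "__main__":
    sealed = seal(b"document head ", b" document tail")
    print(verify(sealed, len(b"document head ")))
```

Because the digest travels inside the file, there is no separate signature file to lose, and any change to the covered bytes makes verification fail.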
Finally, item #6 is one that many people are not aware is most clearly part of PDF and PDF/A - the definition of a conforming reader. It's section 4.11 of ISO 32000-1 and 3.15 of ISO 19005-1. Wikipedia's entry for PDF/A says it quite clearly:
In addition, the standard places requirements on software products that read PDF/A files. A "conforming reader" must follow certain rules including following color management guidelines, using embedded fonts for rendering, and making annotation content available to users.
I think that the Wikipedia page for HTML does a pretty good job of stipulating what HTML was designed for, and what it remains in its current standardized form from the W3C, HTML 4:
It provides a means to describe the structure of text-based information in a document—by denoting certain text as links, headings, paragraphs, lists, etc.—and to supplement that text with interactive forms, embedded images, and other objects. HTML is written in the form of "tags" that are surrounded by angle brackets. HTML can also describe, to some degree, the appearance and semantics of a document...
That definition makes it clear, and I don't believe that anyone would disagree, that HTML is about the structure and semantics of the content and not about its presentation. In fact, many people cite the flexibility of presentation of HTML as a feature of the language rather than a problem. And for live content in an ecosystem such as the World Wide Web, it is most certainly a better solution than fixed layout formats such as PDF. In addition, HTML was also designed on the premise of hypertext (that is, of course, what the 'H' stands for) and so many elements - especially images - are referenced as separate assets (sometimes not even on the same server) rather than being directly embedded. In fact, HTML doesn't even provide for the ability to embed images (or other objects) inside of itself if you wanted. When considering the long term preservation of documents, these two design goals of HTML fly in the face of the need to maintain a single, reliable record of the visual presentation of the document at the time it was authored.
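The dependence on separate assets is easy to see mechanically. The sketch below, using only Python's standard library, lists the resources a page points at, each of which would have to be captured alongside the HTML for the record to be complete. The ExternalRefFinder class, the external_refs helper and the tag-to-attribute table are illustrative assumptions, not an exhaustive inventory of how HTML can reference outside content.

```python
from html.parser import HTMLParser


class ExternalRefFinder(HTMLParser):
    """Collect (tag, URL) pairs for resources referenced by a page."""

    # Assumed, non-exhaustive map of tags to the attribute holding a URL.
    RESOURCE_ATTRS = {
        "img": "src",
        "script": "src",
        "iframe": "src",
        "link": "href",
        "object": "data",
    }

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        wanted = self.RESOURCE_ATTRS.get(tag)
        for name, value in attrs:
            if wanted is not None and name == wanted and value:
                self.refs.append((tag, value))


def external_refs(html: str) -> list:
    finder = ExternalRefFinder()
    finder.feed(html)
    finder.close()
    return finder.refs


if __name__ == "__main__":
    page = (
        '<html><head><link rel="stylesheet" href="style.css"></head>'
        '<body><img src="http://example.com/logo.png"><p>Hello</p></body></html>'
    )
    for tag, url in external_refs(page):
        print(tag, url)
```

Every entry in that list is something that may be missing, or different, when the page is opened years later.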
HTML, by itself, only provides for text and raster images to be presented. The W3C, of course, has standardized on a number of extensions to its XHTML form that enable the incorporation of general vector graphics, such as SVG or WebCGM, or specific use cases such as MathML, but as clearly pointed out by many, XHTML is not the language of the web, HTML is. Thus, unfortunately, these technologies can't be incorporated into an HTML document, leaving HTML unable to represent many types of documents today.
HTML certainly had metadata in mind from the beginning (via the meta tag), and the Dublin Core folks have a specification that details how to incorporate their metadata schema into HTML. In addition, there is no question that other (non-XML-based) metadata schemas could be included in the same fashion or by other methods. Of course, related to the previous discussion about self-containment, such schemas would need to be fully embedded into the HTML for reliable preservation.
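For comparison, Dublin Core in HTML is typically carried as meta elements whose names use a "DC." prefix, with a link element declaring the schema. The sketch below reads such fields back out with Python's standard library; the sample page, the DublinCoreMeta class and the dublin_core helper are assumptions for illustration only.

```python
from html.parser import HTMLParser

# Sample page head carrying Dublin Core fields (values are made up).
SAMPLE_PAGE = """\
<html><head>
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">
<meta name="DC.title" content="Annual Report">
<meta name="DC.creator" content="Jane Doe">
<meta name="DC.creator" content="John Roe">
<meta name="viewport" content="width=device-width">
</head><body></body></html>
"""


class DublinCoreMeta(HTMLParser):
    """Collect <meta name="DC.*" content="..."> values, keyed by field."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        if name.lower().startswith("dc."):
            # Strip the "DC." prefix; repeated fields accumulate in a list.
            self.fields.setdefault(name[3:], []).append(attr.get("content") or "")


def dublin_core(html: str) -> dict:
    parser = DublinCoreMeta()
    parser.feed(html)
    parser.close()
    return parser.fields


if __name__ == "__main__":
    print(dublin_core(SAMPLE_PAGE))
```

Note that the schema itself is only pointed to by URL in the link element, which is exactly the kind of external reference that self-containment would require embedding.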
While there have been numerous sites that have offered the ability to add comments to web pages and even share those comments with others, each one uses its own proprietary techniques to do so - some using DHTML or AJAX, some using browser plugins, etc. There exists no standard for how to represent such things directly in the HTML language itself and thus no reliable way of archiving such information so that it can be recovered as intended at some point in the future. That leaves HTML out in the cold with respect to marginalia.
As mentioned earlier, HTML is not XHTML, which means that technologies such as XML Signatures cannot be applied to it. And while it is technically possible to apply an XML signature to XHTML, there are no clear provisions for how to store the signature (detached vs. enveloped) nor standard techniques for validating the signed material.
HTML is well known for its lack of "Conforming Reader" (or user-agent, in HTML parlance) requirements. In fact, HTML5 specifically calls this out in its abstract by saying "... special attention has been given to defining clear conformance criteria for user agents in an effort to improve interoperability".
KaiHendry posted a blog entry where he raised what he believes to be PROS and CONS with respect to HTML vs. PDF/A for archiving content. I will be posting a detailed response to it, though many of his issues are exposed as invalid in my text above. Hopefully, KaiHendry will read this document and update his accordingly (or post a new one).
Thanks for your time in reading this...
Leonard Rosenthol
PDF Standards Architect, Adobe Systems
ISO Project Leader, PDF/A (ISO 19005)