The use of content standards and the documentation of standards used is a vital strategy for both access to digital collections and their preservation. The use of metadata standards ensures the discoverability and use of digital collections across multiple contexts, and consistent adherence to format standards extends the life cycle of digital files as technologies evolve. Format standards and metadata standards are equally important aspects of content best practice in building digital collections.
Many use the term "metadata" to refer solely to digital objects, making it seem a different undertaking from "cataloging." The creation of metadata is not a conceptually new activity; the cataloging of physical holdings is the creation of metadata. Digital collections do bring additional needs: metadata that documents the technology associated with the objects, and metadata used to manage large, disparate, and possibly distributed collections of digital files. But much of what is needed is the same as for physical objects - authoritative description of the title, creator(s), subject(s), and geographical coverage, and documentation of holdings, location, and status.
Metadata may be embedded in the digital object or exist externally to it. Metadata is generally separated into four broad categories:
Descriptive metadata: Information that documents the intellectual content and context.
Administrative metadata: Information regarding creation date of the digital resource, copyright, use rights, etc.
Technical metadata: Information that documents the attributes of an object, such as the digital capture process, media format, file size, and pixel dimensions. In some schemes, technical metadata is part of the administrative metadata.
Structural metadata: Information that describes the relationships between files that might make up an object, or the relationship between objects that make up a larger conceptual whole.
As an example, for a digital image, one might have the following metadata recorded:
Descriptive:
title: Bear Dance, Preparing for a Bear Hunt
artist: Catlin, George
creation date: 1835-1837
medium: oil
physical description: 19 5/8 x 27 1/2 in.
type: artwork ; painting
subject (AAT): landscapes (representations)
subject (AAT): Native Americans
subject (LCSH): Indians of North America
tribe: Western Sioux ; Lakota
identifier (SAAM): nnnn.nnn.nnn
credit: Smithsonian American Art Museum, Gift of Mrs. Joseph Harrison, Jr.
Administrative:
persistent identifier: uva-lib:nnnnn
creator: University of Virginia Library
publisher: University of Virginia Library
access: Publicly accessible
copyright: copyright 2004, by the Rector and Visitors of the University of Virginia
creation date: 20040527
Technical:
file size: 4.1 MB
mime type: image/tiff
compression: none
color space: RGB color
image width: 1475
image length: 961
source X: 72
source Y: 72
bits per sample: 8
samples per pixel: 3
Structural: (documents the location to access the various files that make up the object)
thumbnail: http://server/directory/object_thumbnail
screen size: http://server/directory/object_screen
max size: http://server/directory/object_max
descriptive metadata: http://server/directory/object_descmeta
administrative metadata: http://server/directory/object_adminmeta
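As a rough sketch, a record like the one above could be held in a simple structure that groups fields by the four categories. The Python below is illustrative only: the dictionary layout and the helper name uncompressed_size_bytes are assumptions made for this example, not part of any metadata standard. It also shows one practical use of technical metadata, namely cross-checking the recorded values against each other.

```python
# Illustrative only: field names follow the example record above;
# the dictionary layout and the helper name are assumptions.
record = {
    "descriptive": {
        "title": "Bear Dance, Preparing for a Bear Hunt",
        "artist": "Catlin, George",
        "subject (LCSH)": ["Indians of North America"],
    },
    "administrative": {
        "publisher": "University of Virginia Library",
        "access": "Publicly accessible",
    },
    "technical": {
        "mime type": "image/tiff",
        "compression": "none",
        "image width": 1475,
        "image length": 961,
        "bits per sample": 8,
        "samples per pixel": 3,
    },
    "structural": {
        "thumbnail": "http://server/directory/object_thumbnail",
    },
}


def uncompressed_size_bytes(technical):
    """Estimate pixel-data size for an uncompressed image, ignoring headers."""
    bits = (
        technical["image width"]
        * technical["image length"]
        * technical["samples per pixel"]
        * technical["bits per sample"]
    )
    return bits // 8


size = uncompressed_size_bytes(record["technical"])
print(f"{size} bytes, about {size / 2**20:.1f} MiB")
# prints "4252425 bytes, about 4.1 MiB"
```

The estimate is consistent with the 4.1 recorded as the file size in the example, which is the kind of agreement one would expect for an uncompressed file.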
Encoding standards are codifications of the practice of organizing data. This ranges from data dictionaries that describe local fields or elements and the standards for their use, to international standards for the creation of shareable metadata. Best practices for encoding range from a consistent record structure, to the consistent use of the data structure in a record, to the consistent use of controlled vocabularies in those records, and down to the consistent encoding of the characters in those records.
Examples of encoding standards:
Structure of records:
Categories for the Description of Works of Art (CDWA): http://www.getty.edu/research/conducting_research/standards/cdwa/
Data Documentation Initiative (DDI): http://www.icpsr.umich.edu/DDI/
Dublin Core: http://dublincore.org/
Encoded Archival Description (EAD): http://www.loc.gov/ead/
Federal Geographic Data Committee (FGDC): http://www.fgdc.gov/standards/standards.html
MARC: http://www.loc.gov/marc/
Metadata Encoding and Transmission Standard (METS): http://www.loc.gov/standards/mets/
Metadata Object Description Schema (MODS): http://www.loc.gov/standards/mods/
Text Encoding Initiative (TEI): http://www.tei-c.org/
PREMIS Preservation metadata: http://www.loc.gov/standards/premis/
VRA Core: http://www.vraweb.org/vracore3.htm
Cataloging standards and use of data fields:
Anglo-American Cataloguing Rules (AACR2): http://www.aacr2.org/
Cataloging Cultural Objects (CCO): http://www.vraweb.org/ccoweb/
Character encoding:
Unicode: http://www.unicode.org/
Content data for some elements, such as the subject element, may be selected from a "controlled vocabulary," a limited set of consistently used and carefully defined terms. Using terminology from a controlled vocabulary ensures consistency and can improve the quality of search results, and may also reduce the likelihood of spelling errors when recording metadata. The description of each element indicates whether content should be selected from a controlled vocabulary, if possible.
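A minimal sketch of how a controlled vocabulary catches inconsistent or misspelled terms at data entry might look like the following. The vocabulary here is a tiny hypothetical stand-in built from the subject terms in the example record earlier in this article, and the function name check_subjects is an assumption for illustration.

```python
# Hypothetical, tiny stand-in for a controlled vocabulary; a real one
# (e.g. a thesaurus) would be far larger and maintained externally.
CONTROLLED_SUBJECTS = {
    "landscapes (representations)",
    "Native Americans",
    "Indians of North America",
}


def check_subjects(subjects, vocabulary=CONTROLLED_SUBJECTS):
    """Split subject terms into those found in the vocabulary and those not."""
    accepted, rejected = [], []
    for term in subjects:
        cleaned = term.strip()  # tolerate stray whitespace, nothing more
        (accepted if cleaned in vocabulary else rejected).append(cleaned)
    return accepted, rejected


accepted, rejected = check_subjects(
    ["Native Americans", "Native Amercans", " landscapes (representations) "]
)
print("accepted:", accepted)
print("rejected:", rejected)
```

Here the misspelled "Native Amercans" is rejected rather than silently stored, so a later search on the correct term does not miss the record.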
Examples of controlled vocabularies:
Web Thesaurus Compendium: http://www.ipsi.fraunhofer.de/~lutes/thesoecd.html (a compendium of controlled vocabularies)
Index to Organism Names: http://www.biosis.org.uk/ion/search.htm
Getty Research Institute Data Standards and Guidelines: http://www.getty.edu/research/conducting_research/standards/
Getty Art and Architecture Thesaurus: http://shiva.pub.getty.edu/aat_browser/
Getty Thesaurus of Geographic Names: http://shiva.pub.getty.edu/tgn_browser/
Getty Union List of Artist Names: http://shiva.pub.getty.edu/ulan_browser/
ICONCLASS: http://www.iconclass.nl/
Library of Congress Authorities: http://authorities.loc.gov/
Medical Subject Headings (MeSH): http://www.nlm.nih.gov/mesh/meshhome.html
NASA Thesaurus: http://www.sti.nasa.gov/thesfrm1.htm
Nomenclatural Glossary for Zoology: http://scientific.thomson.com/support/products/zr/zoological-glossary/
The Revised Nomenclature for Museum Cataloging, A Revised and Expanded Version of Robert G. Chenhall's System for Classifying Man-Made Objects by James R. Blackaby, Patricia Greeno, and the Nomenclature Committee. Published by American Association for State and Local History, 1988.
SPECTRUM Terminology: http://www.mda.org.uk/spectrum-terminology/
Thesaurus of Graphic Materials I: http://lcweb.loc.gov/rr/print/tgm1/toc.html
Thesaurus of Graphic Materials II: http://lcweb.loc.gov/rr/print/tgm2/
The long-term utility of digital media files remains a major unknown. As data files are created and collected, their creators and stewards must take pains to ensure that the content will be accessible for as long as possible. Using standard file formats and documenting their use is the best guarantee of that longevity. Research the de facto standards in use in the community, decide what formats are most appropriate for your collections, use the formats consistently, document your content creation practices, and include information about the digitization process in the technical metadata that accompanies your digital objects. Consistent and well-documented practices will mean that future migrations to new formats (and preservation migration of formats is a certainty) are more likely to be successful.
There is much discussion in the community of the importance of using non-proprietary file formats. It is crucial to the long-term survival of digital content that it be created using file formats that can be migrated into new formats when necessary. Proprietary formats can often be migrated, but files in proprietary formats may be inaccessible once the software that created them has disappeared from the market. Migrations from non-proprietary or open standard formats can more likely be carried out without the cooperation of a software vendor, since the formats are publicly defined. This is not to say that you must avoid all content in proprietary formats or never use it in your operations - formats like Word or PDF (until fully replaced by PDF/A) are often unavoidable when accepting deposited content or necessary to ensure broad usability of files. When proprietary formats are unavoidable, documenting their use and avoiding proprietary features that complicate data migration are the best practices to follow. This article will not endorse or recommend any particular formats.
The following sites offer discussions and recommendations of technical formats for different types of digital content:
Collaborative Digitization Project Digital Audio Best Practices: http://www.cdpheritage.org/digital/audio/documents/CDPDABP_1-2.pdf
Digital Information Preservation: http://www.rlg.org/ArchTF/tfadi.index.htm
DLF Benchmark for Faithful Digital Reproductions of Monographs and Serials: http://www.diglib.org/standards/bmarkfin.htm
Guidelines for Computer File Types, Interchange Formats and Information Standards: http://www.collectionscanada.ca/06/0612/061204_e.html
JPEG 2000 in Archives and Libraries: http://j2karclib.info/
Library of Congress, Digital Formats for Content Reproductions: http://lcweb2.loc.gov/ammem/formats.html
Moving Picture Experts Group (MPEG): http://www.chiariglione.org/mpeg/
NINCH Guide to Good Practice in the Digital Representation and Management of Cultural Heritage Materials: http://www.nyu.edu/its/humanities/ninchguide/
NISO Framework of Guidance for Building Good Digital Collections: http://www.niso.org/framework/Framework2.html
Digital Media File Types: Survey of Common Formats (NIST): http://www.itl.nist.gov/div895/isis/filetypes.html
PADI Format Standards: http://www.nla.gov.au/padi/topics/452.html
PADI Formats and media: http://www.nla.gov.au/padi/topics/44.html
Society of Motion Picture and Television Engineers (SMPTE) Standards: http://www.smpte.org/smpte_store/standards/
Sustainability of Digital Formats: Planning for Library of Congress Collections: http://www.digitalpreservation.gov/formats/index.shtml
VRA Guides to Quality in Visual Resource Imaging: http://www.rlg.org/legacy/visguides/
Interoperability is the ability of two or more information systems to exchange and use information. Metadata interoperability is highly dependent upon the ability to map identical or similar elements of data structures. The simplest strategy for interoperability is for each system to employ similar data structures and similar or identical encoding semantics and controlled vocabularies, as with MARC records, AACR2, and controlled vocabularies in the creation of library records. In the reality of the digital library world, metadata is created by a myriad of repositories using a myriad of data structures, semantics, and vocabularies and taxonomies (community-based or local), all valid for their context and environment.
The most difficult task in metadata interoperability is semantic and taxonomic compatibility. Full metadata interoperability across diverse systems with diverse content is impossible. Achieving even limited interoperability between systems requires coordination, most often achieved through a common mapping between the fields or elements that each system uses to organize metadata. A "map" documents the correspondences between similar data fields or elements in different systems or standards. When exchanging information, each system can recognize the mapping between fields or elements and translate between them.
Best practices demand consistent use of encoding rules and practices (e.g., AACR2 or CCO [Cataloging Cultural Objects]) and standardized vocabularies (e.g., the Getty Art & Architecture Thesaurus, Library of Congress Subject Headings, or those developed by specialized communities), as well as documentation of the vocabularies in use for each field or element. Digital library programs should develop and maintain "data dictionaries" that document all fields or elements in all local systems and the semantics, taxonomies, and vocabularies used in all instances. Also vital is a "crosswalk": documentation of the mappings between local practice and other standards and systems, such as MARC, Dublin Core, VRA Core (Visual Resources Association), Text Encoding Initiative (TEI), and Encoded Archival Description (EAD).
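To make the idea of a crosswalk concrete, the sketch below maps a few local field names, borrowed from the example record earlier in this article, to unqualified Dublin Core elements. The particular mappings and the function name crosswalk are illustrative assumptions; an institution's real crosswalk would be decided and documented locally.

```python
# Hypothetical crosswalk from local field names to unqualified
# Dublin Core elements. The mapping choices are illustrative only.
LOCAL_TO_DC = {
    "title": "title",
    "artist": "creator",
    "creation date": "date",
    "subject (AAT)": "subject",
    "subject (LCSH)": "subject",
    "type": "type",
    "identifier (SAAM)": "identifier",
}


def crosswalk(local_record, mapping=LOCAL_TO_DC):
    """Translate a local record to Dublin Core.

    Returns (dc_record, unmapped), where unmapped lists local fields
    that have no documented target and were therefore left out.
    """
    dc_record, unmapped = {}, []
    for field, value in local_record.items():
        target = mapping.get(field)
        if target is None:
            unmapped.append(field)
            continue
        # Several local fields may share one target, so collect values.
        dc_record.setdefault(target, []).append(value)
    return dc_record, unmapped


local = {
    "title": "Bear Dance, Preparing for a Bear Hunt",
    "artist": "Catlin, George",
    "subject (AAT)": "Native Americans",
    "subject (LCSH)": "Indians of North America",
    "tribe": "Western Sioux ; Lakota",
}
dc, unmapped = crosswalk(local)
print(dc)
print("unmapped:", unmapped)
```

Two things stand out in this sketch. Both subject fields collapse into a single "subject" element, so the record loses track of which vocabulary each term came from. The local "tribe" field has no target at all. Both are examples of why full interoperability is hard and why the crosswalk itself needs to be documented.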
There are interoperability concerns not just for metadata, but for the files that make up the content of the digital objects. Some institutions use TIFF, some use JPEG 2000, and some use MrSID. TEI has passed through a number of stages of development, and some institutions use TEIxLite while others use TEI P4. Formats change over time. TIFF has had many versions. There is a new TEI profile under development. JPEG 2000 is a relatively new standard. PDF/A is a developing open version of PDF. There is no right or wrong; each institution picks what is appropriate for its needs at the time of collection building.
There are many best practices to keep in mind for format interoperability:
Review the standards used in the community. Broad acceptance of a format generally translates to broad support for a format across many systems.
Use format standards consistently. Do not use TIFF for one project and JPEG 2000 for another project unless there is some functionality required by a project that translates to the need for a different format.
Document your use. Record which formats you use for all types of collections, as well as the details of your use. Do not just record that you use TIFF - record that you capture 32-bit color masters at 600 dpi, saving the files in the TIFF format. As another example, do not just record that you use TEI - record any local standards for the use of the TEI elements or local extensions that you may use, which will affect the ability of other systems to parse your TEI files.
Include technical metadata in the metadata for your objects. This improves the likelihood that other systems will be able to recognize and use your files. At a minimum, document the MIME type in a standardized way.
Validate, refresh and migrate. Periodically review files on live disk and offline media to confirm that the files are uncorrupted and usable. Refresh offline storage media regularly. If you are switching to a new format - such as from MrSID to JPEG 2000 - migrate your existing collections in addition to changing your new production. This will simplify your delivery systems and simplify any future format migrations.
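The periodic review in the last practice above can be partly automated. One common approach, not prescribed by this article, is to record a checksum for each file when it is created and compare it later. The sketch below assumes a hypothetical manifest that maps file paths to previously recorded SHA-256 digests; the function names are likewise assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(manifest):
    """Compare each file against its recorded digest.

    `manifest` maps file paths to previously recorded SHA-256 digests
    (a hypothetical structure for this sketch). Returns the list of
    paths that are missing or whose content has changed.
    """
    problems = []
    for path, recorded in manifest.items():
        if not Path(path).is_file() or sha256_of(path) != recorded:
            problems.append(path)
    return problems
```

A matching checksum shows only that the bytes are unchanged. It does not show that the file can still be opened by current software, so a checksum pass complements, rather than replaces, a review of usability.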
While these practices are essential in any local digital collection building initiative, the need is magnified in the sharing or aggregation of metadata and collections. Many institutions make the metadata representing their digital collections available for harvesting and aggregation through the Open Archives Initiative Protocol for Metadata Harvesting (http://www.openarchives.org/). When records from disparate institutions are aggregated in services such as OAIster (http://oaister.umdl.umich.edu/o/oaister/) or the Collaborative Digitization Project (http://www.cdpheritage.org/), it is a boon to the user to have so many resources available through a single search interface. It is also a challenge to the user because some institutions use MARC, some use CONTENTdm, some use EAD, some use TEI, and so on. How have those institutions recorded their metadata, and how have they mapped it? Will a search for "ocean" find all the applicable content? While every institution should use the standards that are most appropriate for its collections, all institutions should pay attention to community practices in the use of record formats, encoding practices, controlled vocabularies, and mapping and crosswalking between standards to better ensure that their collections are shareable. One truth has become self-evident in this area: the richer the metadata - descriptive, administrative, technical and structural - and the more standard the formats used, the better the chances of interoperability.
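For readers unfamiliar with harvesting, an OAI-PMH request is an ordinary HTTP request with a "verb" parameter. The sketch below only builds the URL for a ListRecords request that asks for unqualified Dublin Core (the "oai_dc" metadata prefix); it does not contact any server. The base URL is a placeholder and the function name list_records_url is an assumption for illustration.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no network access)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optional: restrict the harvest to one set defined by the repository.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# The base URL is a placeholder, not a real repository.
print(list_records_url("http://server/oai"))
# prints "http://server/oai?verb=ListRecords&metadataPrefix=oai_dc"
```

Because a harvester receives whatever each repository has mapped into the requested format, the quality of an aggregated search depends directly on the crosswalks and vocabularies used at the source.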
The following sites link to professional organizations, projects, and reports that, in addition to those in the section above, provide a wealth of experience and information for any institution planning projects to create, manage, and deliver collections of digital content.
Association of Moving Image Archivists: http://www.amianet.org/publication/publication.html
Association for Recorded Sound Collections (ARSC): http://www.arsc-audio.org/
Collaborative Digitization Project Digital Toolbox: http://www.cdpheritage.org/digital/index.cfm
Digital Library Federation: http://www.diglib.org/
International Research on Permanent Authentic Records in Electronic Systems (InterPARES): http://www.interpares.org/
Library of Congress Digital Collections and Programs: http://www.loc.gov/library/libarch-digital.html
Museum Computer Network, Standards: http://www.mcn.edu/resources/sigstandards/
National Information Standards Organization: http://www.niso.org/
Northeast Document Conservation Center Handbook for Digital Projects: http://www.nedcc.org/digital/dighome.htm
OCLC Research: http://www.oclc.org/research/default.htm
RLG Guides and Tools: http://www.rlg.org/en/page.php?Page_ID=555
Society of American Archivists: http://www.archivists.org/
Strategies for Building Digital Collections: http://www.clir.org/pubs/reports/pub101/contents.html
Technical Advisory Service for Images: http://www.tasi.ac.uk/advice/advice.html
Visual Resources Association, Resources: http://www.vraweb.org/resources.html