We are all too familiar with the disappointment of navigating to a favorite website or an important online resource only to find that it has evaporated from the live web. Websites disappear or move to new URLs, and embedded content changes continually, an impermanence that can pose a threat to art historical scholarship. For publishers willing to invest the time, expense, and substantial effort of creating web-based digital art history resources, why let those resources go defunct or disappear from the web without preserving them, and the full experience of them as websites, for future researchers?

The practice of web archiving offers a solution to the ephemerality of web-based materials. Web archiving uses a web crawler to harvest all of the files and related metadata of a website and save them as a WARC file (the Web ARChive file format, an ISO standard). WARC files can then be rendered and viewed with a playback mechanism, such as the Internet Archive’s Wayback Machine, recreating the experience of the website as it existed online on a specific date. Web archiving can be as simple as saving a page to the Wayback Machine so that a permanent link persists as a citation for locating the page in the future, or it can be done at scale in a more programmatic way that includes curation and collection development, harvesting and quality assurance, metadata and description, and long-term storage and preservation.
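The simplest case, saving a single page on demand, works through the Wayback Machine’s public “Save Page Now” endpoint, which takes the target URL appended to `https://web.archive.org/save/`. A minimal sketch in Python, assuming that endpoint; the helper names `save_page_url` and `save_page` are my own for illustration, not an official client:

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_page_url(url: str) -> str:
    """Build the Wayback Machine 'Save Page Now' request URL for a page."""
    # The target URL is appended directly to the save endpoint;
    # quote() with safe=":/" leaves the scheme and path intact.
    return SAVE_ENDPOINT + quote(url, safe=":/")

def save_page(url: str) -> str:
    """Request an on-demand capture and return a snapshot URL.

    The Wayback Machine typically answers with a Content-Location
    header or a redirect to the archived copy; fall back to the
    final response URL if the header is absent.
    """
    req = Request(save_page_url(url), headers={"User-Agent": "web-archiving-demo"})
    with urlopen(req) as resp:
        return resp.headers.get("Content-Location") or resp.url

if __name__ == "__main__":
    # Hypothetical example page; no request is made until save_page() is called.
    print(save_page_url("https://example.org/exhibitions"))
```

Building the request URL separately from submitting it keeps the sketch testable without network access; at scale, services such as Archive-It manage this harvesting programmatically.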

Over the past seven years at the New York Art Resources Consortium (NYARC), we have established a web archiving program for collecting, preserving, and making publicly accessible web-based resources specific to art and art history. Our web archive collections are primarily built and managed with the Internet Archive’s Archive-It subscription service. We also use the web archiving tools offered by Webrecorder and the Conifer service at Rhizome. The sites selected for inclusion in our web archives are relevant to scholars of art history and align with our own institutional collecting missions. Our subject-based collections include: art resources, artists’ websites, auction house websites, born-digital catalogues raisonnés, New York City gallery and art dealer websites, and websites for scholarship related to the restitution of lost or looted art. To date we have archived nearly 8 terabytes of content, inclusive of NYARC’s own institutional websites.

Screenshot of an archived version of the London Gallery Project website, part of the NYARC Art Resources web archive collection.

When browsing or conducting research with web archive collections, users may find that certain files or portions of a website are not included in the archived version of the site. While excluding content is sometimes a curatorial decision, more often content is excluded because it cannot be adequately captured by a web crawler or fully rendered in playback by the presently available harvesting and replay tools. Dynamic content can pose steep challenges to both capture and playback; content that displays only in response to a search action is not accessible to a web crawler, and neither is the full database that a visitor to the site would ordinarily query. Additionally, formats such as Flash and JavaScript cannot always be reliably archived, so it is best to include standard HTML links and stable URLs for all portions of the website.

Designing your website with accessibility standards in mind will not only make it more usable and discoverable, it will also make it more readily preservable by a web crawler. Including a site map is an important step toward guaranteeing that the crawler can identify all of the URLs within your website. For content that a site visitor would normally search or sort, providing a “view all entries” option allows a web crawler to archive every item in a collection. Conversely, applying the robots exclusion standard (robots.txt) to portions of a website blocks web crawlers, so it is best to remove those exclusions from the sections of a site that are most important to preserve.
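To illustrate, a robots.txt file at the root of a site can both advertise the site map and avoid blocking the sections that matter most for preservation. A sketch, with hypothetical paths:

```
# Point crawlers to a complete list of the site's URLs.
Sitemap: https://www.example.org/sitemap.xml

User-agent: *
# Block only genuinely non-essential paths, e.g. live search results.
Disallow: /search

# Do not add Disallow rules for sections you want preserved, such as
# /exhibitions/ or /catalogue/; an absent rule means crawlers may harvest them.
```

Note that some archival crawlers can be configured to ignore robots.txt, but keeping significant sections unblocked is the safest default.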

For more information on web archiving and on steps you can take to make the website for your digital art history initiative more easily preservable, the following resources from experienced web archiving practitioners provide helpful guidance:

Sumitra Duncan | Head, Web Archiving at the New York Art Resources Consortium