From at risk to open access: The Endangered Archives of the world
Abstract

The Endangered Archives Programme (EAP) contributes to the preservation of archival material that is in danger of destruction, neglect, or physical deterioration worldwide. Delivered by the British Library, London, and funded by Arcadia, a charitable fund of Lisbet Rausing and Peter Baldwin, EAP supports the preservation of important at-risk photographs, documents, manuscripts, and other items from around the world; it facilitates digital capture of these items, and shares over six million images online. The Programme has accomplished this through funding over 320 projects in 80 countries around the world. The Endangered Archives Programme has been running for over ten years, initially with a mission to preserve, and more recently with an additional mission to digitize and share. This paper reflects on the shifts in preservation and digitization, process, and emphasis over that time; explores what we have learned (and what we still don’t know) about archiving in the digital age; and looks ahead to the digital archives of the future. With the shift towards providing access online to the files digitized through the Endangered Archives Programme, we recently upgraded our Web presence, moving from a proprietary system to open-source technologies like Drupal and Solr, and new community-driven frameworks including the International Image Interoperability Framework (IIIF). We will share key learnings from that process, including how we dealt with substantial amounts (over 200TB) of data and millions of images; how we managed the quality of that data; how we upgraded our systems and processes in line with best practice, including IIIF and Solr search; and how we did it all within an aggressive timeline, with eight weeks from initiating the project to launch. Our learnings from the Programme as a whole and the recent redevelopment will be relevant, useful, and actionable for anyone working with archives, digital collections, preservation, or large amounts of data.
Keywords: archives, digitization, preservation, data, open source, IIIF
The Endangered Archives Programme (EAP) contributes to the preservation of archival material that is in danger of destruction, neglect, or physical deterioration worldwide. Delivered by the British Library, London, and funded by Arcadia, a charitable fund of Lisbet Rausing and Peter Baldwin, EAP supports the preservation of important at-risk photographs, documents, manuscripts, and other items from around the world. Dr. Farquhar directs the Programme.
The Programme accomplishes this by funding projects to identify this endangered material, digitize it, secure the digital copies for the long term, and make them available online for research. Operating since 2005, the Programme has accomplished this through funding over 320 projects in 80 countries around the world (Kominko, 2015). Specifically, it facilitates digital capture of these items, and shares over six million digitized images online, along with their associated metadata.
The Programme considers grants on an annual cycle and is fully reactive. This enables it to enhance local capabilities and support applicants who have a strong connection to and awareness of the collections and their value. Critically, the original materials do not leave the country of origin; one digital copy remains with a local archival partner, and a second digital copy is deposited at the British Library.
The collections under consideration can be under a wide range of threats. They can be in poor physical condition, under attack by insects, rodents, or mold. They can be in buildings damaged by earthquakes; in archives under political threat; or at risk of flooding or fire. The digitization efforts may take place in extreme conditions far removed from a fully equipped studio, placing considerable demands on project teams (Butterworth et al., 2018).
This paper discusses the redevelopment of the website (https://eap.bl.uk) by Cogapp that happened in late 2017.
Motivation for change
During the course of the Programme, there has been a shift from its initial goal of long-term preservation of digital facsimiles of the source archives, to dissemination: providing researchers worldwide with access to the digitized material. This meant that the systems still in use by 2017 were becoming unfit for this purpose: they could not handle the volume of data, and were becoming increasingly unreliable; the images displayed were only medium-resolution, and the website was hard to navigate, looked outdated, and was difficult to use on mobile devices.
Figure 1: the Endangered Archives Programme website prior to the redevelopment
The main goals for the redevelopment project were the following:
- Rapid delivery, with less than two months from project start to launch
- Display high-resolution, zoomable versions of all six million images
- Improve user experience, including powerful search
- Provide compatibility with the International Image Interoperability Framework (IIIF) (http://iiif.io/)
- Bring the site into the British Library brand
- Make the site accessible across devices
- Increase stability and scalability
From our experience developing similar systems, we knew that these goals could be met by providing the following three key components:
- A content-management system, for editing arbitrary pages and templating the site
- A search application server, to store the catalog and image data, and provide functionality such as free-text search and faceting
- A dynamic image server, to provide images of archival material in a variety of sizes, plus tiles for zooming
After discussion with the British Library, we selected the following three software systems to provide the above:
- Drupal (https://www.drupal.org), an open-source content management system, with easy extensibility and a great ecosystem of contributed modules
- Apache Solr (http://lucene.apache.org/solr/), a fully-featured and performant search application server, widely used at the British Library
- IIPImage (http://iipimage.sourceforge.net), a dynamic image server with full support for the IIIF Image API (Appleby et al., 2017a)
Although we had a plan for the overall system we would use to deliver the material online, there were still a number of key challenges, in particular the following:
- Large amounts of image data: the EAP has amassed over six million high-resolution images of archival material, representing over 200TB of uncompressed TIFF-format master images. How would we transfer these within the deadline?
- Unclear definition: although the Library had produced some prototype wireframes, these did not account for all the user-facing features required, nor the hierarchical nature of the catalogue metadata. How would we develop a usable interface that met the requirements?
- Short timescale: the stakeholder meeting to review the website was a critical aspect to the Programme’s continued funding, and was immovable. So, the redeveloped system needed to be online and available, featuring most of the final content, within two months of starting work. How would we deliver in such a short timeframe and ensure quality?
In this section we discuss how we addressed the challenges above in order to meet all of the project goals for the redeveloped EAP site.
The project had a very tight timescale. We made the decision to start infrastructure work before the final system specification was agreed. In order to deliver on time, we worked closely with key stakeholders from the British Library, while developing separate parts of the system concurrently. To marshal this, we used an agile process based on a lightweight version of the Scrum methodology.
The team consisted of an experienced product owner, along with the Programme’s lead curator and cataloger, from the British Library; and, at Cogapp, a development team of six people plus a Scrum Master and producer.
We worked in single-week sprints, and made sure to include all of the Scrum procedures (planning, demo, retrospective) to maintain momentum and to refine the process. The Library’s product owner was available for each morning’s standup meeting, and the Cogapp team would also often consult them or other stakeholders ad hoc throughout the day to provide rapid decision-making.
All demo meetings were attended by all key stakeholders from the Library, either in-person or remotely, and we also recorded these for any participants who could not attend. This ensured buy-in throughout the process.
We quickly engaged with key providers and decision-makers at the Library, from a variety of different departments: cataloging, content production, IT and user-experience. It is hard to overemphasize how much this close collaboration benefited the project, and the constant communication combined with pragmatic attitude of the Library staff meant that the direction of the project was constantly refined to produce the most value within the allotted time.
Figure 2: we worked in a lightweight manner to quickly define and evolve the project goals
Zoomable images
We needed to serve millions of images in a cost-effective manner, and so settled on using IIPImage with Amazon S3 as the backing store. Although S3 does not provide as rapid access to file data as block storage such as Elastic Block Storage (EBS) or Elastic File System (EFS), it is considerably more economical (over four times cheaper than EBS and over ten times cheaper than EFS). These cost savings are considerable when dealing with multiple terabytes of image data.
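The scale of the saving is easy to see with a back-of-envelope calculation. The per-gigabyte prices below are illustrative approximations of circa-2017 AWS list pricing, not figures from the project itself:

```python
# Illustrative storage cost comparison for ~200 TB of master images.
# Prices are approximate per-GB-month AWS list prices (circa 2017);
# check current pricing before relying on these numbers.
PRICE_PER_GB_MONTH = {
    "S3 Standard": 0.023,
    "EBS (gp2)": 0.10,
    "EFS": 0.30,
}

DATA_GB = 200 * 1000  # 200 TB expressed in GB

for service, price in PRICE_PER_GB_MONTH.items():
    monthly = DATA_GB * price
    ratio = price / PRICE_PER_GB_MONTH["S3 Standard"]
    print(f"{service}: ${monthly:,.0f}/month ({ratio:.1f}x S3)")
```

At these sample prices, EBS comes out at over four times the cost of S3, and EFS at over ten times, which at 200TB amounts to a difference of thousands of dollars per month.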
To connect the two systems, we used the open-source s3fs system (https://github.com/s3fs-fuse/s3fs-fuse) that allows S3 resources to be mounted as if they were part of the main filesystem on a Linux server.
We also had the challenge of image conversion: the British Library holds master images in TIFF format, but these needed to be converted to the JPEG2000 format to be served by IIPImage. We also needed to log metadata for each image with Apache Solr, in order to link them to a catalog record and provide IIIF manifests (see below for details).
To provide this pipeline, we used AWS Lambda, a “serverless” system that allows the execution of arbitrary code in response to an event: in this case, a new upload to S3. When a source TIFF file is uploaded, we use the Kakadu library to convert it to JPEG2000 format, and then move it to the destination S3 “bucket” that is attached to the IIPImage server via s3fs.
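The shape of such a conversion function can be sketched as follows. This is a minimal illustration, not the production code: the bucket name is a hypothetical placeholder, and the Kakadu compression options actually used are not shown.

```python
import os
import subprocess
import urllib.parse

DEST_BUCKET = "eap-jp2-masters"  # hypothetical destination bucket name


def dest_key(src_key):
    """Map a source TIFF key to its JPEG2000 equivalent."""
    root, _ext = os.path.splitext(src_key)
    return root + ".jp2"


def handler(event, context):
    """Triggered by an S3 'ObjectCreated' event for a source TIFF."""
    import boto3  # bundled in the AWS Lambda Python runtime
    s3 = boto3.client("s3")

    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        local_tif = "/tmp/source.tif"
        local_jp2 = "/tmp/converted.jp2"
        s3.download_file(bucket, key, local_tif)

        # Convert with Kakadu's kdu_compress command-line tool; the
        # project's actual compression parameters are omitted here.
        subprocess.run(
            ["kdu_compress", "-i", local_tif, "-o", local_jp2],
            check=True,
        )
        s3.upload_file(local_jp2, DEST_BUCKET, dest_key(key))
```

Because Lambda runs one invocation per event, this design scales out automatically as uploads arrive in parallel.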
We also created a second Lambda function that was triggered by the arrival of this JPEG2000 file: it performs validation on the file, extracts its height and width in pixels, creates an entry in Solr with these details, and adds the identifiers needed to connect the image to others (such as pages in a manuscript) or to its parent item (such as a photo album). This was done by using a set of structured filename conventions, following the pattern used by the catalogue references minted by the British Library.
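The idea of deriving linking identifiers from filenames can be sketched with a small parser. The pattern below is a hypothetical convention modelled loosely on EAP-style catalogue references; the Programme’s actual convention may differ in detail:

```python
import re

# Hypothetical filename convention modelled on EAP-style catalogue
# references, e.g. "EAP123_1_2_0005.jp2" -> item EAP123/1/2, page 5.
# The exact convention used in production may differ.
FILENAME_RE = re.compile(
    r"^(?P<project>EAP\d+)_(?P<collection>\d+)_(?P<item>\d+)_(?P<page>\d+)\.jp2$"
)


def parse_image_key(filename):
    """Derive linking identifiers from a JPEG2000 filename."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"Unrecognized filename: {filename}")
    parts = match.groupdict()
    return {
        "item_reference": "{project}/{collection}/{item}".format(**parts),
        "page": int(parts["page"]),
    }
```

Each image’s Solr entry can then carry its parent item reference and page number, so that all pages of a manuscript or album can be gathered with a single query.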
Our final challenge was data transfer: the six million source images represent around 200TB of data. Even with the high-speed connectivity available to the British Library, this corresponds to an unacceptably high time to transfer the images from the Library to Amazon’s data centers (as an example, even with a constant upload speed of 100Mbps, the transfer would take around six months).
To provide a faster transfer, we recommended using Amazon Snowball appliances: ruggedized high-speed storage arrays that can hold up to 80TB per unit, and which are shipped via courier to Amazon’s London data center. These proved highly effective, and the Lambda service scaled automatically to cope with the high data transfer rate when each Snowball was connected to unload data at Amazon (approximately 600GB/hour, equivalent to over 15,000 images processed per hour).
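The orders of magnitude involved can be checked with a few lines of arithmetic (decimal units assumed throughout):

```python
# Transfer-time arithmetic for 200 TB of master images.
DATA_BYTES = 200e12          # 200 TB
LINE_RATE_BPS = 100e6        # a sustained 100 Mbps upload

seconds = DATA_BYTES * 8 / LINE_RATE_BPS
days = seconds / 86400       # roughly 185 days over the wire

# Snowball unload throughput, using the ~33 MB average image size
# implied by 200 TB spread across six million images.
AVG_IMAGE_BYTES = 200e12 / 6e6
images_per_hour = 600e9 / AVG_IMAGE_BYTES  # 600 GB/hour -> ~18,000 images
```

A single Snowball therefore unloads in hours what the network link would take days to move, which is why physically shipping the data was the practical option.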
The Programme has digitized over 350,000 items, such as a manuscript or a photo album. Each of these items is described by a catalog record, organized into a hierarchy per project. All these records needed to be imported into Apache Solr to allow for user-facing features such as free-text search and search facets (categorization terms shown along with a count of the number of matches).
To achieve this, we created a two-stage process, written in Python. The first step, known as the Harvester, takes the source data (provided by the Library as scheduled exports in CSV format), applies validation, and saves it in an intermediate XML format. The second step, called the Mill, takes this clean data and provides additional data formatting; for example, by combining certain fields for display. It then uploads these as XML documents to Solr, following the schema that we created to allow faceting on various fields such as source content type, language, script, subject, and location.
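The two-stage shape can be illustrated in miniature. The field names and validation rules below are purely illustrative, not the real EAP schema, but the division of labor matches the description above: the Harvester validates, the Mill formats and emits Solr update XML:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Illustrative field names only; the real schema is far richer.
REQUIRED_FIELDS = ("reference", "title", "language")


def harvest(csv_text):
    """Harvester: validate CSV rows and keep only complete records."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if any(not row.get(f) for f in REQUIRED_FIELDS):
            continue  # the real pipeline reports invalid rows for fixing
        records.append(row)
    return records


def mill(records):
    """Mill: combine fields for display and emit Solr update XML."""
    add = ET.Element("add")
    for rec in records:
        doc = ET.SubElement(add, "doc")
        for name, value in [
            ("id", rec["reference"]),
            ("title_display", f'{rec["reference"]}: {rec["title"]}'),
            ("language_facet", rec["language"]),
        ]:
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")
```

Keeping the stages separate means validation problems can be caught and reported before any formatting or upload work begins.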
A second source of data in Solr is the content-managed information from Drupal; namely the information about each of the 330 projects, as well as other content-managed pages that describe the Programme, and the procedure for applying for grants.
The end result is to have correctly-formatted data in Solr, which can be queried using back-end code in Drupal, and displayed on the website in a consistent manner, intermingling catalogue data with content from the Drupal CMS. Finally, we also query the image data from Solr to provide all the catalogue images on the site, ranging from thumbnails displayed on search result listings, to high-resolution zoomable versions displayed for each page.
Figure 3: faceted search on the main search results page.
A key requirement was to implement full support for both the IIIF Image API (Appleby et al., 2017a) and the IIIF Presentation API (Appleby et al., 2017b). The Image API is provided directly by IIPImage as described in the section Zoomable images above, while the Presentation API required custom coding. This programming, implemented as a Drupal module, queries Solr to obtain CMS content, catalog information, and image data, and then outputs the results as a JSON-format IIIF manifest.
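Although the production module is written in Drupal/PHP, the structure of the output can be sketched in Python. The URLs and identifiers below are illustrative placeholders, but the JSON structure follows the IIIF Presentation API 2.1 specification:

```python
# Minimal sketch of assembling a IIIF Presentation 2.1 manifest from
# per-image records; URLs and identifiers are illustrative placeholders.

def build_manifest(base_uri, image_server, reference, label, images):
    """images: list of dicts with 'identifier', 'width', 'height'."""
    canvases = []
    for i, img in enumerate(images, start=1):
        image_uri = f"{image_server}/{img['identifier']}"
        canvas_id = f"{base_uri}/{reference}/canvas/{i}"
        canvases.append({
            "@id": canvas_id,
            "@type": "sc:Canvas",
            "label": f"Page {i}",
            "width": img["width"],
            "height": img["height"],
            "images": [{
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "on": canvas_id,
                "resource": {
                    "@id": f"{image_uri}/full/full/0/default.jpg",
                    "@type": "dctypes:Image",
                    "width": img["width"],
                    "height": img["height"],
                    # Points viewers at the IIIF Image API endpoint
                    # served by IIPImage for deep zoom.
                    "service": {
                        "@context": "http://iiif.io/api/image/2/context.json",
                        "@id": image_uri,
                        "profile": "http://iiif.io/api/image/2/level1.json",
                    },
                },
            }],
        })
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": f"{base_uri}/{reference}/manifest.json",
        "@type": "sc:Manifest",
        "label": label,
        "sequences": [{"@type": "sc:Sequence", "canvases": canvases}],
    }
```

Each canvas carries the pixel dimensions recorded in Solr by the image-ingest Lambda, which is what allows viewers to lay out pages before any tiles are fetched.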
This manifest is then used to provide the information for each group of digitized images that is rendered on the site using the open-source Universal Viewer (https://universalviewer.io/), customized using its configuration mechanism to align with its use on other British Library websites.
Figure 4: Universal Viewer displaying a IIIF manifest on the EAP website.
The fact that all rendering of archive images and metadata on the site is carried out via IIIF manifests means that the same information can automatically be displayed by other IIIF viewers such as Mirador (http://projectmirador.org/), and we implemented the drag-and-drop pattern on the IIIF logo visible on each archive page, to allow this to happen easily.
Figure 5: Mirador displaying the same IIIF manifest as above.
Branding and mobile support
The British Library brand is supported online by its Global Experience Language (GEL). The GEL offers a consistent, user-tested way to use the Library’s numerous online sites and systems.
Bringing the Endangered Archives Programme site into the GEL involved updating the existing GEL code to work in a responsive manner, based on media queries, as opposed to the server-side browser-detection in place on other British Library sites. With these updates to the GEL templates and the CSS, the site worked across both mobile and desktop browsers, and provided a “look and feel” consistent with the Library’s brand.
This capability is especially important for these materials, which have strong global interest.
Figure 6: the new site is consistent with the British Library’s design and branding
Stability and scalability
The site runs tried-and-tested versions of mature software, in the form of Drupal, Solr and IIPImage, and is therefore extremely stable.
We deployed the key system components on AWS fronted by Elastic Load Balancers. This means that, if demand on resources is high, it is straightforward to increase the capacity of the system using horizontal scaling, i.e. by adding more instances of a given system behind the load-balancer, and having traffic shared amongst them.
The new system architecture also means that it is easy to extend; for example, adding user-facing features such as a map of all project locations (thanks to geocoordinates stored in Drupal). Or, back-end features such as automating the update of catalog metadata by switching from an ad-hoc export via CSV to a scheduled harvest using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).
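As a sketch of what an OAI-PMH harvest involves: requests are plain HTTP with a `verb` parameter, and large result sets are paged with resumption tokens. The endpoint URL below is a hypothetical placeholder, but the verbs and parameters follow the OAI-PMH 2.0 specification:

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the verbs and parameters follow OAI-PMH 2.0.
ENDPOINT = "https://example.org/oai"


def list_records_url(metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request, continuing from a token if given."""
    if resumption_token:
        # A resumptionToken is exclusive: the protocol forbids
        # combining it with any other request arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{ENDPOINT}?{urlencode(params)}"
```

A scheduled harvester would fetch each URL in turn, ingest the returned records, and continue until a response arrives without a resumption token.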
The Endangered Archives Programme has been operating for over a decade now, and during this time, has seen a change in emphasis to support global online access and reuse.
The recent redevelopment project by Cogapp, working in close conjunction with stakeholders from the British Library, used a fusion of three core systems (flexible CMS, powerful search application, and dynamic image delivery) to bring the online representation of the Programme up-to-date, and provide a platform for growth. In doing so, we also demonstrated the suitability of serverless technologies such as AWS Lambda for the bulk processing of image data, in order to rapidly update the millions of images preserved by the EAP into a format suitable for online use, now and in the future.
References
Appleby, M., T. Crane, R. Sanderson, J. Stroop, & S. Warner (2017a). IIIF Image API specification 2.1. Consulted January 25th, 2018. Available http://iiif.io/api/image/2.1/
Appleby, M., T. Crane, R. Sanderson, J. Stroop, & S. Warner (2017b). IIIF Presentation API specification 2.1. Consulted January 25th, 2018. Available http://iiif.io/api/presentation/2.1
Kominko, M. (ed.) (2015), From Dust to Digital: Ten Years of the Endangered Archives Programme. Cambridge: Open Book Publishers. Available https://www.openbookpublishers.com/product/283
Butterworth, J., A. Pearson, P. Sutherland & A. Farquhar (2018, forthcoming). Remote Capture: Digitising Documentary Heritage in Challenging Locations. Cambridge: Open Book Publishers.
Roddis, T. & A. Farquhar (2018). “From at risk to open access: The Endangered Archives of the world.” MW18: Museums and the Web 2018. Published February 4, 2018.