Monday, April 16, 2018

Specifications for Analog Video Digitization: Examining Community Practices

Overview

Here at the Bentley Historical Library, our inventory includes substantial amounts of
moving image materials in a wide variety of obsolete formats. Additionally, we have
prioritized a number of significant moving image collections for digitization in the near
future. In order to ensure preservation and access over the long term, we are
seeking to formalize our digitization strategy by establishing a contract with a vendor.
To initiate this process, we are writing a Request for Proposals (RFP) for moving
image digitization. One important part of the RFP is outlining detailed specifications
for the transfer of analog materials to digital files. Our main goals for developing
specifications for the RFP are to comply with community best practices as well as
meet the needs of the Library, our technical infrastructure, and our researchers.

There is currently no consensus in the library and archives community on a target
preservation format for analog video. In order to better understand current practices,
I began collecting and comparing specifications across institutions. After a thorough
online search for analog video specifications for digitization, I discovered documentation
from fourteen organizations, including university, public, and state libraries and archives,
that have made their specifications openly available. To make review and comparison of
the specifications easier, common factors, such as wrapper and codec information, were
extracted and recorded in a spreadsheet. The following findings are synthesized from
the aggregated specifications.


Findings

Specification Documentation. Presentation of specifications and terminology varied
widely across institutions; however, most of the specifications themselves were very
similar. Trends emerged, such as common wrapper and codec pairings, color space,
and chroma subsampling. Many institutions provided more limited or incomplete
specifications than others. For example, a number of institutions did not include
any specifications for accompanying audio. Some specifications included multiple options
for a particular parameter, often citing one as preferred and another as acceptable.

Most Common Wrapper and Codec. QuickTime (.mov)/Uncompressed (v210) is the most
commonly used wrapper and codec pairing, followed by Matroska (.mkv)/ffv1 (see table and
charts below). In the past few years, a growing number of institutions have adopted Matroska
(.mkv)/ffv1. This trend may be due to increased community support, including active tool
development and standardization efforts, as well as storage considerations.


Wrapper             Codec                  Number of Institutions Using Pairing
QuickTime (.mov)    Uncompressed (v210)    8
Matroska (.mkv)     ffv1                   5

Other pairings include: AVI/ffv1, AVI/JPEG2000, MXF/ffv1, MXF/Uncompressed or JPEG2000


File Size by Codec. The codec selected for a digitization project has a significant impact on
the amount of data produced. Some institutions choose a lossless compression, such as
JPEG2000 or ffv1, to reduce data while maintaining a faithful copy of analog source material.
Based on a blind sampling, Indiana University found that using the Uncompressed (v210)
codec produced files of approximately 100 GB per hour of content. The ffv1 codec averaged
33.2 GB per hour. Choosing a lossless compression was expected to reduce the amount of data
produced for their entire project by approximately 65%.
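
As a rough cross-check of these figures, the sketch below estimates the video-only data rate of v210 at 720 x 486 and 29.97 fps, then works out the reduction implied by the two per-hour averages quoted above. It assumes the usual v210 layout (six pixels packed into 16 bytes, with each row padded to a multiple of 128 bytes) and ignores audio and wrapper overhead. The function names are my own and do not come from any of the specifications.

```python
from fractions import Fraction


def v210_bytes_per_frame(width: int, height: int) -> int:
    """Size of one v210 frame in bytes.

    Assumes 6 pixels are packed into 16 bytes and that each row is
    padded to a multiple of 48 pixels (128 bytes).
    """
    padded_width = -(-width // 48) * 48  # round up to the next multiple of 48
    row_bytes = padded_width // 6 * 16
    return row_bytes * height


def gb_per_hour(bytes_per_frame: int, fps: Fraction) -> float:
    """Video-only data volume for one hour, in decimal gigabytes."""
    return float(bytes_per_frame * fps * 3600) / 1e9


NTSC_FPS = Fraction(30000, 1001)  # the exact value behind "29.97 fps"
video_only = gb_per_hour(v210_bytes_per_frame(720, 486), NTSC_FPS)

# Per-hour averages quoted in the paragraph above.
V210_GB_PER_HOUR = 100.0
FFV1_GB_PER_HOUR = 33.2
reduction = 1 - FFV1_GB_PER_HOUR / V210_GB_PER_HOUR

print(f"v210 video essence: {video_only:.1f} GB per hour")
print(f"Reduction implied by the quoted averages: {reduction:.1%}")
```

Under those assumptions the video essence alone comes to roughly 100 GB per hour, in line with the reported figure, and the two quoted averages imply a reduction of about two-thirds.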

[Chart: Wrapper/File format, 14 responses]

[Chart: Video compression/Codec, 14 responses]

Beyond Wrapper and Codec

Frame Size and Aspect Ratio. Most institutions specified an aspect ratio of 4:3 with a
frame size of 720 x 486 for standard definition video. Some varying specifications include
a frame size of 640 x 480 (SD) or 486 x 720 and an aspect ratio derived from the source
material (“Same as source”).
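
One consequence of pairing a 4:3 aspect ratio with a 720 x 486 frame is that the stored pixels are not square. The small calculation below is only an illustration of that relationship, using the naive ratio of display shape to stored frame shape. Broadcast standards define the exact pixel aspect ratio slightly differently, so treat the result as approximate.

```python
from fractions import Fraction


def pixel_aspect_ratio(width: int, height: int, display_aspect: Fraction) -> Fraction:
    """Ratio of pixel width to pixel height needed for a width x height
    frame to fill a display of the given aspect ratio."""
    return display_aspect / Fraction(width, height)


FOUR_BY_THREE = Fraction(4, 3)

par_486 = pixel_aspect_ratio(720, 486, FOUR_BY_THREE)  # narrower than square
par_480 = pixel_aspect_ratio(640, 480, FOUR_BY_THREE)  # exactly square

print(par_486, par_480)
```

This is why a 640 x 480 file displays correctly with square pixels, while a 720 x 486 file relies on the player honoring the stored aspect ratio.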

Color Space and Bit Depth. Color space was always YUV/YCbCr with 4:2:2 chroma
subsampling. YUV and YCbCr are often used interchangeably, but YUV is an analog
encoding whereas YCbCr is digital. Although YCbCr is technically more accurate, YUV
is an industry-accepted term and understood to mean YCbCr when referring to digital
video. The requirement for bit depth was most often 10-bit; however, a few institutions
allowed for 8- or 10-bit.

Frame Rate and Scanning. Frame rate was most commonly maintained from the analog
source; however, some specified 29.97 or 30 fps. One organization required 60 fps based
on the use of interlaced scanning. Three specifications required interlaced scanning, while
three others maintained the original scanning. Only one organization chose progressive scanning.

Timecode and Closed Captioning. When included, specifications for timecode and
closed captioning were always to maintain the original. For timecode, additional instructions
were sometimes included for adding synthetic timecode when no original exists.

Audio Specifications. When specifications for accompanying audio were included, most
institutions required files to be: Uncompressed PCM, 48 kHz, 24-bit, with channels same as
source. Some variations include one organization specifying 2-channel audio and another
allowing for 16- or 24-bit resolution.

I hope these findings will be helpful to others who might be in the process of writing an
RFP or selecting a preservation format for their video materials. It is important to mention
that this sampling is by no means exhaustive. I have since discovered additional specifications;
however, we found this sample size sufficient for our comparison. Feel free to get in touch with
questions or for more information about this work.

In an upcoming post, Melissa Hernández-Durán, Lead Archivist for Audio Visual Curation,
and I will write about our experiences developing metadata requirements for moving image
digitization. Stay tuned!

Friday, February 2, 2018

Conservation Treatment Tiers: An Aid to Prioritization

Staff members often need to know how much time a repair might take in order to prioritize work or to give an estimate to a donor who would like to sponsor a project. In 2017 the staff in the Bentley Conservation Lab devised a more comprehensible method of estimating repair time. A three-tier system didn’t seem detailed enough, so we started with four and tweaked it over the next couple of months until we settled on our five-tier system.
                Our Tier One category (less than one-hour repair) responds to requests for a quick fix (examples below). Tier Five designates projects that are very involved and will take more than ten hours. There is a lot of area between “less than one hour” and “more than ten,” so we broke it down into three more tiers that fit with our most common types of projects.
                The legend (below right) hangs in our lab for easy reference.  The bar graph is useful in reporting to our administration (through the Business Intelligence Committee) about the types of projects we handle and how long they take. It doesn’t report ongoing work, just the projects that have been completed each month.


Graph for Business Intelligence Report and legend for Conservation Lab

Tier 1: < 1 Hr.

                 A Tier One item might be popped in between longer projects or at the end of a day when starting a larger project doesn’t seem efficient. A Tier One item is often done immediately because it is needed by the digitization lab, which makes it high priority. Another example is when a researcher in our reading room requests an item and Reference staff find the item in such need of repair that it might be damaged in handling.  Some examples are mending small tears, ironing wrinkles, and removing sewing or staples.


Ironing on a quick mend

Tier 2: 1 - 3 Hrs.

               Tier Two covers slightly more time-consuming repairs such as making portfolio-style boxes or encapsulating scrapbook leaves when they are too fragile to be rebound but must be protected.


Scrapbook pages encapsulated in polyester film


Inside view of a portfolio style box


A finished portfolio style box

Tier 3: 3 - 5 Hrs.

               Examples of Tier Three jobs are mending maps or drawings, depending on the extent of the tears and number of items. The photos show tears in a map and previous tape repair that needs to be removed from fragile tracing paper.


Damaged drawing on tracing paper


Multiple types of tape on tracing paper drawing


Map torn and separated at the fold

Tier 4: 5 - 10 Hrs.

               This Ann Arbor Film Festival document was a hand-made scroll with many types of tapes and adhesives, definitely Tier Four, as were the founding documents of the University of Michigan Philosophical Society. That volume was in pieces and so important to the university’s history that it was given a ¾ leather binding.


12 foot Ann Arbor Film Festival collaged scroll


2 images of the scroll, detailing tape, adhesive and loose items




University of Michigan Philosophical Society founding documents, before treatment


Inside detail


Detail of rusty staples and worn signature folds


Finishing the ¾ leather binding

Tier 5: > 10 Hrs. 

         Tier Five projects are those that take over ten hours, and we try to estimate just how many that might be. In this case we had a scrapbook of extremely acidic and crumbling paper with newspaper articles that were fragile, wrinkled and torn. We photographed each page before removing the items, then used those photos for proper placement on the new pages.  The new scrapbook was larger so the articles could be displayed without overlapping.


Scrapbook, before treatment


Scrapbook pages were numbered and photographed for identification




The photos were used to match fragments of articles for proper placement


Reconstructed articles and polyester film pockets on new scrapbook leaves



Original Cover


Finished scrapbook, at long last!

               Our treatment tiers are serving their purpose and mesh well with the Bentley's system of prioritization. (Hint: it involves a COLORFUL spreadsheet!) More about that in our next rip-roaring installment.  

Wednesday, September 20, 2017

Archivematica Implementation: A Retrospective

First, some exciting news: It's official! We've fully implemented Archivematica here at the University of Michigan Bentley Historical Library and, as of August 31, 2017, we've used it as part of the end-to-end digital archives workflow we developed during the Mellon-funded ArchivesSpace-Archivematica-DSpace Workflow Integration project to deposit our first "production" AIPs into DeepBlue, our repository for digital preservation and access.

Go ahead, check them out! (And these too!) And there was much rejoicing (yaaaaaaaay)...

In this post, we'll reflect a bit on what's happened since our last status report and look forward to a brave new digital archives world (at least here at the Bentley).

Major Milestones


Archivematica is a web- and standards-based, open-source application which allows your institution to preserve long-term access to trustworthy, authentic and reliable digital content.

The Mellon grant officially concluded nearly a year ago on October 31st, 2016. At the time, we announced that we had achieved each of the three major development objectives for the project:
  • the creation of a new Appraisal and Arrangement tab in Archivematica that will permit archivists to characterize, review, arrange, and describe digital archives;
  • the integration of Archivematica and ArchivesSpace; and
  • the integration of Archivematica and DSpace.

Archivematica 1.6 was officially released on March 16, 2017. It was dubbed "the Nancy Deromedi release" in memory of Nancy Deromedi, former Associate Director for Curation here at the Bentley, whose vision helped shape defining features of the release. It contained the features listed above, whose development we sponsored as part of this work, as well as some work by MLibrary's own Aaron Elkiss to "drastically cut down" the number of files that need to be indexed by removing empty BulkExtractor logs. (Up until this point, indexing in Archivematica had been a huge problem for us, particularly for transfers with lots of files.)

Even with this release, however, we still weren't quite ready to fully adopt Archivematica and go "live" with the ArchivesSpace-Archivematica-DSpace workflow (even though we had been making extensive use of Archivematica's Backlog feature and the `automation-tools`).

Fix One Bug, Two More Shall Take Its Place

Even before Archivematica 1.6 was officially released, however, we had identified a number of additional bugs (and missing features) that were blocking our full adoption and implementation of Archivematica.

Issues Addressed by Artefactual

We opened another contract with Artefactual (the lead developers of Archivematica) to address a number of these issues, some of which are listed below:
  • Handles were not being written back to the File Version field of ArchivesSpace's Digital Object module. Ultimately, this meant that links out to digital content were not making it back to our public finding aids.
  • We were unable to drag-and-drop all files from the Backlog pane. This was essential to being able to associate digital content with its description.
  • It was difficult to identify the location of files in the Backlog pane when they had been singled out in the Examine Contents and File List panes. Archivists thus had a hard time locating files (e.g., after they had tagged them, say, as having sensitive data) in their original order.
  • Files whose formats were not able to be identified were being included in facets for other file formats in the Analysis pane, making file format characterization a bit unwieldy.
  • Required (at least for us) metadata fields were not being written to the DSpace Item (although they were being written to the METS file inside the AIP). This had implications for searching and browsing in DeepBlue. It was particularly problematic for online researchers who arrive from search engines, which take people directly to digital content in DeepBlue (rather than through our finding aids); without these fields, they lack the context they need to understand that digital content.
  • Scrolling down the File List pane made all the File List buttons disappear, which led to poor usability of the functionality enabled by the buttons (e.g., creating a new component of description, finalizing an arrangement, etc.).
  • We wanted the option to package AIPs in the .ZIP archive format (in addition to .7Z). We prefer the .ZIP format because it's more familiar than .7Z to the majority of our researchers.
  • The date facets in the File List pane were not functional and, in any case, last modified dates weren't showing up.

Fixes for all of these issues (except the last one, but more on that later) were incorporated in the 1.6.1 release of Archivematica, which came out on August 1, 2017. This release also included some work by our own Dallas Pillen to fix a bug that occurred when trying to run a SIP through Archivematica's Ingest microservices when that SIP (coming from the ArchivesSpace pane in the Appraisal tab) had a date, but no title. (This is a fairly common practice in our description, permitted in ArchivesSpace as well as content standards like DACS.)

Issues Addressed Locally

Due to the local, idiosyncratic nature of some of the additional issues we identified, we also made a number of fixes to our forks of Archivematica and the Archivematica Storage Service:
  • Archivematica
    • We got rid of a nested "digital_object_component_" in the AIP directory structure, a relic of a time before we decided to simplify the way we model digital objects in ArchivesSpace. Now all digital content is packaged inside a single "objects" folder and hopefully this makes things a bit more straightforward for researchers.
    • We added a "http://hdl.handle.net/" prefix to the Handle written back to the File Version field of ArchivesSpace's Digital Object module so that links to digital content in the finding aids actually work. We toyed with hard-coding this in Archivematica, but Dallas ended up creating an ArchivesSpace plug-in that verifies all URLs with Handles coming to ArchivesSpace (whether or not they're coming from Archivematica).
    • We increased one of the timeouts in storage_service.py to an hour (it was set to two minutes) so that the Archivematica Storage Service could move around larger packages (e.g., at initial transfer, at final deposit, etc.) without timing out.
    • We disabled BulkExtractor scanners except the ones we need to identify the most common forms of sensitive data we encounter, since this application is extremely time and resource intensive. At the time, this application was not configurable in the Format Policy Registry.
    • We updated the default Copyright statement going from Archivematica to DSpace to point researchers to access and use restrictions recorded at the collection-level.
  • Archivematica Storage Service
    • We added a feature to deposit a License Bundle with every AIP going to DSpace. This is one of our internal requirements for all deposits to DSpace.
Of course, if you have questions about any of these, please don't hesitate to get in contact with us!
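
To make the Handle prefix fix from the list above a bit more concrete, here is a minimal sketch of that kind of normalization. It is not the code in our fork or in the ArchivesSpace plug-in; the function name and the sample Handle are hypothetical and only illustrate the idea of turning a bare Handle into a working link.

```python
HANDLE_PREFIX = "http://hdl.handle.net/"


def handle_to_url(handle: str) -> str:
    """Return a resolvable URL for a Handle.

    Values that already look like URLs are returned unchanged, so the
    function can safely be applied more than once.
    """
    handle = handle.strip()
    if handle.startswith(("http://", "https://")):
        return handle
    return HANDLE_PREFIX + handle.lstrip("/")


# Hypothetical Handle, used only for illustration.
print(handle_to_url("2027.42/12345"))
```

Leaving existing URLs untouched matters here because values might arrive in ArchivesSpace from more than one source, some already prefixed and some not.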

On Deck for Archivematica 1.7

Looking ahead to Archivematica 1.7, you can expect a couple of additional features related to the ArchivesSpace-Archivematica-DSpace Workflow Integration project, most notably the inclusion of an additional feature that will permit archivists to characterize and review content based on its last modified date.

The new "Last modified" column in the File List pane of the Appraisal tab.

While last-modified dates and times are notoriously unreliable (especially as they change hands or operating systems, e.g., on their way from donor to archive), they can help to give an archivist additional context for a set of files or prepare them for additional preservation steps that might be required for older content, e.g., exploring additional file format migration pathways if the content is of sufficient value.

This release will also contain some work I did to fix a bug that was introduced when the .ZIP functionality was added. The bug occurred when Archivematica tried to update permissions on the "metadata" bitstream when the AIP was packaged using the .ZIP archive file format.

Mission Accomplished (for These Archivists who are at This Institution on Their Mission)


So here we are--we've reached another milestone. As I mentioned at the beginning of the post, as of August 31, 2017, we are officially live with Archivematica and the new features and workflow we developed during the Mellon-funded ArchivesSpace-Archivematica-DSpace Workflow Integration project. In fact, our latest cohort of Project Archivists just started at the beginning of September and they were all trained to use these new tools and workflows--it's all so exciting!

While it's important to say that we've accomplished something--and that we're proud of what we've accomplished!--it's also important to qualify that a bit. What we've got works for us (we think!), at least for now, at least for most of what we're working with. We hope you can take at least some of what we've done (and we tried hard to make sure you could) and make it work for you, too. It's been exciting, for example, to hear about other people's experiences with the Appraisal tab (like this post on "Appraising Appraisal and picking the right tool for the job" by Chris Grygiel).

This has been an amazing journey, and along the way we've learned a lot, not just about Archivematica, but also about software development, project management, working with open source tools and communities, etc. We've said before that the end is just a new beginning--and that remains true today. With that in mind, we know our mission is never "accomplished" as such--we fully expect (and are equally excited for!) all the new challenges and adventures we'll face in Archivematica Land as we move forward.

Until next time!

Thursday, April 6, 2017

An Overview of Archivematica Storage Service Use at the Bentley

When I first encountered Archivematica, I understood it as a pipeline, a chain of microservices, a "sausage-maker." With a little more experience, I realized that this initial impression left out a hugely important part of the Archivematica package: the Storage Service.

As you might have guessed, the Storage Service has to do with storage. Specifically, it allows users to configure the storage spaces (e.g., transfer source locations, AIP and DIP locations, etc.) with which Archivematica interacts.

In short, the Archivematica Storage Service is the heart of Archivematica.

Blood Flow in the Heart


Information Flow in the Archivematica Storage Service
As you can see from this [anatomically correct] diagram, the heart (or Archivematica Storage Service) is made up of chambers (we'll call them Internally and Currently Processing locations, or, more simply, the Archivematica Pipeline). Blood (SIPs) enters the heart (from Transfer Sources) and flows through these chambers; oxygenated blood (AIPs) exits the heart (to AIP and/or DIP Storage).

You can learn more about the heart here.

You can learn more about the Archivematica Storage Service here in this post. (And here. And here.)

Storage Service Structure and Current Use

The Storage Service is made up of a number of different entities: Pipelines, Spaces, Locations and Packages. A Pipeline has Spaces, a Space has Locations, and a Location has Packages:
While it's not obvious from this diagram, the Storage Service can actually be used to configure Spaces and Locations across multiple Pipelines.
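
For readers who think in code, the containment relationships described above can be sketched with a few Python dataclasses. This is a simplified illustration, not the Storage Service's actual data model: the class and field names are assumptions, and it ignores the fact that Spaces and Locations can be shared across Pipelines.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Package:
    uuid: str
    package_type: str  # "Transfer", "SIP", "AIP" or "DIP"


@dataclass
class Location:
    purpose: str  # e.g. "Transfer Source" or "AIP Storage"
    path: str
    packages: List[Package] = field(default_factory=list)


@dataclass
class Space:
    access_protocol: str  # e.g. "Local Filesystem"
    locations: List[Location] = field(default_factory=list)


@dataclass
class Pipeline:
    name: str
    spaces: List[Space] = field(default_factory=list)

    def all_packages(self) -> List[Package]:
        """Walk Pipeline -> Space -> Location -> Package."""
        return [
            package
            for space in self.spaces
            for location in space.locations
            for package in location.packages
        ]


# Illustrative values only; the path and UUID are made up.
aip_store = Location("AIP Storage", "/example/aips", [Package("aip-0001", "AIP")])
pipeline = Pipeline("born-digital", [Space("Local Filesystem", [aip_store])])
print([p.uuid for p in pipeline.all_packages()])
```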

Pipelines

Pipelines are essentially Archivematica installations registered by the Storage Service. Although institutions may have many pipelines, we currently use just one for born-digital processing. That being said, we've imagined scenarios where we'd consider adding more pipelines, if another one of the libraries or archives at the University of Michigan wanted to use Archivematica, for example, or if we ever wanted to use Archivematica for more than this one, fairly well-defined workflow and material type.

Spaces

Pipelines have one or more spaces. Spaces allow Archivematica to connect to physical storage (e.g., a local filesystem or a NFS, or even DSpace/Fedora via SWORD v. 2, LOCKSS, DuraCloud or Arkivum), and users input all the necessary information (e.g., remote hostname and location of the export) for Archivematica to do so.

We make use of a number of local filesystem spaces that point to:
  • a "dropbox" that donors and field archivists use to transfer material;
  • a "legacy" space (really, two spaces) containing our old, pre-Archivematica backlog, where we have the automation-tools pointed; and
  • an "archivematica" space that Archivematica uses for ingest processes.
We also have a "DSpace via SWORD2 API" space, which we use to integrate Archivematica and DSpace. The configuration here looks a bit different than in the other local filesystem spaces, and notable differences include:
  • Archivists must enter a DSpace username and password--these are used to authenticate with DSpace.
  • Archivists must also enter a policy for restricted metadata, in JSON, to override any defaults in DSpace. When AIPs are "repackaged" into "objects" and "metadata" packages, the metadata package will get this policy. In our case, this points to a DSpace "group" that includes a handful of curation and reference archivists here, restricting access to only those archivists.
  • Finally, archivists must select an "Archive format" option. Since we're depositing "packages" of digital objects to DSpace (and DSpace only accepts single objects), you have to package them into a 7z or ZIP file. We make use of the latter, our thinking being that the ".zip" extension is fairly ubiquitous, and that as such there's a greater chance that researchers will recognize it (and know what to do with it).
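
As an illustration of the restricted metadata policy mentioned in the list above, the sketch below builds a JSON policy string. It assumes the policy is a JSON list of objects with "action", "groupId" and "rpType" keys, which is my recollection of the example in the Storage Service documentation; please check those key names and values against your own Storage Service and DSpace versions. The group ID is made up.

```python
import json


def restricted_metadata_policy(group_id: str) -> str:
    """Build a JSON policy granting READ to a single DSpace group.

    The key names and the "TYPE_CUSTOM" value are assumptions; verify
    them against the documentation for your installation.
    """
    policy = [{"action": "READ", "groupId": str(group_id), "rpType": "TYPE_CUSTOM"}]
    return json.dumps(policy)


# "5" is a placeholder for the ID of the DSpace group of curation and
# reference archivists.
print(restricted_metadata_policy("5"))
```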

Locations

Spaces have one or more locations, and locations are where you get into the nitty-gritty of associating an individual location on physical storage with particular "purposes" in Archivematica (e.g., transfer source locations, AIP and DIP locations, etc.). This next part was a bit confusing to me the first time I read it, so I'll quote directly from the documentation: "Each Location is associated with at least one pipeline; with the exception of Backlog and Currently Processing locations, for which there must be exactly one per pipeline, a pipeline can have multiple instances of any location, and a location can be associated with any number of pipelines."
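
Since that rule took me a couple of readings, here is one way to express it as a check. This is a sketch of the constraint as quoted, not code from the Storage Service; the function name is hypothetical and the purpose labels are taken from the location names used in this post.

```python
from collections import Counter
from typing import Iterable, List

# Purposes that must appear exactly once per pipeline, per the quote above.
SINGLETON_PURPOSES = {"Transfer Backlog", "Currently Processing"}


def check_pipeline_locations(purposes: Iterable[str]) -> List[str]:
    """Return a list of problems for one pipeline's location purposes.

    An empty list means the "exactly one per pipeline" rule is satisfied.
    Any other purpose may appear any number of times.
    """
    counts = Counter(purposes)
    return [
        f"expected exactly one {purpose!r} location, found {counts[purpose]}"
        for purpose in sorted(SINGLETON_PURPOSES)
        if counts[purpose] != 1
    ]


ours = [
    "Transfer Source", "Transfer Source", "AIP Storage", "AIP Storage",
    "Currently Processing", "Transfer Backlog",
]
print(check_pipeline_locations(ours))
```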

We make use of the following locations (organized by purpose):
  • AIP Storage
    • We have a number of these locations that correspond to DSpace collections. This is configured by pointing the Storage Service (and the DSpace space) to the DSpace REST API endpoint for that collection, e.g., https://dev.deepblue.lib.umich.edu/swordv2/collection/TEMP-BOGUS/236280, and giving it a name that you'll see in a dropdown when you get to the store AIP microservice in Archivematica:

    • We also have one location on a local filesystem for content with restrictions. These AIPs end up going through a more specialized workflow that matches PREMIS Rights Statements we record in Archivematica with the appropriate "group" or access profile in DSpace, functionality that is not included with the standard DSpace integration.
  • Currently Processing: This is the location used by the Archivematica pipeline as it runs transfers through its various microservices. We've learned the hard way that this space takes a lot of management! We frequently run into 500 errors with the automation-tools that end up being caused by this space being full. Part of the reason it fills up quickly is that Archivematica is very conservative, holding onto copies on copies on copies of transfers in various subdirectories for various reasons, e.g., "rejected" (used when transfers are rejected in the dashboard), "failed" (used when transfers fail for some reason, usually because they're too big, which just exacerbates the "being full" problem) and "tmp" directories. These can be emptied through the "Administration" --> "Processing storage usage" tab of the dashboard, but we ended up just making a daily cronjob to empty these out.
  • Storage Service Internal Processing: This is required for the Storage Service to run, must be locally available to the storage service, and must not be associated with any pipelines.
  • Transfer Backlog: This is where SIPs go when you select the "Send to backlog" option for "Create SIP(s)" in the "Administration --> Processing configuration" tab of the dashboard. This is an optional workflow step, but we make heavy use of it. For us, there can be some time lag between an initial accession of material and its subsequent processing and deposit to DeepBlue. This backlog location is safe and secure and serves as a temporary, "minimally viable" preservation environment for the original digital objects and the logs and METS file generated by Archivematica's initial transfer process. With Archivematica 1.6, thanks to some transfer backlog management development work by Simon Fraser University Archives, you can use a new "Backlog" tab in the dashboard to search and view backlogged transfers, download entire transfers or items from backlog and even perform transfer deletion requests.
  • Transfer Source: Archivematica looks to these locations when creating a new transfer. As mentioned earlier, we use a couple of these, a "dropbox" that donors and field archivists use to transfer material and a "legacy" space containing our old, pre-Archivematica backlog. Material in here is accessed (sometimes slowly if there's a lot in there!) when creating a transfer through the dashboard:
https://www.archivematica.org/en/docs/archivematica-1.6/_images/Browse1.png
Selecting transfer source directories
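
The daily cleanup of the Currently Processing location mentioned above could be sketched as follows. This is not our actual cronjob: the function name and the root path are assumptions, and only the "rejected", "failed" and "tmp" directory names come from the description above. Run something like this only against a path you have confirmed is safe to empty.

```python
import shutil
from pathlib import Path

# Subdirectories named in the post; their parent directory is site-specific.
CLEANUP_DIRS = ("rejected", "failed", "tmp")


def empty_processing_dirs(processing_root, names=CLEANUP_DIRS) -> int:
    """Delete the contents (not the directories themselves) of each named
    subdirectory under processing_root. Returns the number of top-level
    entries removed."""
    removed = 0
    root = Path(processing_root)
    for name in names:
        directory = root / name
        if not directory.is_dir():
            continue  # nothing to clean if the directory is missing
        for entry in directory.iterdir():
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
            removed += 1
    return removed


if __name__ == "__main__":
    # Hypothetical path; replace with your Currently Processing location.
    print(empty_processing_dirs("/path/to/currently-processing"))
```

Scheduled once a day, a script along these lines keeps leftover copies from accumulating between visits to the dashboard.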

Packages

Packages are Transfers, SIPs, AIPs and DIPs uploaded to a location managed by the storage service. The Storage Service is also the place where requests to delete packages are fulfilled by an administrator.

Future Ideas for Storage Service Usage

You may have seen our recent post to the Archivematica Tech list about an API endpoint for posting locations to a space. We're interested in this to try to reuse metadata and further automate our own workflows, for example, in this Resource-to-Collection command-line utility we're working on that:
  • creates or updates a DSpace Collection from an ArchivesSpace Resource (using the DSpace API);
  • creates an Archivematica Storage Service Location for the DSpace Collection (in lieu of the endpoint, we're currently using Selenium with Python for this part);
  • creates and links an ArchivesSpace Digital Object for the DSpace Collection to the ArchivesSpace Resource (using the ArchivesSpace API); and
  • notifies the processor (using their Archivematica username) via a message on Slack (using the Slack API).
Deposit away, Dallas!
 
Who knows, maybe this or something like it could be a button in ArchivesSpace one day.
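
The four steps of the Resource-to-Collection utility described above could be wired together roughly like this. It is a sketch of the control flow only: every name here is hypothetical, and the calls to DSpace, the Storage Service, ArchivesSpace and Slack are passed in as plain functions rather than implemented, since those details depend on local configuration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Steps:
    """Callables standing in for the four external services."""
    upsert_dspace_collection: Callable[[str], str]   # resource id -> collection handle
    create_storage_location: Callable[[str], str]    # collection handle -> location id
    link_digital_object: Callable[[str, str], None]  # (resource id, collection handle)
    notify: Callable[[str, str], None]               # (username, message)


def resource_to_collection(resource_id: str, username: str, steps: Steps) -> Tuple[str, str]:
    """Run the steps in the order given in the list above."""
    handle = steps.upsert_dspace_collection(resource_id)
    location_id = steps.create_storage_location(handle)
    steps.link_digital_object(resource_id, handle)
    steps.notify(username, f"Collection {handle} is ready for deposit.")
    return handle, location_id


# Stand-in implementations so the flow can be exercised without any services.
log = []
demo_steps = Steps(
    upsert_dspace_collection=lambda rid: f"handle-for-{rid}",
    create_storage_location=lambda handle: f"location-for-{handle}",
    link_digital_object=lambda rid, handle: log.append(("link", rid, handle)),
    notify=lambda user, message: log.append(("notify", user, message)),
)
result = resource_to_collection("resource-1", "dallas", demo_steps)
print(result)
```

Keeping each service behind its own function would also make it straightforward to swap the Selenium step for an API call if a locations endpoint becomes available.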


Well, that's enough from us! How do you use the Storage Service? As always, please feel free to leave a comment or drop us an email: bhl-mellon-grant [at] umich [dot] edu. Thanks!