Friday, May 29, 2015

Test-driving your code

In most established archival institutions, any given finding aid can represent decades of changing descriptive practice, all of which are reflected in the EAD files we generate from them. This diverse array of standards and local practices is what makes our job as data-wranglers interesting, but it also means that with any programmatic manipulation we make, there is always a long tail of edge-cases and outliers that we need to account for, or risk making unintentional and uncaught changes in places we aren't expecting.

When I first came on to the A-Space / Archivematica integration project, this prospect was terrifying - that an unaccounted-for side-effect in my code could stealthily change something unintended, and fall under the radar until it was too late to revert, or, worse, never be caught. After a few days of an almost paralytic fear, I decided to try a methodology known to many in the agile software-development world as Test-Driven Development, or TDD.

After the first day I had fallen in love. Using this methodology I have confidence that the code I am writing does exactly what I want it to, regardless of the task's complexity. Equally valuable, once these tests are written a third party can pick up the code I've written and know right away that any new functionality they are writing isn't breaking what is already there. One could even think of it as a kind of fixity check for code functionality - with the proper tests I can pick up the code years down the line and know immediately that everything is still as it should be.

In this post I will be sharing what TDD is, and how it can be practically used in an archival context. In the spirit of showing, not telling, I'll be giving a walkthrough of what this looks like in practice by building a hypothetical extent-statement parser.

The code detailed in this post is still in progress and has yet to be vetted, so the end result here is not production-ready, but I hope exposing the process in this way is helpful to any others who might be thinking about utilizing tests in their own archival coding.

To start off, some common questions:

What is a test?

A test is code you write to check that another piece of code you have written is doing what you expect it to be doing.

If I had some function called normalize_date that turned a date written by a human, say "Jan. 21, 1991" into a machine-readable format, like "1991-01-21", its test might look something like this:
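A minimal sketch of what that test might look like (the normalize_date implementation here is just an illustrative stand-in):

```python
from datetime import datetime

def normalize_date(date_string):
    # Stand-in implementation: parse "Jan. 21, 1991" and re-emit it as ISO 8601.
    return datetime.strptime(date_string, '%b. %d, %Y').strftime('%Y-%m-%d')

def test_normalize_date():
    # The test itself: does the function produce exactly the output we expect?
    assert normalize_date('Jan. 21, 1991') == '1991-01-21', 'Date was not normalized correctly!'
```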

This would fail if the normalized version did not match the expected outcome, leaving a helpful error message as to what went wrong and where.

So what is TDD?

Test-Driven Development is a methodology and philosophy for writing code first popularized and still very commonly used in the world of agile software design. At its most basic level it can be distilled into a three-step cyclic process: 1) write a failing test, 2) write the simplest code you can to make the test pass, and 3) refactor. Where one might naturally be inclined to write code then test it, TDD reverses this process, putting the tests above all else.

Doesn't writing tests just slow you down? What about the overhead?

This is a common argument, but it turns out in many cases tests actually save time, especially in cases where long-term maintainability is important. Say I have just taken on a new position and have responsibility to maintain and update code built before my time. If my predecessors hadn't written any tests I would have to look at every piece of code in the system before I could be confident that any new changes I'm making aren't breaking any current obscure functionality. If there were tests, I could go straight into making new changes without the worry that I might be breaking important things that I had no way to know about.

Ensuring accuracy over obscure edge-cases is incredibly important in an institution like the Bentley. The library's EADs represent over 80 years of effort and countless hours of work on the part of the staff and students who were involved in their creation. The last thing we want to do while automating our XML normalizations is make an unintended change that nullifies their work. Since uncertainty is always a factor when working with messy data, it is remarkably easy for small innocuous code changes to have unintended side-effects, and if one mistake can potentially negate hundreds of hours of work, then the few hours it takes to write good tests is well worth the investment. From a long-term perspective, TDD saves time, money, and effort -- really there's no reason not to do it!

Learn by doing - building an extent parser in python with TDD

That's a lot of talk, but what does it look like in practice? As Max described in his most recent blog post, one of our current projects involves wrestling with verbose and varied extent statements, trying to coerce them into a format that ArchivesSpace can read properly. Since it's on our minds, let's see if we can use TDD to build a script for parsing a long combined extent statement into its component parts.

The remainder of this post will be pretty python heavy, but even if you're not familiar with programming languages, python is unusually readable, so follow along and I think you'll be surprised at how much it makes sense!

To begin, remember the TDD mantra: test first, code later. So, let's make a new file to hold all our test code and start with something simple:

now run it and...

Ta-da! We have written our first failing test.

So now what? Now we find the path of least resistance - the easiest way we can think of to solve the given error. The console suggests that a "split_extents" function doesn't exist, so let's make one! Over in a new file, let's write
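The simplest possible response to that NameError is a function that exists but does nothing:

```python
def split_extents(extent_text):
    # The least we can do: accept the input and return nothing.
    pass
```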

Function created! Before we can test it, our test script needs to know where to find the split_extents function, so let's make sure it can by adding an import at the top of the test file.

Now run the test again, and see where that leads us:

Our assert statement is failing, meaning that split_extent_text is not equal to our target output. This isn't surprising considering split_extents isn't actually returning anything yet. Let's fix the assert error as simply as we can:
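The quickest fix is to return exactly the list the test wants:

```python
def split_extents(extent_text):
    # Ignore the input entirely and cheekily return the exact list
    # the test expects.
    return ['1 linear foot', '1 oversize volume']
```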

There! It's the cheesiest of fixes (the code doesn't actually do anything with the input string, it just cheekily returns the list we want), but it really is important to make these small, path-of-least-resistance edits, especially as we are just learning the concept of TDD. Small iterative steps keep code manageable and easy to conceptualize as you build it -- it can be all too easy to get carried away and add a whole suite of functionality in one rushed clump, only to have the code fail at runtime and not have any idea where the problem lies.

So now we have a completely working test! Normally at this point we would take a step back to refactor what we have written, but there really isn't much there, and the code doesn't do anything remotely useful. We can easily break it again by adding another simple test case to the test file:

This test fails, so we have code to write! Writing custom pre-built lists for each possible extent is a terrible plan, so let's write something actually useful:
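Splitting the input on the word "and" satisfies both tests:

```python
def split_extents(extent_text):
    # Actually split the statement into its parts on " and ".
    return extent_text.split(' and ')
```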

Run the test, and... Success! Again, here we would refactor, but this code is still simple enough it isn't necessary. Now that we have two tests, we have a new problem: how do we keep track of which is which, or know which is failing when the console returns an error?

Luckily for us, python has a built-in module for testing that can take care of the background test management and let us focus on just writing the code. The one thing to note is that using the module requires putting the tests in a python class, which works slightly differently than the python functions you may be used to. All that you really have to know is that you will need to prefix any variable you want to use throughout the class with "self.", and include "self" as the first argument of any function you define inside the class. Here is what our tests look like using unittest as a framework:
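A sketch of the tests rewritten for unittest (sample data invented; split_extents would normally be imported from its own file, but is included here so the example is self-contained):

```python
import unittest

# Normally imported from the file that holds it:
def split_extents(extent_text):
    return extent_text.split(' and ')

class TestExtentSplitter(unittest.TestCase):
    def setUp(self):
        # setUp runs before every test; prefixing variables with "self."
        # makes them available throughout the class.
        self.extent_text = '1 linear foot and 1 oversize volume'
        self.extent_target = ['1 linear foot', '1 oversize volume']

    def test_split_extents(self):
        self.assertEqual(split_extents(self.extent_text), self.extent_target)
```

Adding the usual `if __name__ == '__main__': unittest.main()` stanza at the bottom of the file lets you run it like any other python script.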

You can run the tests just like you would any other python script. Let's try it and see what happens:

Neat! Now we have a test suite and a function that splits any sentence that has " and " in it. But many extent statements have more than two elements. These tend to be separated by commas, so let's write a test to see if it handles a longer extent statement properly. Over in the test class's setUp function, we'll define two new variables:

Then we'll write the test:

Running the test now fails again, but this time the error messages are much more verbose. Here is what we see now that we're using python's testing module:

As you can see, it tells us exactly which test fails, and clearly pinpoints the reason for the failure. Super useful! Now that we have a failing test, we have code to write.
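One way to make the new test pass, written out the long way (a deliberately clunky first draft):

```python
def split_extents(extent_text):
    # First split on commas, then handle "and" inside and at the start
    # of each chunk. It works, but it isn't pretty.
    extents = []
    for chunk in extent_text.split(', '):
        if ' and ' in chunk:
            for piece in chunk.split(' and '):
                extents.append(piece)
        elif chunk.startswith('and '):
            extents.append(chunk[len('and '):])
        else:
            extents.append(chunk)
    return extents
```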

Now the tests pass, but this code is super ugly - time to refactor! Let's go back through and see if we can clean things up a bit.

It turns out, we can reproduce the above functionality in just a few lines, using what are known as list comprehensions. They can be really powerful, but as they get increasingly complicated they have the drawback of looking, well, incomprehensible:
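A condensed version along those lines (equivalent behavior, much harder to read):

```python
def split_extents(extent_text):
    # Split on commas, split each chunk on " and ", and strip leading
    # "and"s -- all in one dense, nested list comprehension.
    return [piece[len('and '):] if piece.startswith('and ') else piece
            for chunk in extent_text.split(', ')
            for piece in chunk.split(' and ')
            if piece and piece != 'and']
```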

We may return to this later and see if there is a more readable way to do this clearly and concisely.

Now, as always, we run the tests and see if they still pass, and they do! Now that we have some basic functionality we need to sit down and seriously think about the variety and scope of extent statements found in our EADs, and what additional functionality we'll need to ensure our primary edge cases are covered. I have found it helpful at this point to just pull the text of all the tags we'll be manipulating and scan through them, looking for patterns and outliers.

Once we have done this, we need to write out a plan for each case that the code will need to account for. TDD developers will often write each planned functionality as individual comments in their test code, giving them a pre-built checklist they can iterate through one comment at a time. In our case, it might look something like this:

If we build this functionality out one test at a time, we get something like the following:

The completed test suite:

And here is a more complete version of the script, refactored along the way to use regular expressions instead of solely list comprehensions:
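The real script is longer than this, but its heart, a regular-expression split, can be sketched like so (the pattern here is ours and only covers the comma-and-"and" cases exercised by the tests above):

```python
import re

def split_extents(extent_text):
    # One regular expression handles "and", commas, and ", and" in a
    # single pass.
    extents = re.split(r',\s*(?:and\s+)?|\s+and\s+', extent_text)
    return [extent for extent in extents if extent]
```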

That's it! We now have a useful script, confidence that it does only what it is supposed to, and a built-in method to ensure that its functionality remains static over time. I hope you've found this interesting, and I'd love to hear your thoughts on the pros and cons of implementing TDD methods in your own archival work - feel free to leave a message in the comments below!

Tuesday, May 26, 2015

ArchivesSpace Dating Advice

As Max detailed in his recent post on extents, there are some aspects of our EADs that are not necessarily wrong (i.e., won't cause any errors when importing into ArchivesSpace), but that are not optimized to take full advantage of potential reporting or searching functionality in ArchivesSpace or other systems. Whereas Max described some of the problems we have with our extent statements, this post will take a look at another aspect of our EADs that we initially thought would be a simple, easy, quick fix... until we learned more: dates.

Dates in our Current Finding Aids

Currently, our dates are encoded with <unitdate> tags in our EADs, but lack a "normal" attribute containing a normalized, machine-readable version of the date.

As an example, our dates might currently look like this: <unitdate type="inclusive">May 26, 2015</unitdate>
As opposed to this: <unitdate type="inclusive" normal="2015-05-26">May 26, 2015</unitdate>

Until now, this has not really been a problem. As you can see from an example such as the Adelaide J. Hart papers, our dates are presented to users as plain text in our current finding aid access system. Under the hood, those dates are encoded as <unitdate> elements, but our access system has no way to search or facet by date. As such, the access system has never needed a normalized, machine-readable form of dates. But what about ArchivesSpace?

Dates in ArchivesSpace

Before getting into what happens to our legacy <unitdate> elements when imported into ArchivesSpace, let's take a look at a blank ArchivesSpace date record.

Based on all of the date fields provided by ArchivesSpace, we can already see here that we're moving beyond plain text representation of our dates. Of particular interest for the purposes of this blog post are the fields for "Expression," "Begin," and "End." Hovering over the * next to "Expression" brings up the following explanation of what that field represents:

What this means is that the "Expression" field will essentially recreate the plain text, human-understandable version of dates that we have been using up until now. Simple enough.

Once we take a look at the "Begin" and "End" fields, however, we can start to see where our past practice and future ArchivesSpace possibilities come into conflict. The "Begin" and "End" fields give us the ability to record normalized-versions of our dates that ArchivesSpace (and other systems) can understand. This is definitely functionality that we will want to use going forward, but what does this mean for our legacy data?

Let's see what happens to our dates when we import one of our legacy EADs into ArchivesSpace.

The ArchivesSpace EAD importer took the contents of a <unitdate> tag and made a date expression of 1854-1888. It did not, however, make a begin date of 1854 or an end date of 1888. Why not? Lines 168-188 of the ArchivesSpace EAD importer can help us understand.

We'll get into a bit more detail about making sense of the ArchivesSpace EAD importer in future posts about creating our custom EAD importer, but for now we'll take a higher-level look at what this portion of it is doing: it takes a <unitdate> tag and makes an ArchivesSpace date record out of various components of that tag and its related attributes. At line 178, the importer makes a date expression from the inner_xml of the <unitdate> tag (that is, the text between the opening and closing <unitdate> tags), essentially recreating the plain text version of the dates that we currently have. But how is it making normalized begin and end dates?

On lines 180 and 181, the EAD importer is making a begin date with norm_dates[0] and an end date with norm_dates[1]. If we look at lines 170-174, we can see how those norm_dates are being made. The ArchivesSpace EAD importer is looking for a normal attribute (represented in the EAD importer as "att('normal')") in the <unitdate> tag and splitting the contents of that attribute on a forward slash to get the begin date (norm_dates[0]) and end date (norm_dates[1]).

In order for our example imported date above to have begin and end dates, the <unitdate> tag should look like this:

<unitdate type="inclusive" normal="1854/1888">1854-1888</unitdate>

Right now it looks like this:

<unitdate type="inclusive">1854-1888</unitdate>

Thankfully for us, making normalized versions of dates like the above is actually fairly simple.

Normalizing Common Dates

Similar to how there were many extents that could be cleaned up in one fell swoop, there are many dates that we can normalize by running a single script. The following script will make a normal attribute containing a normalized version of any <unitdate> that is a single year or a range of years. It will also add a certainty="approximate" attribute to any year or range of years that is not made up of only exact dates. Here, for easy reference, are examples of the attributes that the script adds to each of the possible manifestations of dates that are years or ranges of years:
  • A single year (1924): normal="1924"
  • A decade (1920s): normal="1920/1929" certainty="approximate"
  • A range of exact years (1921-1933): normal="1921/1933"
  • A range of a decade to an exact year (1920s-1934): normal="1920/1934" certainty="approximate"
  • A range of an exact year to a decade (1923-1930s): normal="1923/1939" certainty="approximate"
  • A range of a decade to a decade (1920s-1930s): normal="1920/1939" certainty="approximate"

And here is the script:
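The core logic of that script can be sketched as follows (the function name is ours; the real script also walks every <unitdate> in every EAD with lxml and writes the attributes back in):

```python
import re

def normalize_year_date(date_text):
    """Return (normal, certainty) for a year or range of years, or None
    if the date isn't one of the patterns handled here."""
    single = re.match(r'^(\d{4})(s?)$', date_text)
    if single:
        year, is_decade = single.groups()
        if is_decade:
            # A decade: 1920s -> 1920/1929, approximate
            return ('{0}/{1}'.format(year, int(year) + 9), 'approximate')
        # A single exact year needs no certainty attribute
        return (year, None)
    ranged = re.match(r'^(\d{4})(s?)-(\d{4})(s?)$', date_text)
    if ranged:
        begin, begin_decade, end, end_decade = ranged.groups()
        if end_decade:
            # A decade as an end date really means "through the decade's end"
            end = str(int(end) + 9)
        certainty = 'approximate' if (begin_decade or end_decade) else None
        return ('{0}/{1}'.format(begin, end), certainty)
    return None
```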

When this script is run against our EADs, we get this result:

As you can see, this script added normal attributes to 316,578 of our 415,958 dates. In other words, this single script normalized about 75% of our dates, ensuring that ArchivesSpace will import date expressions, begin dates, and end dates for a majority of our legacy dates.

The Remaining 25% (and other surprises)

In future posts, we'll be going over how we've used OpenRefine to clean up the remaining 25% of our dates that could not be so easily automated, and we'll also be taking a look at some of the other surprising <unitdate> errors we've found lurking in our legacy EADs, including how we've identified and resolved those issues.

These are not the dates you're looking for.

Friday, May 22, 2015

Exten(t)uating Circumstances: 80 Years of Descriptive Practices and the Long Tail(s) of Extents

It all started with a simple error:

Error: #<:ValidationException: {:errors=>{"extents"=>["At least 1 item(s) is required"]}}>

This is the error we got when we tried to import EADs into ArchivesSpace with extent statements that began with text, such as "ca." or "approx." So ArchivesSpace likes extent statements that begin with numbers. Fine. Easy fix. Problem solved.

And it was an easy fix... until we started getting curious.

The Extent (Get It!) of the Problem

As we did our original tests importing legacy EADs into ArchivesSpace (thanks, Dallas!), we started noticing that extents weren't importing quite the way we expected. As it turns out, ArchivesSpace imports the entire statement from EAD's <physdesc><extent> element as the "Whole" extent, with the first number in the statement imported as the "Number"  of the extent and the remainder of the statement imported as the "Type":

An Extent in ArchivesSpace

This results in issues such as the one above, where the number imports fine, but the type imports incorrectly. "linear feet and 7.62 MB (online)" is actually a Type plus another extent statement with its own Number, Type, and Container Summary. This would be more accurately represented by breaking the extent into two "Part" portions.

This also makes for a very dirty "Type" dropdown list:

I've highlighted the only type that should really be there.

Now, this isn't actually a problem for import to ArchivesSpace. But it is a problem. In the end, we decided to take a closer look at extents to clean them up. That's fun, right? In hindsight, our initial excitement about this was probably a little naive. We were dealing with 80 years of highly varied descriptive practices, after all.

Getting Extents

In his last post, Dallas started to detail how we "get" elements from EADs ("get" here means going through our EADs, grabbing extent(s), and printing them with their filename and location to a CSV for closer inspection and cleaning). In case you're wondering how exactly we got extents, here is our code (feel free to improve it!):


# import what we need
from lxml import etree
import csv
import os
from os.path import join

# where are the eads?
ead_path = 'path/to/EADs'  # <-- you have to change this
# where is the output csv?
output_csv = 'path/to/output.csv'  # <-- you have to change this

# "top level" extents xpath
extents_xpath = '//ead/archdesc/did//physdesc/extent'
# component extents xpath
component_extents_xpath = '//ead/archdesc/dsc//physdesc/extent'
# all extents xpath
all_extents = '//extent'

# creates a function to get extents
def getextents(xpath):
    # open the csv and write its header row; the with block closes it for us
    with open(output_csv, 'ab') as csv_file:
        writer = csv.writer(csv_file, dialect='excel')
        writer.writerow(['Filename', 'XPath', 'Original Extent'])
        # go through those files
        for filename in os.listdir(ead_path):
            tree = etree.parse(join(ead_path, filename))
            # keep up with where we are
            print "Processing", filename
            # parse and go through all extents matched by the xpath
            for i in tree.xpath(xpath):
                extent = i.text
                extent_path = tree.getpath(i)
                # flag blank extents so we can track them down later
                if extent is None or not extent.strip():
                    writer.writerow([filename, extent_path, 'ISSUE EXTENT'])
                else:
                    writer.writerow([filename, extent_path, extent])

# get extents
getextents(all_extents)  # <-- you'll have to change this to get the extents you want, "top level," component level or all (i want all)

We weren't exactly thrilled with what we found.

The Long Tail(s) of Extents

Our intern, Walker Boyle, put together histograms of what we found for both extents and component extents, and I converted them into graphs. You need to click them to get the full effect.



How We're Thinking About Fixing Extents (How Comes Later)

As you can see, we had a bit of a problem on our hands. Our extents are very dirty (perhaps that's an understatement!). We decided to go back to square one. Lead Archivist for Description and Workflow Management Olga Virakhovskaya and I sat down to try to at least come up with a short list of extent types. For just the top level extents (2800+), this was a 3 1/2 hour process (3 1/2 hours!). We didn't even want to think about how long it would take to go through the nearly 59,000 component-level extents. (I just did the math. It would take two business weeks). To make matters worse, by the end of our session, we realized that our thoughts about extents were evolving, and that the list we started creating at the beginning was different than the list we were creating at the end.

Frustrated, we got back together with the A-Team to discuss further and deliberated on the following topics.


Our first thought was to turn to Describing Archives: A Content Standard, or DACS. However, it turns out that DACS is pretty loosey-goosey when it comes to extents, especially in the section on Multiple Statements of Extent:

These examples are all over the place!

Needless to say, this didn't help us much.

Human Readable vs. Machine-Actionable Extents

We realized that part of the issue arises from the fact that for pretty much our entire history the text of extent statements has been recorded for the human eyes that will be looking at them, and for those eyes only. ArchivesSpace affords the opportunity for this information to be much more granular and machine readable (and therefore potentially machine-actionable). For instance, we've thought that perhaps we could bring together all extents of a certain Type and add their numbers together to get a total. This wouldn't have been possible before but it might be in ArchivesSpace depending on how well we clean up the extents.

To oversimplify, we decided (at least for the time being) that as we normalize extents we'd like to find a happy medium between flexibility and human-readableness on the one hand, and potential machine-actionability (and consistency for consistency's sake) on the other.

Why Are We Recording This Information Again?

Finally, as with many things in library- and archives-land, every once in a while you find yourself asking, "Why are we doing this again?" This case was no different. We started to really ask ourselves why we were recording this information in the first place, hoping that would inform the creation of a shortlist and a way to move forward.

We turned to user stories to try to figure out the ways that extents might or could get used. That is, not the way they have been or do get used, or even how they will get used, but all the ways they might get used. We thought of these:

First, from the perspective of a researcher...

  1. As a researcher, I need to be able to look at a collection's description and be able to tell quickly how large it is so that I know if I should plan to stay an hour or a week, or look at a portion of a collection or the whole thing.
  2. As a researcher, I'm looking for specific materials (photographs, drawings, audio recordings, etc.) 
  3. As an inexperienced researcher, I don’t know that this information may be found in Scope and Content notes.

And from the perspective of archivists...

  1. As an archivist, I’d like to know how much digital material I have, how much is unique (i.e., born-digital), and how much is not (digitized). This would also be true for microfilmed material.
  2. As an archivist, I need a way to know how much (and what kind) of material I have (e.g., 3,000 audiocassettes; 5,000 VHS tapes, &c.).
  3. As a curation archivist, I need an easy way to distinguish between different types of film across collections (e.g., 8 mm, 16 mm, 35 mm, 2-inch) because the vendor we've selected for digitization only does one or some of these types.
  4. As a curation archivist, I’m working on a better locations/stacks management system. I need to know the physical volume of holdings and the types of physical formats and sizes. 
  5. As a curation archivist, I need a way to know which legacy collections contain obsolete storage media (such as floppy disks of different sizes) so that I can process this digital material, or decide on equipment purchases.
  6. As a reference archivist, I need an easy way to distinguish between different types of film in a collection so that I know whether we have the equipment on site for researchers to view this material.

As you can see, this is a lot to think about!

The Solution

I know you'd really like to know our solution. Well, we've taken care of the easy ones:

Other than the easy ones, however, progress is slow. We're continuing to try to create user stories to inform our thinking, to create a short list of extent types, and to make plans for addressing common extent type issues.

A future post will detail some of the OpenRefine magic we're doing to clean up extents, and another will explain exactly how we're handling these issues and reintegrating them back into the original EADs, code snippets and all. Stay tuned!

In the meantime, why not leave a comment and let us know how and why you use extents!

Tuesday, May 19, 2015

Legacy EAD Clean Up: Getting Started

Previous posts focusing on our work migrating our legacy EADs to ArchivesSpace have discussed the results of legacy EAD import testing and examined the overall scale of and potential solutions for migrating our legacy EADs successfully into ArchivesSpace.

Whereas those posts were generally focused on the bigger picture of migrating legacy metadata to ArchivesSpace (overall error rates, common errors, and the additional concerns that we must address before migrating our legacy collections information without error and in a way that will ensure maximum usability going forward), this post will be the first in a series that takes a more detailed look at individual errors and the tools and strategies that we have used to resolve them.


As previously mentioned, we have found a great deal of success in approaching our legacy EAD clean up programmatically through the creation of our own custom EAD importer and by using Python and OpenRefine to efficiently clean up our legacy EADs. In order to make use of some of the scripts that we will be sharing in this and future posts, you will need to have the following tools installed on your computer:

Python 2.7.9
Aside from the custom EAD importer (which is written in Ruby), the scripts and code snippets that we will be sharing are written in Python, specifically in Python 2. Python 2.7.9 is the most recent version, but if you have an older version of Python 2 installed on your computer that will also work.

lxml
lxml is an XML toolkit module for Python, and is the primary Python module that we use for working with EAD (and, later, with MARC XML). To easily install lxml, make sure that pip is installed along with your Python installation and type 'pip install lxml' into a Command Prompt or terminal window.

To test that you have Python and lxml installed properly, open a Command Prompt (cmd.exe on a Windows computer) or a terminal window (on a Mac or Linux machine) and enter 'python.' This should start an interactive Python session within the window, displaying something like the following:

Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>

If that doesn't work, check the official Python documentation for help on getting Python set up on your system.

Once you are able to start an interactive Python session in your Command Prompt or terminal window, type 'import lxml' next to the '>>>' and hit enter. If an error displays, something went wrong with the lxml installation. Otherwise, you should be all set up for using Python, lxml, and many of the scripts that we will be sharing on this blog.

A lot of the metadata clean up that we've been doing has been accomplished by exporting the content of certain fields from our EADs into a CSV file using Python, editing that file using OpenRefine, and then reinserting the updated metadata back into our EADs using Python.

The Basics of Working with EADs

Many of the Python scripts that we have written for our EADs can be broken down into several groups, among them scripts that extract metadata to be examined/cleaned in OpenRefine and scripts that help us identify potential import errors that need to be investigated on a case-by-case basis.

1. Extracting metadata from EADs

One of the most common types of scripts that we've been using are those that extract some metadata from our EADs and output it to another file (usually a CSV). We'll get into the specifics of how we've used this to clean up dates, extents, access restrictions, subjects, and more in future posts dedicated to each topic, but to give you an idea of the kind of information we've been able to extract and clean up, take a look at this example script that will print collection level extent statements from EADs to the Command Prompt or terminal window:
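A sketch of such a script (the directory path is a placeholder, and the xpath targets collection level extents):

```python
from lxml import etree
import os
from os.path import join

# collection level extent xpath
extents_xpath = '//ead/archdesc/did//physdesc/extent'

def print_extents(ead_path):
    # print each EAD's collection level extent statements to the window
    for filename in sorted(os.listdir(ead_path)):
        tree = etree.parse(join(ead_path, filename))
        for extent in tree.xpath(extents_xpath):
            print('{0}: {1}'.format(filename, extent.text))
```

Calling print_extents() with the path to a directory of EADs prints one line per extent.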

When this script is run against a sample set of EADs, we get the following output:

As you can see from that small sample, we have a wide variety of extent statements that will need to be modified before we import them into ArchivesSpace. Look forward to a post about that in the near future!

2. Identifying errors in EADs

One of the most common types of problems that we identified during our legacy EAD import testing was that there are some bits of metadata that are required to successfully import EADs into ArchivesSpace that are either missing or misplaced in our EADs. As such, we are not always looking to extract metadata from our EADs to clean up and reinsert to fit ArchivesSpace's or our own liking. Sometimes we simply need to know that information is not present in our EADs, or at least is not present in the way that ArchivesSpace expects.

The most common error associated with missing or misplaced information in our EADs is the result of components lacking <unittitle> and/or <unitdate> tags. A title is required for all archival objects in ArchivesSpace, and that title can be supplied as either a title and a date, just a title, or just a date.

We have some (not many, but some) components in our EADs that are missing an ArchivesSpace-acceptable title. Sometimes, this might be the result of the conversion process from MS Word to EAD inserting a stray empty component at the end of a section in the container list, such as at the end of a series or subseries. These empty components can be deleted and the EAD will import successfully. Other times, however, our components that lack titles actually ought to have titles; this is usually evident when a component has a container, note, extent, or other kind of description that indicates there really is something being described that needs a title.

So, rather than write a script that will delete components missing titles or modify our custom EAD importer to tell ArchivesSpace to ignore those components, we need to investigate each manifestation of the error and decide on a solution on a case-by-case basis. This script (something like it, anyway) helps us do just that:

This script will check each <c0x> component in an EAD for either a <unittitle>, a title nested within a <unittitle> (i.e., <unittitle><title>), or a <unitdate>. If a component is missing all three acceptable forms of a title, the script will output the filename and the xpath of the component.
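The original script isn't shown here, but the check it performs might be sketched like this (again assuming lxml and unnamespaced EAD; the function name and sample record are my own invention):

```python
# Hypothetical sketch of the title check described above; not the original code.
import re
from lxml import etree

def untitled_components(tree, filename):
    """Return (filename, xpath) for components with none of the acceptable
    titles: <unittitle> text, a <title> inside <unittitle>, or a <unitdate>."""
    flagged = []
    for c in tree.getroot().iter():
        # Only look at <c01> ... <c12> component elements
        if not (isinstance(c.tag, str) and re.fullmatch(r'c\d{2}', c.tag)):
            continue
        unittitle = c.find('did/unittitle')
        has_text = unittitle is not None and unittitle.text and unittitle.text.strip()
        has_nested_title = unittitle is not None and unittitle.find('title') is not None
        has_date = c.find('did/unitdate') is not None
        if not (has_text or has_nested_title or has_date):
            flagged.append((filename, tree.getpath(c)))
    return flagged

# Inline stand-in: a series with one titled file and one empty component.
sample = etree.fromstring(
    '<ead><archdesc><dsc><c01 level="series">'
    '<did><unittitle>Correspondence</unittitle></did>'
    '<c02 level="file"><did><unittitle/></did></c02>'
    '</c01></dsc></archdesc></ead>'
).getroottree()

for name, path in untitled_components(sample, 'example.xml'):
    print(name, path)  # prints "example.xml /ead/archdesc/dsc/c01/c02"
```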

A sample output from that script is:

Checking those xpaths in the EAD will take us directly to each component that is missing an acceptable title. From there, we can make decisions about whether the component is a bit of stray XML that can be deleted or if the component really does refer to an archival object and ought to have a title.

For example, the following component refers to something located in box 1 and has a note referring to other materials in the collection. This should have a title.

<c03 level="item">
  <container type="box" label="Box">1</container>
  <note><p>(See also Economic Research Associates)</p></note>
</c03>

This component, however, is a completely empty <c02> element at the end of a <c01> element and does not have any container information, description, or other metadata associated with it. It can safely be deleted.

<c02 level="file"><did><unittitle/></did></c02></c01>

In the coming weeks we'll be detailing how we've used these sorts of strategies to tackle some of the specific problems that we've identified in our legacy EADs, including outputting information from and about all of our EADs to CSVs, cleaning up all the messy data in OpenRefine, and replacing the original, messy data with the new, clean data. We'll also be detailing some things that we've been able to clean up entirely programmatically, without even needing to open an EAD or look at its contents in an external program. Legacy metadata cleanup here at the Bentley is an ongoing process; our knowledge of our own legacy metadata issues, our understanding of how we want to resolve them, and our skills to make that a possibility are constantly evolving. We can't wait to share all that we've learned!

Friday, May 15, 2015

Dystopia and Digital Preservation: Archivematica's Character Flaws (and Why We're OK with Them)

A previous post in this series looked at lessons from "No Silver Bullet: Essence and Accidents of Software Engineering" and how they apply to ArchivesSpace, an open source archives information management application we'll be using as part of our Archivematica-ArchivesSpace-DSpace Workflow Integration project.

Today, I'd like to continue in that vein (i.e., that none of these pieces of software are perfect, and that they don't meet every one of our needs, but in the end that's OK) and take a look at what we can learn about Archivematica from a genre of literature and other artistic works that I'm rather fond of: dystopian fiction.

Welcome to Dystopia

Is it dystopia? [1]

A dystopia is an imaginary community or society that is undesirable or frightening; it literally translates to "not-good-place." Dystopian fiction--a type of speculative fiction because it's generally set in a possible future--usually involves the "creation of an utterly horrible or degraded society headed to an irreversible oblivion." [2]

Setting the Scene: Characteristics of Our [Future] Dystopian Society

If there's one thing you can say about those of us who are interested in digital curation and preservation, it's that we're wary of the very real possibility of our future being such a "not-good-place." Here are three (and a half) reasons why:

  1. Society is an illusion of a perfect utopian world.

In The Matrix, a 1999 film by the Wachowski brothers, reality (not quite a utopia, but still) as perceived by most humans is actually a simulated reality called "the Matrix," created by sentient machines to subdue the human population. [3]

The first tell that you're living in a dystopia is that things are "pretty perfect," and I'd argue that, for the casual user of digital material, things certainly seem "pretty perfect." For various reasons, including the fact that for the majority of us the complex technology stack needed to render them is "invisible," digital materials appear as if they'll be around forever (for example, I spend a good deal of time looking for things to link to on this blog, knowing all the while that the lifespan of a URL is, on average, 44 days), or that you can preserve them by just "leaving them on a shelf" like you would a book (bit rot!). However, whether it's due to file format obsolescence or storage medium corruption or insufficient metadata or issues with storage or organizational risks (or...or...or...), in reality digital materials are much more fragile than their physical counterparts. This illusion of permanence has all kinds of implications, not the least of which is that it can be difficult to convince administrators that digital preservation is a real thing worth spending money on.

Whether we subscribe to a full-blown "digital dark age" as asserted by Terry Kuny at the 1997 International Federation of Library Associations and Institutions (IFLA) Council and General Conference (barbarians at the gates and all!), or our views are a bit more hopeful, all of us in the field know that there are many, many threats to digital continuity, and that these threats jeopardize "continued access to digital materials for as long as they are needed." (That's from my favorite definition of digital preservation, by the way.)

  2. A figurehead or concept (OAIS, anybody?) is worshiped by the citizens of the society.

In Nineteen Eighty-Four, written in 1948 by George Orwell, "Big Brother" is the quasi-divine Party leader who enjoys an intense cult of personality. [4]

A second clue that you're living in a dystopia is that a figurehead or, in our case, concept, is worshiped by the citizens of the society.

While "worship" may be a bit strong for the relationship that the digital curation community has with the Open Archival Information System (OAIS) Reference Model, you can't argue that it "enjoys an intense cult of personality." It informs everything we do, from systems to audits to philosophies. Mike likes to joke that every presentation on digital preservation has to have an "obligatory" OAIS slide. I like to joke that OAIS is like a "secret handshake" among our kind. Big Brother is watching!

I'm not trying to imply that OAIS's status is a bad thing. However, it does lead us to another, related characteristic of a dystopian society (and this is the half): strict conformity among citizens and the general assumption that dissent and individuality are bad. Don't believe me? Gauge your reaction when I say what I'm about to say:

We don't create Dissemination Information Packages (DIPs).

That's right. We don't. Just Archival Information Packages (AIPs). [Gasp!]

Strictly speaking, we provide online access to our AIPs, so in a way they act as DIPs. We just don't, for example, ingest a JPG, create a TIFF for preservation and then create another JPG for access. Storage is a consideration for us, as is the processing overhead that we would have to undertake if we wanted to do access right (for example, for video, which would need a multiplicity of formats to be streamable regardless of end user browser or device), as is MLibrary's longstanding philosophy that guides our preservation practices: to unite preservation with access.

As it was put to me by a former colleague (now at the National Energy Research Scientific Computing Center, or NERSC):

This has put us to some degree at odds with practices that are (subjectively, too) strictly based on OAIS concepts, where AIPs and DIPs are greatly variant, DIPs are delivered, and AIPs are kept dark and never touched.

Our custom systems - both DLXS and HathiTrust - deliver derivatives that are created on-the-fly from preservation masters, essentially making what in OAIS terms one might call the DIP ephemeral, reliably reproducible, and even in essence unimportant. (We have cron jobs that purge derivatives after a few days of not being used.) That design is deliberately in accordance with the preservation philosophy here.

Our DSpace implementation is the exception due to the constraints of the application, but it's worth noting we've generally decided *against* approaches that we could have taken that would have involved duplication, such as a hidden AIP and visible DIP (when I asked this question, I was in "total DIP mode"), and I think that is again a reflection of the engrained philosophy here. We've instead aimed for an optimized approach, preserving and providing content in formats that we believe users will be able to deal with.

"To some degree." "Subjectively." "In essence unimportant." Even though this all sounds very reasonable, I'm not sure that my former colleague realizes that we're living in a dystopia, and that Big Brother is watching! You can't just say stuff like that! 2 + 2 = 5!

There's more that I could say about OAIS (e.g., that it assumes that information packages are static in a way that hardly ever reflects reality, and that it doesn't focus enough on engaging with content creators and end users), but that's a post for another day.

  3. Society is hierarchical, and divisions between the upper, middle and lower class are definitive and unbending.

In the novel Brave New World, written in 1931 by Aldous Huxley, a class system is prenatally designated in terms of Alphas, Betas, Gammas, Deltas and Epsilons, with the lower classes having reduced brain-function and special conditioning to make them satisfied with their position in life. [5]

A last characteristic of dystopias is that they are hierarchical, and you can't do anything about it. And let's face it, our digital curation society is hierarchical. We all look to the same big names and institutions, and as someone who came from a small- to medium-sized institution, and as someone who now works at an institution without our own information technology infrastructure, I can tell you first hand that digital preservation, at least at first, can seem like a "rich person's game." For the "everyman" institution (to use a literary trope often found in dystopian fiction, with apologies for it not being inclusive) with some or no financial resources, or without human expertise, it can be hard to know where to start, or even make the case in the first place for something like digital preservation that by its very nature doesn't have any immediate benefits.

As an aside (my argument is going to fall apart!), I think this "class system" is more psychological than anything else. If you are that "everyman" institution, there's a ton that pretty much anyone can do to get started. If you're looking for inspiration, here it is:

  • You've Got to Walk Before You Can Run: Ricky Erway’s report addresses some of the very basic challenges of digital preservation in the real world.
  • Getting Started with Digital Preservation: Kevin Driedger and I talk about initial steps in the digital preservation "dance."
  • 'Good Enough' Really Is Good Enough: Mike, our colleague Aaron Collie, and I (in my old stomping grounds!) make the case that OAIS-ish, or 'good enough,' is just that. You don't have to be big to do good things in digital preservation.
  • National Digital Stewardship Alliance Levels of Preservation: I like this model because it acknowledges that you don't have to jump into the deep end with digital preservation. Instead, the model moves progressively from "the basic need to ensure bit preservation towards broader requirements for keeping track of digital content and being able to ensure that it can be made available over longer periods of time." 
  • Children of Men: Theo Faron, a former activist who was devastated when his child died during a flu pandemic, is the "archetypal everyman" who reluctantly becomes a savior, leading Kee to the Tomorrow and saving humanity! Oh wait...

Enter Archivematica, the Protagonist

It is within this dystopian backdrop that we meet Archivematica, our protagonist. Archivematica is a web- and standards-based, open-source application which allows institutions to preserve long-term access to trustworthy, authentic and reliable digital content. And according to their website, Archivematica has all of the makings of a hero who will lead the way in our conflict against the opposing dystopian force:

  • It is standards-based.

Not only is it in compliance with the OAIS Functional Model (there it is again!), it uses well-defined metadata schemes like METS, PREMIS, Dublin Core and the Library of Congress BagIt Specification. This makes it very interoperable, which is why we can use it in our Archivematica-ArchivesSpace-DSpace Workflow Integration project.

  • It is open source.

Here it is! Just waiting for you to modify, improve and distribute it! The documentation is released under a Creative Commons Attribution ShareAlike license, and the code is released under a GNU Affero General Public License; no questions asked. So really, go ahead!

  • It's built on microservices.

Microservices is a software architecture style in which complex applications are composed of small, highly decoupled and independent processes. When you find a better tool for a particular job, you can just replace one microservice rather than the whole software package. This type of design was highly influential on AutoPro.

  • It is flexible and customizable

Archivematica provides several decision points that give the user almost total control over processing configurations. Users may also preconfigure most of these options for seamless ingest to archival storage and access:

Processing Configuration

  • It is compatible with hundreds of formats.

Archivematica maintains a Format Policy Registry (FPR), a database in which Archivematica users define format policies for handling files: the actions, tools and settings to apply to a file of a particular format (e.g., conversion to a preservation format, conversion to an access format).

Actually, with a little luck, the FPR is about to get a whole lot better.

  • It is integrated with third-party systems.

Archivematica is already integrated with DSpace, CONTENTdm, Islandora, LOCKSS, AtoM, DuraCloud, OpenStack and Archivist's Toolkit, and it's about to be integrated with ArchivesSpace!

  • It has an active community.

Archivematica has an active community, including a Google Group (check out the question we posed just this week). Check out their Twitter, GitHub, and Youtube accounts as well.

  • It improves and extends the functionality of AutoPro.

This one relates only to us, but Archivematica (with two notable exceptions) is more scalable, handles errors better and is easier to maintain than our homegrown tool AutoPro, which we've been using for the last three to five years or so to process digital materials.

  • It is constantly improving.

This is a big one. Artefactual Systems, Inc., in concert with Archivematica's users, are constantly improving the application. The fact that whenever one person or institution contributes resources, the entire community benefits was a big motivation for our involvement. You can even monitor the development roadmap to see where they're headed!

Archivematica's Character Flaws

That's a lot about what makes Archivematica awesome. But it's not perfect. In literature, a character flaw is a "limitation, imperfection, problem, phobia, or deficiency present in a character who may be otherwise very functional." [6] Archivematica's character flaws may be categorized as minor, major and tragic.

Minor Flaws

Minor flaws serve to distinguish characters for the reader, but they don't usually affect the story in any way. Think Scar's scar from The Lion King, which serves to distinguish him (a bit) from the archetypal villain, or the fact that King Arthur can't count to three in Monty Python and the Holy Grail (the Holy Hand Grenade of Antioch still gets thrown!).

I can think of these:

  • The responsive design is nice (even though I can't think of a time I'd ever be arranging anything on my cell phone), but the interface has something akin to Scar's scar: I don't know why the overlap between the button next to my username and "Connected" bothers me so much, but it does.
  • Also, who names their development servers after mushrooms?

Major Flaws

Major flaws are much more noticeable than minor flaws, and they are almost invariably important to the story's development. Think Anakin Skywalker's anger and fear of losing his wife Padme, which eventually consume him, leading to his transformation into Darth Vader, or Victor Frankenstein's excessive curiosity, leading to the creation of the monster that destroys his life.

Indeed, Archivematica has a few flaws that are important to this story's development.

Storage

Archivematica indexes AIPs, and can output them to a storage system, but, as the recent Preserving (Digital) Objects with Restricted Resources (POWRR) White Paper suggests, Archivematica is not a storage solution:

Notice all the gray above Storage?

If you're interested in long-term preservation, your digital storage system should be safe and redundant, perhaps using different storage mediums, with at least one copy in a separate geographic location. Since Archivematica does not store digital material, the onus is on the institution to get this right. 

To be fair, Archivematica has never focused on storage; from the beginning it has concentrated on producing--and later, indexing--a very robust AIP and on integrating with storage tools such as Arkivum, DuraCloud, LOCKSS and DuraSpace (including the recently launched ArchivesDirect, a new Archivematica/DuraSpace hosted service), as well as just about any other type of storage you can think of. I still feel I have to classify this as a major flaw, though, since quality storage is at the core of a good digital preservation system.

Active, Ongoing Management

A second major character flaw for Archivematica is that it is not a means for the active, ongoing management aspect of digital preservation, which is really what ensures that digital materials will be accessible over time as technologies change. Again, the POWRR White Paper:

Notice all the gray above Maintenance?

Archivematica doesn't currently have functionality to perform preservation migrations on AIPs that have already been processed. Even though I'd argue that this isn't as central to long-term preservation as quality storage is, it will eventually become an issue for institutions trying to maintain accessibility to digital objects over time.

Archivematica also does not have out-of-the-box functionality to do integrity checks on stored digital objects. Even though I have to admit that after recording an initial message digest, I haven't actually heard of a lot of "everyman" institutions performing periodic audits or doing them in response to a particular event, this seems like a deficiency in the area of file fixity and data integrity.

That being said, the 1.4 release of Archivematica is said to bring the beginnings of functionality to re-ingest digital content. Also, there is a command line fixity tool integrated with the Storage Service; it just isn't really usable out of the box for your typical archivist.

Documentation

I should have included this in last week's post about ArchivesSpace as well. Documentation issues are a "known issue" with many open source projects, and ArchivesSpace and Archivematica are no different. There have been a number of times where I have looked for some information on the Archivematica wiki (for example, on the Storage Service API, Version 1.4, etc.) and have found the documentation to be missing or incomplete. Lack of documentation can be a real barrier to implementation.

On the upside, documentation is something we can all contribute to (even if we aren't coders)! I for one am going to be looking into this, starting with this conversation.

An update! That was fast!

And this one:

Initial QC on Significant Characteristics

Some digital preservation systems and workflows perform checks on significant characteristics of the content of digital objects before and after normalization or migration. For example, if you're converting a Microsoft Word document to PDF/A, a system or workflow might check the word count on either end of that transformation. Currently, the only quality control that Archivematica does is to check that a new file exists, and that its size isn't zero (i.e., that it has data).

However, it is possible to add quality control functionality to Archivematica; it just isn't well documented (see above). In the FPR, you can define verification commands above and beyond the basic default commands. There's some more homework for me.
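To make the idea concrete, here is an illustrative sketch (my own, not Archivematica's code) of the two kinds of check described above: the basic default (the new file exists and has data) and a significant-characteristics comparison on word counts:

```python
# Illustrative only: the kinds of checks a verification command might run.
# A real FPR command would receive the derivative's path and exit nonzero on failure.
import os

def file_has_data(path):
    """Basic default-style check: the file exists and its size isn't zero."""
    return os.path.isfile(path) and os.path.getsize(path) > 0

def word_counts_match(source_text, derivative_text, tolerance=0.01):
    """Significant-characteristics check: word counts agree within a tolerance."""
    a, b = len(source_text.split()), len(derivative_text.split())
    return abs(a - b) <= tolerance * max(a, b, 1)

print(word_counts_match("the quick brown fox", "the quick brown fox"))  # True
print(word_counts_match("the quick brown fox", "fox"))                  # False
```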

Reporting

While Archivematica produces a lot of technical metadata about the digital objects in your collections, there isn't really a way to manage this information or share it with administrators via reports. Even basic facts, such as total extent or size of collection and distribution of file types or ages are not available in a user friendly way. This is true about collections as a whole, but also for individual Submission Information Packages (SIPs) or AIPs.

Two recent developments are worth mentioning. First, there's a tool Artefactual Systems developed for the Museum of Modern Art (MoMA): Binder (which just came out this week!). It addresses this problem by, for example, allowing you to look at the technical metadata of digital objects in a graphical user interface, run and manage fixity checks of preserved AIPs (receiving alerts if a fixity check fails), and generate and save statistical reports on digital holdings for acquisitions planning, preservation, risk assessment and budget management. Actually, it does even more than that, so be sure to check out the video. We can't wait to dig into Binder.

The second development has to do with our project! Part of the new Appraisal and Arrangement tab will be new reporting functionality to assist with appraisal. This will (we hope!) include technical information about the files themselves--some code may be borrowed from Binder--as well as information about Personally Identifiable Information (PII):

Transfer Backlog Pane

Tragic Flaws

Tragic flaws are a specific sort of flaw in an otherwise noble or exceptional character that bring about his or her own downfall and, often, eventual death. Think Macbeth's hubris or human sin in Christian theology.

While all of this is a little dramatic for our conversation here, there is one very important thing that Archivematica doesn't do:

Archivematica does NOT ensure that we never lose anything digital ever again.

Besides the fact that Archivematica suffers from all of the same "essential" difficulties in software engineering as ArchivesSpace (namely, complexity, conformity, changeability and invisibility--and for pretty much all of the same reasons, I might add), it is also not some kind of comprehensive "silver bullet" that will protect our digital material for all time. It's just not, which leads me to...

The Reveal! Why All of This is OK with Us

Actually, there is no such thing as a "comprehensive" digital preservation solution, so we can't really hold this against Artefactual Systems, Inc. Anne R. Kenney and Nancy McGovern, in "The Five Organizational Stages of Digital Preservation," say it best:

Organizations cannot acquire an out-of-the-box comprehensive digital preservation program— one that is suited to the organizational context in which the program is located, to the materials that are to be preserved, and to the existing technological infrastructure. Librarians and archivists must understand their own institutional requirements and capabilities before they can begin to identify which combination of policies, strategies, and tactics are likely to be most effective in meeting their needs.

Just like ArchivesSpace, Archivematica has a lot going for it. We are especially fond of its microservices design, its incremental agile development methodology, and its friendly and knowledgeable designers.

We love the fact that Archivematica is open source and community-driven, and we try to participate as fully as we can in that community, and we intend to do so even more in the future. We do that financially, obviously, but also by participating on the Google Group, contributing user stories for our project, and ensuring that the code developed for it will be made available to the public. You should too!

Conclusion: The Purpose of Dystopian Fiction

To have an effect on the reader, dystopian fiction has to have one other trait: familiarity. The dystopian society must call to mind the reader's own experience. According to Jeff Mallory, "if the reader can identify with the patterns or trends that would lead to the dystopia, it becomes a more involving and effective experience. Authors use a dystopia effectively to highlight their own concerns about society trends." Good dystopian fiction is a call to action in the present.

By focusing on automating the ingest process and producing a repository agnostic, normalized, and well-described (those METS files are huge!) AIP, and doing so in such a way that institutions of all sizes can do a lot or even a little with digital preservation, Archivematica addresses those concerns really well. That, coupled with the fact that staff there are also active in other community initiatives, such as the Hydra Metadata Working Group and IMLS Focus, definitely make them not only protagonists, but heroes in this story.

In the end, Archivematica is our call to action to be heroes in this story as well!

[1] Is it Dystopia? A flowchart for de-coding the genre by Erin Bowman is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. Based on a work at Feel free to share it for non-commercial uses.
[2] Dystopia (this version)
[3] "The Matrix Poster" by Source. Licensed under Fair use via Wikipedia -
[4] "1984first" by George Orwell; published by Secker and Warburg (London) - Brown University Library. Licensed under Public Domain via Wikipedia -
[5] "BraveNewWorld FirstEdition" by Source. Licensed under Fair use via Wikipedia -
[6] Character flaw (this version)

Tuesday, May 12, 2015

Maximizing Microservices in Workflows

Today I wanted to talk a little (maybe a lot?) about our development of ingest and processing workflows for digital archives at the Bentley Historical Library, with a focus on the role of microservices.

Workflow Resources

Maybe we should pause for a wee bit of context (we are archivists, after all!). As I mentioned in a previous post, our 2010-2011 MeMail project gave us a great opportunity to explore and standardize our procedures for preparing born-digital archives for long-term preservation and access. It was also a very fruitful period of research into emerging best practices and procedures.

At the time, there wasn't a ton of publicly available documentation on workflows established by peer institutions, but the following projects proved to be tremendously helpful in our workflow planning and development:
  • Personal Archives Accessible in Digital Media (paradigm) (2005-2007): an excellent resource for policy questions and considerations related to the acquisition, appraisal, and description of digital personal papers.  By not promoting specific tools or techniques (which would have inevitably fallen out of date in the intervening years), the project workbook has remained a great primer for collecting and processing digital archives.

  • AIMS Born-Digital Collections: An Inter-Institutional Model for Stewardship (2009-2011): another Mellon-funded project that involved the University of Virginia, Stanford University, University of Hull (U.K.) and Yale University.  The project blog provided a wealth of information on tools, resources, and strategies and their white paper is essential reading for any archivist or institution wrestling with the thorny issues of born-digital archives, from donor surveys through disk imaging.

  • Practical E-Records: Chris Prom's blog from his tenure as a Fulbright scholar at the University of Dundee's Center for Archive and Information Studies yielded a lot of great resources for policy and workflow development as well as reviews of handy and useful tools.

  • Archivematica: we first became aware of Archivematica at the 2010 SAA annual meeting, when Tim Pyatt and Seth Shaw featured it in a preconference workshop. While the tool was still undergoing extensive development at this point (version 0.6), the linear nature of its workflow and clearly defined nature of its microservices were hugely influential in helping us sketch out a preliminary workflow for our born-digital accessions.

Using the above as guidelines (and inspiration), we cobbled together a manual workflow that was useful in terms of defining functional requirements but ultimately not viable as a guide for processing digital archives due to the many potential opportunities for user error or inconsistencies with metadata collection, file naming, copy operations, etc.

These shortcomings led me to automate some workflow steps and ultimately produced the AutoPro tool and our current ingest and digital processing workflow:
  • Preliminary procedures to document the initial state of content and check for potential issues
  • Step 1: Initial survey and appraisal of content
  • Step 2: Scan for Personally Identifiable Information (primarily Social Security numbers and credit card numbers)
  • Step 3: Identify file extensions
  • Step 4: File format conversions
  • Step 5: Arrangement and Description
  • Step 6: Transfer and Clean Up


One of the greatest lessons I took from the AIMS project and Archivematica was the use of microservices; that is, instead of building a massive, heavily interdependent system, I defined functional requirements and then identified a tool that would complete the necessary tasks.  These tools could then be swapped out or shifted around in the workflow to permit greater flexibility and easier implementation.

Rather than dwell too extensively on the individual procedures in our workflow (that's what the manual is for!), I would like to provide some examples of how we accomplish steps using various command prompt/CMD.EXE utilities as microservices.  Having said that, I feel compelled to call attention to the following:

  • I am an archivist, not a programmer; at one point, I thought I could use Python for AutoPro, but quickly realized I had a much better chance of stringing something together with Windows CMD.EXE shell scripts, as they were easy to learn and use. Even then, I probably could have done better... give a holler if you see any egregious errors!

  • For a great tutorial on commandline basics, see A/V PReserve's "Introduction to Using the Command Line Interface for Working with Files and Directories."

  • As I hinted above, we're a Windows shop and the following code snippets reflect CMD.EXE commands.  Many of these applications can be run on Mac/Linux machines via native versions or WINE.

  • The CMD.EXE shell needs to know where non-native applications/utilities are located; users should CD into the appropriate directory or include the full systems path to the application in the command.

  • If any paths (to applications or files) contain spaces, you will need to enclose the path in quotation marks.

  • The output of all these operations is collected in log files (usually by redirecting STDOUT) so that we have a full audit trail of operations and a record of any actions performed on content.

  • In the code samples, 'REM' is used to comment out notes and descriptions.

Preliminary Procedures

Upon initiating a processing session, we run a number of preliminary processes to document the original state of the digital accession and identify any potential problems. 

Virus Scan

The University of Michigan has Microsoft System Center Endpoint Protection installed on all its workstations.  Making the best of this situation, we use the MpCmdRun.exe utility to scan content for viruses and malware, first checking to make sure the antivirus definitions are up to date:

 REM _procDir=Path to processing folder  
 "C:\Program Files\Microsoft Security Client\MpCmdRun.exe" -SignatureUpdate -MMPC  
 "C:\Program Files\Microsoft Security Client\MpCmdRun.exe" -scan -scantype 3 -file %_procDir%  
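
Since we redirect STDOUT into log files throughout the workflow, the scan step can be captured and checked along these lines (the log name and location are assumptions, not our exact production layout):

```batch
REM Hypothetical sketch: log the scan output and flag a non-zero exit code
REM _procDir=Path to processing folder; virusScan.log is an assumed name
"C:\Program Files\Microsoft Security Client\MpCmdRun.exe" -scan -scantype 3 -file %_procDir% > "%_procDir%\..\virusScan.log" 2>&1
IF NOT %ERRORLEVEL%==0 ECHO WARNING: scan returned a non-zero code; review virusScan.log
```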

Initial Manifest

Content is stored in our interim repository using the Library of Congress BagIt specification.  When ingest and processing procedures commence, we create a new document to record the structure and size of the accession using diruse.exe and md5deep:

 REM _procDir=Path to processing folder  
 diruse.exe /B /S %_procDir%  
 CD /D %_procDir%  
 md5deep.exe -rclzt *   

Diruse.exe will output the entire directory hierarchy (thanks to the /S option), providing the number of files and their size (in bytes, due to the /B option) for each subdirectory, in addition to the total number of files and size for the main directory.

For md5deep, changing to the processing directory will facilitate returning relative paths for content.  Our command includes the following parameters:
  • -r: recursive mode; will traverse the entire directory structure
  • -c: produces comma separated value output
  • -l: outputs relative paths (as dictated by location on command prompt)
  • -z: returns file sizes (in bytes)
  • -t: includes timestamp of file creation time
  • *: the asterisk indicates that everything in the present working directory will be included in output.
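
As noted above, we keep an audit trail by redirecting STDOUT; capturing both manifests might look like the following (the log file names here are hypothetical):

```batch
REM _procDir=Path to processing folder; output file names below are assumptions
diruse.exe /B /S %_procDir% > "%_procDir%\..\directoryStructure.txt"
CD /D %_procDir%
md5deep.exe -rclzt * > ..\initialManifest.csv
```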

Extract Content from Archive Files

In order to make sure that content stored in archive files is extracted and run through important preservation actions, we search for any such files and use 7-Zip to extract their contents.

First, we search the processing directory for any archive files, and save the full path to a text file:

 CD /D %_procDir%  
 DIR /S /B *.zip *.7z *.xz *.gz *.gzip *.tgz *.bz2 *.bzip2 *.tbz2 *.tbz *.tar *.lzma *.rar *.cab *.lza *.lzh | FINDSTR /I /E ".zip .7z .xz .gz .gzip .tgz .bz2 .bzip2 .tbz2 .tbz .tar .lzma .rar .cab .lza .lzh" > ..\archiveFiles.txt  

The dir utility (similar to "ls" on a Mac or Linux terminal) employs the /S option to recursively list content and the /B option to return full paths.  The list of file extensions (by no means the best way to go about this, but...) will only return paths that match this pattern.  For greater accuracy, we then pipe ("|") this output to the findstr ("find string") command, which uses the /I option for a case-insensitive search and /E to match content at the end of a line.

We then iterate through this list with a FOR loop and send each file to our ":_7zJob" extraction function with the filename (%%a) passed along as a parameter:

 FOR /F "delims=" %%a in (..\archiveFiles.txt) DO (CALL :_7zJob "%%a")  
 REM when loop is done, GOTO next step  
 GOTO :nextStep  
  
 :_7zJob  
 REM Create folder in _procDir with the same name as archive; if folder already exists, get user input  
 SET _7zDestination="%~dpn1"  
 MKDIR %_7zDestination%  
 REM Run 7zip to extract files  
 7z.exe x %1 -o%_7zDestination%  
 REM Record results (both success and failure)  
 IF %ERRORLEVEL% EQU 0 (  
      GOTO :EOF  
 ) ELSE (  
      REM Extraction failed; production script offers a retry option  
      GOTO :EOF  
 )  

As the path to each archive file is sent to :_7zJob, we use CMD.EXE's built-in parameter extension functionality to isolate a folder path with the same root name as the archive file (%~dpn1; Z:\unprocessed\9834_0001\newsletters.zip thus would yield Z:\unprocessed\9834_0001\newsletters).  This path will be the destination for files extracted from a given archive file; we save it as a variable (%_7zDestination%) and create a folder with the MKDIR command.

We then run 7-Zip, using the 'x' option to extract content from the archive file (represented by %1) and use the -o option to send the output to our destination folder.  Finally we check the return code (%ERRORLEVEL%) for 7-Zip; if it is not equal to 0 then extraction has failed.  Our production script includes an option to retry the operation.

Length of Paths

Because Windows cannot handle file paths longer than 255 characters, we run tlpd.exe ("Too Long Paths Detector") to identify any files or directories that might cause us trouble.

 REM _procDir=Path to processing folder  
 START "TOO LONG PATHS" /WAIT tlpd.exe %_procDir% 255  

As we're calling this application from a batch (".bat") file, I use the START command to launch it in a new shell window and add the /WAIT option so that the script will not proceed to the next operation until this is complete.  The "255" parameter lets you specify the path length threshold, as tlpd.exe lets you adjust the search target.

Step 1: Initial Survey

In the initial review and survey phase of the workflow, AutoPro incorporates a number of applications to view or render content (Quick View Plus, Irfanview, VLC Media Player, Inkscape) and also employs TreeSize Professional and several Windows utilities to analyze and characterize content.  We'll take a closer look at these latter tools in a forthcoming post on appraising digital content.

Step 2: PII Scan

This step nicely illustrates the flexibility of a microservice approach to workflow design, as we are currently using our third different application for this process.  Early iterations of the workflow employed the Cornell Spider, but the high number of false positives (e.g., nine-digit integers interpreted as SSNs) made reviewing scan results highly labor-intensive.  (Cornell no longer hosts a copy, but you can check it out in the Internet Archive.)

We next employed Identity Finder, having learned of it from Seth Shaw (then at Duke University).  This tool was much more accurate and included functionality to redact information from plain text and Microsoft Office Open XML files.  At the same time, Identity Finder was rather expensive, and a change in its enterprise pricing at the University of Michigan (along with the open source nature of our Mellon grant development) has led us to a third solution: bulk_extractor.

Already featured in Archivematica and a prominent component of the BitCurator project, bulk_extractor provides a rich array of scanners and comes with a viewer to inspect scan results.  I am in the process of rewriting our PII scan script to include bulk_extractor (ah...the glory of microservices!) and will probably end up using some variation on the following command:

 bulk_extractor -o "Z:\path\to\output\folder" -x aes -x base64 -x elf -x email -x exif -x gps -x gzip -x hiberfile -x httplogs -x json -x kml -x msxml -x net -x rar -x sqlite -x vcard -x windirs -x winlnk -x winpe -x winprefetch -R "Z:\path\to\input"  

We are only using a subset of the available scanners; the "-x" options are instructing bulk_extractor to exclude certain scanners that we aren't necessarily interested in.
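
bulk_extractor writes its results as plain-text "feature files" in the output folder; a quick way to spot-check for hits from the command prompt is to list any non-empty ones (the file names follow bulk_extractor's defaults, e.g. accts.txt, email.txt, telephone.txt):

```batch
REM Sketch: list non-empty feature files in the bulk_extractor output folder
CD /D "Z:\path\to\output\folder"
FOR %%f IN (*.txt) DO IF %%~zf GTR 0 ECHO %%f (%%~zf bytes)
```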

We're particularly interested in exploring how the BEViewer can be integrated into our current workflow (and possibly into Archivematica's new Appraisal and Arrangement tab? We'll have to see...).  In any case, here's an example of how results are displayed and viewed in their original context:

Step 3: Identifying File Extensions

The identification of mismatched file extensions is not a required step in our workflow; it is intended solely to help end-users access and render content.  

As a first step, we run the UK National Archives' DROID utility and export a report to a .csv file.  Before running this command, we open up the tool preferences and uncheck the "create md5 checksum" option so that the process runs faster. 

 REM Generate a DROID report  
 REM _procDir = processing directory  
 java -jar droid-command-line-6.1.5.jar -R -a "%_procDir%" -p droidExtensionProfile.droid   
 REM Export report to a CSV file  
 java -jar droid-command-line-6.1.5.jar -p droidExtensionProfile.droid -e extensionMismatchReport.csv   

In the first command, DROID recursively scans our processing directory and outputs to our profile file (droidExtensionProfile.droid).  In the second, we export this profile to a .csv file, one column of which indicates file extension mismatch with a value of true (the file extension does not match the format profile detected by DROID) or false (extension is not in conflict with profile).

Basic CMD.EXE is pretty lousy at parsing .csv files, so I do one extra step and convert this .csv file to a tab-delimited file, using a Visual Basic script I found somewhere on the Internets.  (This is getting ugly--thanks for sticking with us!)
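
That conversion amounts to a one-line call from the batch file; the script name here is a stand-in, as any CSV-to-tab-delimited VBScript will do:

```batch
REM CsvToTab.vbs is a placeholder name for the Visual Basic conversion script
cscript //nologo CsvToTab.vbs "extensionMismatchReport.csv" > "extensionMismatchReport.tsv"
```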

We then loop through this tab delimited file and pull out all paths that have extension mismatches:

 REM Note: the delims= value and the FINDSTR search string each contain a single literal tab character  
 FOR /F "usebackq tokens=4,13,14,15 delims=	" %%A in (`FINDSTR /IC:"	true	" "extensionMismatchReport.tsv"`) DO CALL :_fileExtensionIdentification "%%A" "%%B" "%%C" "%%D"  

Once again we use our FOR loop, with the tab character set as the delimiter.  We loop through each line of our extension mismatch report, looking for rows where DROID returned "true" in the extension mismatch column, then pull information from four columns and pass these as arguments to our ":_fileExtensionIdentification" function: 4 (full path to content), 13 (file extension; employed to identify files with no extension), 14 (PUID, or PRONOM Unique IDentifier), and 15 (mime type).

Once this information is passed to the function, we first run the TrID file identifier utility:

 trid.exe %_file%  

Based upon the file's binary signature, TrID will present the likelihood of the file being a format (and extension) as a percentage:

Because the output from this tool may be indeterminate, we also use curl to grab the PRONOM format profile (using the PUID as a variable in the command), save this information to a file, and then look for any signature tags that will enclose extension information:

 REM The PRONOM URL below is an assumption; %_puid% holds the PUID passed to the function  
 curl.exe "http://www.nationalarchives.gov.uk/pronom/%_puid%" > pronom.txt  
 TYPE pronom.txt | FINDSTR /C:"<Signature>"  

The TYPE command will print a file to STDOUT and we then pipe this to FINDSTR to identify only those lines that include extensions.

Based upon the information from these tools, the archivist may elect to assign a new extension to a file (which choice is recorded in a log file) or simply move on to the next file if neither utility presents compelling evidence.

Step 4: Format Conversion

Following the lead of Archivematica, we've chosen to create preservation copies of content in 'at-risk' file formats as a primary preservation strategy.  In developing our conversion pathways, we conducted an extensive review of community best practices and were strongly influenced by the Library of Congress's "Sustainability of Digital Formats", the Florida Digital Archive's "File Preservation Strategies", and Archivematica's format policies.

This step involves searching for "at-risk" formats by extension (another reason we've incorporated functionality for file extension identification) and then looping through each list and sending content to different applications. We also calculate an eight character CRC32 hash for each original file and append it to the new preservation copy to (a) avoid file name collisions and (b) establish a link between the preservation and original copies.  Below are some of our most common conversion operations:
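
As a sketch of that naming convention (this is not our production code; it assumes 7-Zip's 'h' hash command and a particular layout of its output):

```batch
REM Hypothetical sketch: derive an 8-character CRC32 with 7-Zip and build the preservation name
REM _original=path to original file; _outDir and _baseName are assumed variables
FOR /F "tokens=4" %%c IN ('7z.exe h -scrcCRC32 "%_original%" ^| FINDSTR /C:"for data"') DO SET _crc32=%%c
SET _preservation=%_outDir%\%_baseName%_%_crc32%
REM e.g. newsletter.bmp might yield a preservation copy named newsletter_1A2B3C4D.tif
```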

Raster Images: .bmp .psd .pcd .pct .tga --> .tif (convert.exe utility from ImageMagick)

 convert.exe "%_original%" "%_preservation%.tif"  

Vector Images: .ai .wmf .emf --> .svg (Inkscape)

 inkscape.exe -f "%_original%" -l "%_preservation%.svg"  

.PDF --> .PDF/A (Ghostscript)

 gswin64.exe -sFONTPATH="C:\Windows\Fonts;C:\Program Files\gs\gs9.15\lib" -dPDFA -dBATCH -dNOPAUSE -dEmbedAllFonts=true -dUseCIEColor -sProcessColorModel=DeviceCMYK -dPDFACompatibilityPolicy=1 -sDEVICE=pdfwrite -sOutputFile="%_preservation%" "%_original%"  

In the above example, I'm using a 64-bit version of Ghostscript.  I won't even try to unpack all the options associated with this command, but check out the GS documentation for more info.  Note that if you update the PDFA_def.ps file with the location of an ICC color profile, you will need to use double backslashes in the path information.

Audio Recordings: .wma .ra .au .snd --> .wav (FFmpeg)

 REM Use FFprobe to get more information about the recording  
 ffprobe.exe -loglevel panic "%_original%" -show_streams > ffprobe.txt  
 REM Parse FFprobe output to determine the number of audio channels  
 FOR /F "usebackq tokens=2 delims==" %%c in (`FINDSTR /C:"channels" ffprobe.txt`) DO (SET _audchan=%%c)  
 REM Run FFmpeg, using the %_audchan% variable  
 ffmpeg.exe -i "%_original%" -ac %_audchan% "%_preservation%.wav"  

Video Files: .flv .wmv .rv .rm .rmvb .mts --> .mp4 with h.264 encoding (FFmpeg)

 REM Use FFprobe to get more information about the recording  
 ffprobe.exe -loglevel panic "%_original%" -show_streams > ffprobe.txt  
 REM Parse FFprobe output to determine the number of audio channels  
 FOR /F "usebackq tokens=2 delims==" %%c in (`FINDSTR /C:"channels" ffprobe.txt`) DO (SET _audchan=%%c)  
 REM Run FFmpeg, using the %_audchan% variable  
 ffmpeg.exe -i "%_original%" -ac %_audchan% -vcodec libx264 "%_preservation%.mp4"  

Legacy Word Processing Files: .wp .wpd  .cwk .sxw .uot .hwp .lwp .mcw .wn --> .odt (LibreOffice)

 REM Run LibreOffice as a service and listening on port 2002  
 START "Libre Office" /MIN "C:\Program Files (x86)\LibreOffice 4\program\soffice.exe" "-accept=socket,port=2002;urp;" --headless  
 REM Run DocumentConverter python script using the version of python included in LibreOffice.  
 "C:\Program Files (x86)\LibreOffice 4\program\python.exe" DocumentConverter.py "%_original%" "%_preservation%.odt"  

This conversion requires the PyODConverter python script.

Microsoft Office Files: .doc .xls .ppt --> Office Open XML (OMPM)

This operation requires the installation of Microsoft's Office Compatibility Pack and Office Migration Planning Manager Update 1 (OMPM).  Before running, the C:\OMPM\Tools\ofc.ini file must be modified to reflect the "SourcePathTemplate" and the "DestinationPathTemplate" (examples are in the file).  Once modified, the OFC.EXE utility will run through and convert all legacy Office file formats to the 2010 version of Office Open XML with the following command:
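
The command itself is short; a minimal sketch, assuming a default OMPM installation under C:\OMPM and an already-edited ofc.ini:

```batch
REM Assumes OMPM's default install location; ofc.ini must already contain the source/destination paths
CD /D C:\OMPM\Tools
OFC.EXE ofc.ini
```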


Step 5: Arrangement, Packaging, and Description

This step involves a number of applications for conducting further reviews of content and also employs 7-Zip to package materials in .zip files and a custom Excel user form for recording descriptive and administrative metadata.  We'll explore this functionality in more depth in a later post.

Step 6: Transfer and Clean Up

To document the final state of the accession (especially if preservation copies have been created or materials have been packaged in .zip files), we run DROID a final time.  After manually enabling the creation of md5 checksums, we employ the same commands as used before:

 REM Generate a DROID report  
 REM _procDir = processing directory  
 java -jar droid-command-line-6.1.5.jar -R -a "%_procDir%" -p droidProfile.droid   
 REM Export report to a CSV file  
 java -jar droid-command-line-6.1.5.jar -p droidProfile.droid -e DROID.csv   

We then use the Library of Congress's BagIt tool to 'bag' the fully processed material and (to speed things up) copy it across the network to a secure dark archive using TeraCopy.

 REM _procDir = Processing directory  
 bagit-4.4\bin\bag baginplace %_procDir% --log-verbose   
 REM We then use TeraCopy to move the content to our dark archive location  
 teracopy.exe COPY %_procDir% %_destination% /CLOSE  

An additional copy of material will then be uploaded to Deep Blue, our DSpace repository.


I should also mention we record PREMIS event information for all preservation actions at the accession level.  Because I had no idea how to work with XML when we started this, we write the following elements to .csv files:
  • eventType: Name or title of the event (e.g., "virus scan").
  • eventIdentifierType: We're using UUIDs to identify events.
  • eventIdentifierValue: A UUID to uniquely identify the event.
  • eventDateTime: Timestamp for when the event concluded.
  • eventDetail: Note providing additional information for the event.
  • eventOutcome: Was the process completed?  (Completion indicates success.)
  • linkingAgentIdentifierType: We use MARC21 codes to identify agents.
  • linkingAgentIdentifierValue: MiU-H (Our MARC21 code.)
  • linkingAgentRole: Executor (i.e., the library executed this action).
  • linkingAgentIdentifierType: "Tool" (we use this second agent record to identify software used in events).
  • linkingAgentIdentifierValue: Name and version of software.
  • linkingAgentRole: "Software"
Click here for an example of one of these PREMIS spreadsheets.
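
To give a concrete sense of how those elements land in a flat file, appending one event row from a batch script could be sketched as follows (the log path and the PowerShell call for UUID generation are assumptions, not necessarily how our production scripts do it):

```batch
REM Hypothetical sketch: append a single PREMIS event row to an accession-level .csv
REM _premisLog=path to the accession's PREMIS spreadsheet (assumed variable)
FOR /F %%u IN ('powershell -command "[guid]::NewGuid().ToString()"') DO SET _eventUUID=%%u
ECHO virus scan,UUID,%_eventUUID%,%DATE% %TIME%,Scanned all content with MpCmdRun.exe,Completed,MARC21,MiU-H,Executor,Tool,Microsoft System Center Endpoint Protection,Software>> "%_premisLog%"
```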


So that's it--you can go get on with your day and I'll get back to this grant project.  Please give us a holler if you have any questions or suggestions as to how some of this could work more efficiently.  We're excited that before long all these CMD.EXE scripts will be a thing of the past, but they've treated us pretty well so far and maybe this post will help you out with some things.

Stay tuned for future posts on appraisal strategies, packaging techniques, ArchivesSpace, Archivematica, DSpace, and more!