Friday, May 22, 2015

Exten(t)uating Circumstances: 80 Years of Descriptive Practices and the Long Tail(s) of Extents

It all started with a simple error:

Error: #<:ValidationException: {:errors=>{"extents"=>["At least 1 item(s) is required"]}}>

This is the error we got when we tried to import EADs into ArchivesSpace with extent statements that began with text, such as "ca." or "approx." So ArchivesSpace likes extent statements that begin with numbers. Fine. Easy fix. Problem solved.

And it was an easy fix... until we started getting curious.

The Extent (Get It!) of the Problem

As we did our original tests importing legacy EADs into ArchivesSpace (thanks, Dallas!), we started noticing that extents weren't importing quite the way we expected. As it turns out, ArchivesSpace imports the entire statement from EAD's <physdesc><extent> element as the "Whole" extent, with the first number in the statement imported as the "Number" of the extent and the remainder of the statement imported as the "Type":

An Extent in ArchivesSpace
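Based on that behavior, the importer's split can be approximated with a short regular expression. This is just a sketch of the behavior we observed, not the actual ArchivesSpace import code, and the extent statement is a made-up example:

```python
import re

def split_extent(statement):
    """Approximate the importer's behavior: the first number in the
    statement becomes the Number, and the remainder becomes the Type."""
    match = re.match(r'\s*([\d.,]+)\s*(.*)', statement)
    if match:
        return {'number': match.group(1), 'extent_type': match.group(2)}
    # statements that don't begin with a number trigger the
    # "At least 1 item(s) is required" validation error
    return None

split_extent('10 linear feet and 7.62 MB (online)')
# -> {'number': '10', 'extent_type': 'linear feet and 7.62 MB (online)'}
split_extent('ca. 100 photographs')  # -> None
```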

This results in issues such as the one above, where the number imports fine, but type imports incorrectly. "linear feet and 7.62 MB (online)" is actually a Type plus another extent statement with its own Number, Type and Container Summary. This would be more accurately represented by breaking the extent into two "Part" portions.
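Sketched as ArchivesSpace extent records (the field names, such as portion, number, extent_type and container_summary, come from ArchivesSpace's extent model; the numbers are made up for illustration), the difference looks like this:

```python
# How the importer records the statement: one "Whole" extent
# with everything after the first number crammed into the Type.
whole = {
    'portion': 'whole',
    'number': '10',
    'extent_type': 'linear feet and 7.62 MB (online)',
}

# A more accurate representation: two "Part" extents, each with
# its own Number, Type and (where needed) Container Summary.
parts = [
    {'portion': 'part', 'number': '10', 'extent_type': 'linear feet'},
    {'portion': 'part', 'number': '7.62', 'extent_type': 'MB',
     'container_summary': '(online)'},
]
```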

This also makes for a very dirty "Type" dropdown list:

I've highlighted the only type that should really be there.

Now, this isn't actually a problem for import to ArchivesSpace. But it is a problem. In the end, we decided to take a closer look at extents to clean them up. That's fun, right? In hindsight, our initial excitement about this was probably a little naive. We were dealing with 80 years of highly varied descriptive practices, after all.

Getting Extents

In his last post, Dallas started to detail how we "get" elements from EADs ("get" here means going through our EADs, grabbing extents, and printing them with their filename and location to a CSV for closer inspection and cleaning). In case you're wondering how exactly we got extents, here is our code (and feel free to improve it!):


 # import what we need
 from lxml import etree
 import csv
 import os
 from os.path import join
 # where are the eads?
 ead_path = 'path/to/EADs' # <-- you have to change this
 # where is the output csv?
 output_csv = 'path/to/output.csv' # <-- you have to change this
 # "top level" extents xpath
 extents_xpath = '//ead/archdesc/did//physdesc/extent'
 # component extents xpath
 component_extents_xpath = '//ead/archdesc/dsc//physdesc/extent'
 # all extents xpath
 all_extents = '//extent'
 # open and write header row of csv
 with open(output_csv, 'ab') as csv_file:
   writer = csv.writer(csv_file, dialect='excel')
   writer.writerow(['Filename', 'XPath', 'Original Extent'])
 # create a function to get extents
 def getextents(xpath):
   # go through those files
   for filename in os.listdir(ead_path):
     tree = etree.parse(join(ead_path, filename))
     # keep up with where we are
     print "Processing ", filename
     # grab all extents matching the xpath
     extents = tree.xpath(xpath)
     for i in extents:
       extent = i.text
       extent_path = tree.getpath(i)
       with open(output_csv, 'ab') as csvfile:
         writer = csv.writer(csvfile, dialect='excel')
         # flag blank extents so they're easy to spot in the csv
         if extent is None or not extent.strip():
           writer.writerow([filename, extent_path, 'ISSUE EXTENT'])
         else:
           writer.writerow([filename, extent_path, extent])
 # get extents
 getextents(all_extents) # <-- you'll have to change this to get the extents you want: "top level," component level or all (i want all)

We weren't exactly thrilled with what we found.

The Long Tail(s) of Extents

Our intern, Walker Boyle, put together histograms of what we found for both top-level extents and component extents, and I converted them into graphs. You need to click them to get the full effect.
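Once the CSV exists, a tally like Walker's can be reproduced in a few lines (a sketch that assumes the header row written by the script above):

```python
import csv
from collections import Counter

def extent_histogram(csv_path):
    """Count how often each raw extent statement appears,
    most frequent first."""
    counter = Counter()
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            counter[row['Original Extent']] += 1
    return counter.most_common()
```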



How We're Thinking About Fixing Extents (How Comes Later)

As you can see, we had a bit of a problem on our hands. Our extents are very dirty (perhaps that's an understatement!). We decided to go back to square one. Lead Archivist for Description and Workflow Management Olga Virakhovskaya and I sat down to try to come up with at least a short list of extent types. For just the top-level extents (2,800+), this was a 3 1/2 hour process (3 1/2 hours!). We didn't even want to think about how long it would take to go through the nearly 59,000 component-level extents. (I just did the math. It would take two business weeks.) To make matters worse, by the end of our session we realized that our thoughts about extents were evolving, and that the list we had started creating at the beginning was different from the list we were creating at the end.

Frustrated, we got back together with the A-Team for further discussion, deliberating on the following topics.


What Does DACS Say?

Our first thought was to turn to Describing Archives: A Content Standard (DACS). However, it turns out that DACS is pretty loosey-goosey when it comes to extents, especially in the section on Multiple Statements of Extent:

These examples are all over the place!

Needless to say, this didn't help us much.

Human Readable vs. Machine-Actionable Extents

We realized that part of the issue arises from the fact that for pretty much our entire history, the text of extent statements has been recorded for the human eyes that will be looking at them, and for those eyes only. ArchivesSpace affords the opportunity for this information to be much more granular and machine-readable (and therefore potentially machine-actionable). For instance, we've thought that perhaps we could bring together all extents of a certain Type and add their Numbers together to get a total. This wouldn't have been possible before, but it might be in ArchivesSpace, depending on how well we clean up the extents.
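As a sketch of what that machine-actionability could look like, once Numbers and Types are clean and consistent, a rollup becomes trivial (this assumes extents reduced to (number, type) pairs; the values below are made up):

```python
from collections import defaultdict

def total_by_type(extents):
    """Sum extent Numbers per Type.
    `extents` is an iterable of (number, extent_type) pairs."""
    totals = defaultdict(float)
    for number, extent_type in extents:
        totals[extent_type] += float(number)
    return dict(totals)

total_by_type([('2', 'linear feet'), ('0.5', 'linear feet'), ('7.62', 'MB')])
# -> {'linear feet': 2.5, 'MB': 7.62}
```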

To oversimplify, we decided (at least for the time being) that as we normalize extents we'd like to find a happy medium between flexibility and human-readableness on the one hand, and potential machine-actionability (and consistency for consistency's sake) on the other.
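In practice, that happy medium might start as a lookup table from variant spellings to a canonical shortlist. The variants below are hypothetical (our real shortlist is still a work in progress):

```python
# hypothetical variant spellings mapped to one canonical type
TYPE_NORMALIZATION = {
    'lin. ft.': 'linear feet',
    'linear ft.': 'linear feet',
    'Linear Feet': 'linear feet',
    'linear foot': 'linear feet',
}

def normalize_type(extent_type):
    """Return the canonical form of an extent type, or the stripped
    original if we haven't mapped it yet."""
    cleaned = extent_type.strip()
    return TYPE_NORMALIZATION.get(cleaned, cleaned)
```

Keeping the lookup table in one place means the shortlist can keep evolving without touching the cleanup code.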

Why Are We Recording This Information Again?

Finally, as with many things in library- and archives-land, every once in a while you find yourself asking, "Why are we doing this again?" This case was no different. We started to really ask ourselves why we were recording this information in the first place, hoping that would inform the creation of a shortlist and a way to move forward.

We turned to user stories to try to figure out the ways that extents might or could get used. That is, not the way they have been or do get used, or even how they will get used, but all the ways they might get used. We thought of these:

First, from the perspective of a researcher...

  1. As a researcher, I need to be able to look at a collection's description and be able to tell quickly how large it is so that I know if I should plan to stay an hour or a week, or look at a portion of a collection or the whole thing.
  2. As a researcher, I'm looking for specific materials (photographs, drawings, audio recordings, etc.) 
  3. As an inexperienced researcher, I don’t know that this information may be found in Scope and Content notes.

And from the perspective of archivists...

  1. As an archivist, I’d like to know how much digital material I have, how much is unique (i.e., born-digital), and how much is not (digitized). This would also be true for microfilmed material.
  2. As an archivist, I need a way to know how much (and what kind) of material I have (e.g., 3,000 audiocassettes; 5,000 VHS tapes, &c.).
  3. As a curation archivist, I need an easy way to distinguish between different types of film across collections (e.g., 8 mm, 16 mm, 35 mm, 2-inch) because the vendor we've selected for digitization only does one or some of these types.
  4. As a curation archivist, I’m working on a better locations/stacks management system. I need to know the physical volume of holdings and the types of physical formats and sizes. 
  5. As a curation archivist, I need a way to know which legacy collections contain obsolete storage media (such as floppy disks of different sizes) so that I can process this digital material, or decide on equipment purchases.
  6. As a reference archivist, I need an easy way to distinguish between different types of film in a collection so that I know whether we have the equipment on site for researchers to view this material.

As you can see, this is a lot to think about!

The Solution

I know you'd really like to know our solution. Well, we've taken care of the easy ones:

Other than the easy ones, however, progress is slow. We're continuing to try to create user stories to inform our thinking, to create a short list of extent types, and to make plans for addressing common extent type issues.

A future post will detail some of the OpenRefine magic we're doing to clean up extents, and another will explain exactly how we're handling these issues and reintegrating them back into the original EADs, code snippets and all. Stay tuned!

In the meantime, why not leave a comment and let us know how and why you use extents!


  1. Have you considered doing this kind of cleanup using XSLT? XSLT (especially v2) is well suited to the job.

    1. That's a great suggestion! And timely! Mike actually just finished up an ARL and DLF workshop entitled "Transforming Library Metadata with XSLT."

      We're eager to put what he's learned to work transforming our XML. Do you have any resources you'd recommend for those of us just getting started with XSLT?