Tuesday, May 26, 2015

ArchivesSpace Dating Advice

As Max detailed in his recent post on extents, there are some aspects of our EADs that are not necessarily wrong (i.e., won't cause any errors when importing into ArchivesSpace), but that are not optimized to take full advantage of potential reporting or searching functionality in ArchivesSpace or other systems. Whereas Max described some of the problems we have with our extent statements, this post will take a look at another aspect of our EADs that we initially thought would be a simple, easy, quick fix... until we learned more: dates.

Dates in our Current Finding Aids

Currently, our dates are encoded with <unitdate> tags in our EADs, but lack a "normal" attribute containing a normalized, machine-readable version of the date.

As an example, our dates might currently look like this: <unitdate type="inclusive">May 26, 2015</unitdate>
As opposed to this: <unitdate type="inclusive" normal="2015-05-26">May 26, 2015</unitdate>

Until now, this has not really been a problem. As you can see from an example such as the Adelaide J. Hart papers, our dates are presented to users as plain text in our current finding aid access system. Under the hood, those dates are encoded as <unitdate> elements, but our access system has no way to search or facet by date. As such, the access system has never needed a normalized, machine-readable form of dates. But what about ArchivesSpace?

Dates in ArchivesSpace

Before getting into what happens to our legacy <unitdate> elements when imported into ArchivesSpace, let's take a look at a blank ArchivesSpace date record.



Based on all of the date fields provided by ArchivesSpace, we can already see here that we're moving beyond plain text representation of our dates. Of particular interest for the purposes of this blog post are the fields for "Expression," "Begin," and "End." Hovering over the * next to "Expression" brings up the following explanation of what that field represents:



What this means is that the "Expression" field will essentially recreate the plain text, human-understandable version of dates that we have been using up until now. Simple enough.

Once we take a look at the "Begin" and "End" fields, however, we can start to see where our past practice and future ArchivesSpace possibilities come into conflict. The "Begin" and "End" fields give us the ability to record normalized versions of our dates that ArchivesSpace (and other systems) can understand. This is definitely functionality that we will want to use going forward, but what does this mean for our legacy data?

Let's see what happens to our dates when we import one of our legacy EADs into ArchivesSpace.


The ArchivesSpace EAD importer took the contents of a <unitdate> tag and made a date expression of 1854-1888. It did not, however, make a begin date of 1854 or an end date of 1888. Why not? Lines 168-188 of the ArchivesSpace EAD importer can help us understand.


We'll get into more detail about making sense of the ArchivesSpace EAD importer in future posts about creating our custom EAD importer, but for now we'll take a higher-level look at what this portion of the EAD importer is doing. It takes a <unitdate> tag and makes an ArchivesSpace date record using various components of that tag and its attributes. At line 178, the importer makes a date expression with the inner_xml of the <unitdate> tag, that is, the text between the opening and closing <unitdate></unitdate> tags, essentially recreating the plain text version of the dates that we currently have. But how is it making normalized begin and end dates?

On lines 180 and 181, the EAD importer is making a begin date with norm_dates[0] and an end date with norm_dates[1]. If we look at lines 170-174, we can see how those norm_dates are being made. The ArchivesSpace EAD importer is looking for a normal attribute (represented in the EAD importer as "att('normal')") in the <unitdate> tag and splitting the contents of that attribute on a forward slash to get the begin date (norm_dates[0]) and end date (norm_dates[1]).
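To make that splitting step concrete, here is a rough Python illustration of the logic described above. It is not the importer's own code (which is written in Ruby). The function name begin_and_end is hypothetical, and the handling of a normal value with no slash is an assumption of this sketch rather than something we have confirmed in the importer.

```python
def begin_and_end(normal):
    """Split a unitdate 'normal' attribute value into (begin, end).

    Hypothetical helper that mirrors the split-on-slash idea only.
    """
    if not normal:
        # No normal attribute: nothing to split, so no begin or end date.
        return (None, None)
    parts = normal.strip().split("/")
    begin = parts[0]
    # Assumption: a value without a slash yields a begin date and no end date.
    end = parts[1] if len(parts) > 1 else None
    return (begin, end)


print(begin_and_end("1854/1888"))
print(begin_and_end(None))
```

Without a normal attribute there is nothing to split, which is why our imported date ended up with an expression but no begin or end.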

In order for our example imported date above to have begin and end dates, the <unitdate> tag should look like this:

<unitdate type="inclusive" normal="1854/1888">1854-1888</unitdate>

Right now it looks like this:

<unitdate type="inclusive">1854-1888</unitdate>

Thankfully for us, making normalized versions of dates like the above is actually fairly simple.
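As a small illustration of how simple, here is a minimal sketch that adds a normal attribute to a <unitdate> whose text is a plain range of years. It uses Python's standard xml.etree.ElementTree module. The function name add_normal and the regular expression are assumptions made for this example, and it only handles the one pattern shown above.

```python
import re
import xml.etree.ElementTree as ET

# Matches text such as "1854-1888": two four-digit years joined by a hyphen.
YEAR_RANGE = re.compile(r"^(\d{4})-(\d{4})$")


def add_normal(unitdate):
    """Add normal="begin/end" to a unitdate element holding a year range."""
    text = (unitdate.text or "").strip()
    match = YEAR_RANGE.match(text)
    # Leave the element alone if it is already normalized or does not match.
    if match and "normal" not in unitdate.attrib:
        unitdate.set("normal", f"{match.group(1)}/{match.group(2)}")
    return unitdate


element = ET.fromstring('<unitdate type="inclusive">1854-1888</unitdate>')
add_normal(element)
print(ET.tostring(element, encoding="unicode"))
```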

Normalizing Common Dates

Similar to how there were many extents that could be cleaned up in one fell swoop, there are many dates that we can normalize by running a single script. The following script will make a normal attribute containing a normalized version of any <unitdate> that is a single year or a range of years. It will also add a certainty="approximate" attribute to any year or range of years that is not made up of only exact dates. Here, for easy reference, are examples of the attributes that the script adds to each of the possible manifestations of dates that are years or ranges of years:
  • A single year (1924): normal="1924"
  • A decade (1920s): normal="1920/1929" certainty="approximate"
  • A range of exact years (1921-1933): normal="1921/1933"
  • A range of a decade to an exact year (1920s-1934): normal="1920/1934" certainty="approximate"
  • A range of an exact year to a decade (1923-1930s): normal="1923/1939" certainty="approximate"
  • A range of a decade to a decade (1920s-1930s): normal="1920/1939" certainty="approximate"
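The rules in the list above can be sketched as a single Python function. This is a minimal illustration that assumes plain hyphens between years and decades written as a year ending in zero followed by "s". The names normalize_years, ENDPOINT, and YEAR_OR_RANGE are hypothetical and are not taken from the script we ran.

```python
import re

# One endpoint: a four-digit year, optionally followed by "s" for a decade.
ENDPOINT = r"(\d{4})(s?)"
YEAR_OR_RANGE = re.compile(rf"^{ENDPOINT}(?:\s*-\s*{ENDPOINT})?$")


def normalize_years(expression):
    """Return the attributes to add for a year or range of years, or None."""
    match = YEAR_OR_RANGE.match(expression.strip())
    if match is None:
        return None  # Not a plain year or range of years.
    begin_year, begin_decade, end_year, end_decade = match.groups()
    begin = int(begin_year)
    # Assumption: a decade must be written with a year ending in zero.
    if begin_decade and begin % 10 != 0:
        return None
    if end_year is None:
        # Single year, or a single decade that ends nine years after it begins.
        end = begin + 9 if begin_decade else begin
    else:
        end = int(end_year)
        if end_decade:
            if end % 10 != 0:
                return None
            end += 9  # A closing decade runs through its final year.
    normal = str(begin) if begin == end else f"{begin}/{end}"
    attributes = {"normal": normal}
    if begin_decade or end_decade:
        attributes["certainty"] = "approximate"
    return attributes


for example in ["1924", "1920s", "1921-1933", "1920s-1934", "1923-1930s", "1920s-1930s"]:
    print(example, normalize_years(example))
```

Anything the pattern does not match, such as "circa 1900" or "May 26, 2015", returns None and is left for other cleanup.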

And here is the script:


When this script is run against our EADs, we get this result:


As you can see, this script added normal attributes to 316,578 of our 415,958 dates. In other words, this single script normalized about 75% of our dates, ensuring that ArchivesSpace will import date expressions, begin dates, and end dates for a majority of our legacy dates.

The Remaining 25% (and other surprises)

In future posts, we'll be going over how we've used OpenRefine to clean up the remaining 25% of our dates that could not be so easily automated, and we'll also be taking a look at some of the other surprising <unitdate> errors we've found lurking in our legacy EADs, including how we've identified and resolved those issues.


These are not the dates you're looking for.
