Dates in our Current Finding Aids
Currently, our dates are encoded with <unitdate> tags in our EADs, but lack a "normal" attribute containing a normalized, machine-readable version of the date.
As an example, our dates might currently look like this: <unitdate type="inclusive">May 26, 2015</unitdate>
As opposed to this: <unitdate type="inclusive" normal="2015-05-26">May 26, 2015</unitdate>
Until now, this has not really been a problem. As you can see from an example such as the Adelaide J. Hart papers, our dates are presented to users as plain text in our current finding aid access system. Under the hood, those dates are encoded as <unitdate> elements, but our access system has no way to search or facet by date. As such, the access system has never needed a normalized, machine-readable form of dates. But what about ArchivesSpace?
Dates in ArchivesSpace
Before getting into what happens to our legacy <unitdate> elements when imported into ArchivesSpace, let's take a look at a blank ArchivesSpace date record.
Based on all of the date fields provided by ArchivesSpace, we can already see here that we're moving beyond plain text representation of our dates. Of particular interest for the purposes of this blog post are the fields for "Expression," "Begin," and "End." Hovering over the * next to "Expression" brings up the following explanation of what that field represents:
What this means is that the "Expression" field will essentially recreate the plain text, human-understandable version of dates that we have been using up until now. Simple enough.
Once we take a look at the "Begin" and "End" fields, however, we can start to see where our past practice and future ArchivesSpace possibilities come into conflict. The "Begin" and "End" fields give us the ability to record normalized versions of our dates that ArchivesSpace (and other systems) can understand. This is definitely functionality that we will want to use going forward, but what does this mean for our legacy data?
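For a rough sense of how those three fields fit together, here is a minimal sketch of a filled-in ArchivesSpace date record, shown as a Python dictionary for illustration. The field names are an approximation of the ArchivesSpace date model based on the form above, not copied from its schema:

# A sketch (not the actual ArchivesSpace schema) of a single date record
# once "Expression," "Begin," and "End" have been filled in.
date_record = {
    'label': 'creation',
    'date_type': 'inclusive',
    'expression': 'May 26, 2015',  # the human-readable version of the date
    'begin': '2015-05-26',         # normalized, machine-readable begin date
    'end': '2015-05-26',           # normalized, machine-readable end date
}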
Let's see what happens to our dates when we import one of our legacy EADs into ArchivesSpace.
The ArchivesSpace EAD importer took the contents of a <unitdate> tag and made a date expression of 1854-1888. It did not, however, make a begin date of 1854 or an end date of 1888. Why not? Lines 168-188 of the ArchivesSpace EAD importer can help us understand.
We'll get into a bit more detail about making sense of the ArchivesSpace EAD importer in future posts about creating our custom EAD importer, but for now we'll take a higher-level look at what this portion of the EAD importer is doing. This bit of the importer takes a <unitdate> tag and makes an ArchivesSpace date record using various components of that <unitdate> tag and its related attributes. At line 178, the importer makes a date expression from the inner_xml of the <unitdate> tag, that is, the text between the opening and closing <unitdate> tags, essentially recreating the plain text version of the dates that we currently have. But how is it making normalized begin and end dates?
On lines 180 and 181, the EAD importer is making a begin date with norm_dates[0] and an end date with norm_dates[1]. If we look at lines 170-174, we can see how those norm_dates are being made. The ArchivesSpace EAD importer is looking for a normal attribute (represented in the EAD importer as "att('normal')") in the <unitdate> tag and splitting the contents of that attribute on a forward slash to get the begin date (norm_dates[0]) and end date (norm_dates[1]).
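To make that concrete, here is a minimal Python sketch of the same logic. The importer itself is written in Ruby; the variable names below are just for illustration:

unitdate_text = '1854-1888'       # the text inside <unitdate></unitdate> (inner_xml)
normal_attribute = '1854/1888'    # the value of the normal attribute, if present

expression = unitdate_text                 # becomes the date "Expression"
norm_dates = normal_attribute.split('/')   # ['1854', '1888']
begin = norm_dates[0]                      # '1854' -- the begin date
end = norm_dates[1]                        # '1888' -- the end date

print expression, begin, end               # 1854-1888 1854 1888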
In order for our example imported date above to have begin and end dates, the <unitdate> tag should look like this:
<unitdate type="inclusive" normal="1854/1888">1854-1888</unitdate>
Right now it looks like this:
<unitdate type="inclusive">1854-1888</unitdate>
Thankfully for us, making normalized versions of dates like the above is actually fairly simple.
Normalizing Common Dates
Similar to how there were many extents that could be cleaned up in one fell swoop, there are many dates that we can normalize by running a single script. The following script will make a normal attribute containing a normalized version of any <unitdate> that is a single year or a range of years. It will also add a certainty="approximate" attribute to any year or range of years that is not made up of only exact dates. Here, for easy reference, are examples of the attributes that the script adds to each of the possible manifestations of dates that are years or ranges of years:
- A single year (1924): normal="1924"
- A decade (1920s): normal="1920/1929" certainty="approximate"
- A range of exact years (1921-1933): normal="1921/1933"
- A range of a decade to an exact year (1920s-1934): normal="1920/1934" certainty="approximate"
- A range of an exact year to a decade (1923-1930s): normal="1923/1939" certainty="approximate"
- A range of a decade to a decade (1920s-1930s): normal="1920/1939" certainty="approximate"
And here is the script:
# Import what we need
import os
from os.path import join
import re

from lxml import etree

path = 'path/to/EADs'  # <-- Change this to your EAD directory path

# Make some regular expressions
yyyy = re.compile(r'^\d{4}$')               # Ex: 1920
yyyys = re.compile(r'^\d{4}s$')             # Ex: 1920s
yyyy_yyyy = re.compile(r'^\d{4}\-\d{4}$')   # Ex: 1920-1930
yyyys_yyyy = re.compile(r'^\d{4}s\-\d{4}$') # Ex: 1920s-1930
yyyy_yyyys = re.compile(r'^\d{4}\-\d{4}s$') # Ex: 1920-1930s
yyyys_yyyys = re.compile(r'^\d{4}s\-\d{4}s$') # Ex: 1920s-1930s

# Initialize these values to keep track of how many dates we've normalized
normalized_dates = 0
not_normalized_dates = 0

for filename in os.listdir(path):
    print filename  # Print the filename that is currently being checked. This is helpful for identifying errors.
    tree = etree.parse(join(path, filename))
    # xpath that checks for a <unitdate> anywhere in the EAD
    dates = tree.xpath('//unitdate')
    # Loop through each <unitdate>
    for i in dates:
        if i.text and len(i.text) > 0:
            # Check if the content of <unitdate> matches any of those regular expressions
            if yyyy.match(i.text) and len(i.text) == 4:  # We also verify that the length is what we would expect based on the regular expression for an added level of certainty that these really are the kinds of dates we're looking for
                i.attrib['normal'] = i.text  # Dates like "1920" don't need to be changed at all to make a normalized version
                normalized_dates += 1
            elif yyyys.match(i.text) and len(i.text) == 5:
                i.attrib['normal'] = i.text.replace('s', '') + '/' + i.text[:3] + '9'  # Change dates like "1920s" to "1920/1929"
                i.attrib['certainty'] = "approximate"  # Since this is a date range and not an exact date, add an "approximate" certainty attribute
                normalized_dates += 1
            elif yyyy_yyyy.match(i.text) and len(i.text) == 9:
                i.attrib['normal'] = i.text.replace('-', '/')  # Dates like "1920-1930" are easy: simply replace the '-' with a '/' to get "1920/1930"
                normalized_dates += 1
            elif yyyys_yyyy.match(i.text) and len(i.text) == 10:
                i.attrib['normal'] = i.text.replace('-', '/').replace('s', '')  # "1920s-1930" becomes "1920/1930" by dropping the 's' and changing the '-' to a '/'
                i.attrib['certainty'] = "approximate"
                normalized_dates += 1
            elif yyyy_yyyys.match(i.text) and len(i.text) == 10:
                normalized = i.text.replace('-', '/')  # For dates like "1920-1930s", first replace the '-' with a '/' to get "1920/1930s"
                normalized = normalized[:-2] + '9'     # Now replace the last two characters ('0s') with '9', yielding "1920/1939"
                i.attrib['normal'] = normalized
                i.attrib['certainty'] = "approximate"
                normalized_dates += 1
            elif yyyys_yyyys.match(i.text) and len(i.text) == 11:
                normalized = i.text.replace('-', '/').replace('s', '', 1)  # For dates like "1920s-1930s", replace the '-' with a '/' and remove ONLY the first 's' to get "1920/1930s"
                normalized = normalized[:-2] + '9'                         # Now replace the last two characters with '9', yielding "1920/1939"
                i.attrib['normal'] = normalized
                i.attrib['certainty'] = "approximate"
                normalized_dates += 1
            else:
                not_normalized_dates += 1
                continue
        else:
            not_normalized_dates += 1
            continue
    outfilepath = 'path/to/new/EADs'  # <-- Change this to a different directory than the one you started with in case anything goes wrong. You don't want to overwrite your original EADs.
    outfile = open(join(outfilepath, filename), 'w')
    outfile.write(etree.tostring(tree, encoding="utf-8", xml_declaration=True))  # Write the new version of the EAD with normalized dates!
    outfile.close()

# Add up our normalized_dates and not_normalized_dates to get the total dates checked
total_dates = normalized_dates + not_normalized_dates

# Print the results of our normalization attempt
print "Normalization attempted on " + str(total_dates) + " dates"
print "Number of dates normalized: " + str(normalized_dates)
print "Number of dates not normalized: " + str(not_normalized_dates)
When this script is run against our EADs, we get this result:
As you can see, this script added normal attributes to 316,578 of our 415,958 dates. In other words, this single script normalized about 75% of our dates, ensuring that ArchivesSpace will import date expressions, begin dates, and end dates for a majority of our legacy dates.
The Remaining 25% (and other surprises)
In future posts, we'll be going over how we've used OpenRefine to clean up the remaining 25% of our dates that could not be so easily automated, and we'll also be taking a look at some of the other surprising <unitdate> errors we've found lurking in our legacy EADs, including how we've identified and resolved those issues.
These are not the dates you're looking for.