Friday, June 19, 2015

Normalizing Dates with OpenRefine

A previous post discussed some of the work we've been doing to normalize the hundreds of thousands of dates in our EADs in preparation for ArchivesSpace migration. Specifically, the post detailed why such normalization is desirable and how we wrote a Python script to automate the process of normalizing 75% of our dates. But what about the remaining 25% that did not conform to the standard and easily normalizable forms of dates that we identified?

Enter OpenRefine

We've mentioned the prominent role that OpenRefine has taken in our legacy metadata cleanup on the blog before, but have not yet gone into detail about how we've integrated it into our legacy EAD cleanup project. Before we get into the specifics of using OpenRefine, we'll first need to get our dates into a format that is easy to work with in the tool.

OpenRefine supports TSV, CSV, *SV, Excel (.xls and .xlsx), JSON, XML, RDF as XML, and Google Data documents. Although EADs are XML documents, we did not want to have to normalize dates in each XML document individually in OpenRefine. Instead, we opted to output all of our non-normalized dates to a single CSV using Python, so that we could clean up our dates in OpenRefine in spreadsheet format.

The following Python script is what we've been using to output the non-normalized dates to a CSV:



This script outputs one row in the CSV for each non-normalized date, containing the filename of the EAD, the XPath (the element's unique position in the XML), and the text of the element. This is sufficient information for us to normalize the date and then insert the normalized version back into the EAD as a 'normal' attribute.
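A minimal sketch of such a script, using only Python's standard library (the namespace-free 'unitdate' tag and the missing-'normal'-attribute test are simplifying assumptions here, not necessarily our production logic):

```python
import csv
import os
import xml.etree.ElementTree as ET

def non_normalized_dates(ead_dir):
    """Yield (filename, xpath, text) for each unitdate that lacks a
    'normal' attribute. The namespace-free tag name and the
    missing-attribute test are simplifying assumptions."""
    for filename in sorted(os.listdir(ead_dir)):
        if not filename.endswith('.xml'):
            continue
        root = ET.parse(os.path.join(ead_dir, filename)).getroot()
        # ElementTree has no getpath(), so build a positional XPath
        # by walking a child-to-parent map.
        parents = {c: p for p in root.iter() for c in p}
        for date in root.iter('unitdate'):
            if 'normal' in date.attrib:
                continue
            parts = []
            node = date
            while node is not root:
                parent = parents[node]
                same_tag = [c for c in parent if c.tag == node.tag]
                parts.append('%s[%d]' % (node.tag, same_tag.index(node) + 1))
                node = parent
            xpath = '/%s/%s' % (root.tag, '/'.join(reversed(parts)))
            yield filename, xpath, date.text or ''

def write_csv(ead_dir, out_csv):
    # One CSV row per non-normalized date: filename, XPath, text.
    with open(out_csv, 'w', newline='') as f:
        csv.writer(f).writerows(non_normalized_dates(ead_dir))
```

Note that the element text is written out untouched, whitespace and all; cleanup happens later in OpenRefine.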

Once the dates are in a CSV, a project can be created in OpenRefine, and the date normalization can begin.



Cleaning Data in OpenRefine: First Steps

The first step I like to take when working on a spreadsheet in OpenRefine is to create a working column for the data I'll be editing, so that I can always compare my changes against the data as it entered OpenRefine. Since I'll be working with the data in Column 3, all I have to do is click the triangle next to the Column 3 header and select "Add column based on this column..." from the "Edit column" drop down menu. By default, OpenRefine will prompt you to create a new column with the same values as the original. Enter a name for the new column (I'll be calling mine 'expression,' since that's what the plain text form of the date is called in ArchivesSpace) and click OK.

One of the first things I noticed about the contents of the CSV was that there were forms of dates (such as 1820s-1896) that should have been normalized by the Python script we used to automate the normalization process. Upon further inspection, these dates had not been normalized because of leading whitespace between the tag and the date text. OpenRefine makes getting rid of leading (and trailing) whitespace easy: click on the triangle next to the column name and select "Trim leading and trailing whitespace" from the "Common transforms" drop down under the "Edit cells" menu.



This transformation removed leading and trailing whitespace on 8,088 cells, which will make working with them much simpler.


Another thing we came to find out after working with our data for a while is that some fields in our EADs contain random extra spaces or line breaks. It's safe to assume (at least in the case of dates) that those extra spaces and/or line breaks are accidental. To remove them, and make the data even easier to work with, choose "Transform" from the "Edit cells" menu and enter a formula that replaces each run of whitespace with a single space.

This transformation makes splitting columns, faceting, filtering, and so on much more manageable going forward. In our case, it removed unnecessary duplicate spacing in about 1,000 cells.
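The actual step is a GREL expression entered in OpenRefine's Transform box; as a sketch, the equivalent transform in Python is a one-line regex substitution:

```python
import re

def collapse_whitespace(value):
    # Replace any run of whitespace (spaces, tabs, line breaks)
    # with a single space, mirroring the OpenRefine transform.
    return re.sub(r'\s+', ' ', value)
```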

Normalizing Dates

The best way we've found to work with cleaning data in OpenRefine is to break the data down into smaller, uniform groups, rather than overwhelm ourselves by trying to normalize all 85,000 dates at once. OpenRefine's faceting and filtering capabilities allow us to identify all instances of a very particular form of date and then normalize all instances of that form. However, narrowing down to particular forms of dates requires first surveying the data that we have. Taking a quick look through some pages of our OpenRefine project shows that we have dates of the following forms:

May 1909-July 1968
January 24, 1960
ca. 1966
October 24-28, 1976
September-November 1986
1987/88
1/19/1958
l908 [yes, that's the letter L]
½0/2002 [yes, that's a fraction]
Photographs [no, that is not actually a date]

And on and on and on. There is no way for us to automate the normalization of all of these various forms of dates, but with OpenRefine we can at least break them down into manageable and uniform chunks.

For example, let's say we want to normalize all dates like "May 1909-July 1968," or of the form "Month YYYY-Month YYYY." First we click the triangle next to the column name 'expression' and select "Text filter" from the drop down menu. This gives us a text box in which to enter either exact text or a regular expression to match on. Almost all of the filtering we do is with regular expressions; it's worth becoming familiar with them. The form of date we're after (at least for these purposes) can be matched with a regular expression.


Filtering on that regular expression gives us 5,027 matching rows to work with. 
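For illustration, a pattern along these lines captures this form of date (note that it also catches abbreviated month names like "Aug," which we'll deal with below):

```python
import re

# One alphabetic month name, a space, four digits, a dash, then the
# same pattern again -- i.e., "Month YYYY-Month YYYY".
MONTH_YYYY_RANGE = re.compile(r'^[A-Za-z]+ \d{4}-[A-Za-z]+ \d{4}$')
```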



Ultimately, what we'll want to end up with for this particular form of date is a normalized version of the form YYYY-MM/YYYY-MM (our example from earlier, for instance, should end up as 1909-05/1968-07). To make this a little more manageable, we can use OpenRefine to split our existing column into four separate columns (begin month, begin year, end month, end year), transform each of those values into the desired format, and then rejoin the columns in the proper order and with the correct punctuation. To split the column into four columns, do the following:

  1. Click the triangle next to the column name and select "Split into several columns..." from the "Edit column" drop down.
  2. In the pop up box, enter "-" (a single dash, no quotes) as the separator and uncheck the boxes for "Guess cell type" and "Remove this column." The column should now be split into a begin "Month YYYY" column and an end "Month YYYY" column (keeping with our example, we should now have one column for May 1909 and another for July 1968).
  3. Split each of these new columns into several columns, this time entering " " (a single space, no quotes) as the separator and leaving the "Remove this column" box checked. 
  4. Rename the resulting columns something short and easily identifiable. I've renamed mine "bm," "by," "em," and "ey" (for begin month, begin year, end month, and end year respectively).
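The two-stage split above can be sketched in Python (splitting first on the dash and then on the space):

```python
def split_date_range(expression):
    # First split "May 1909-July 1968" on the dash, then split each
    # half on the space, yielding (bm, by, em, ey).
    begin, end = expression.split('-')
    bm, by = begin.split(' ')
    em, ey = end.split(' ')
    return bm, by, em, ey
```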

Our spreadsheet now has four new columns alongside the original; for our example row, bm contains May, by contains 1909, em contains July, and ey contains 1968.


The next step is to make sure that some of the assumptions we've made about our data so far are correct, in particular the assumption that all of the strings of letters we captured with our regular expression signify months. To find out, again click the triangle next to each month column's name (in my case, 'bm' and 'em') and select "Text facet" from the "Facet" drop down. A list of all values and the number of times they appear in that column will now display on the left of the project. In this example, the text facet has revealed that our column containing months has 19 values (i.e., 7 more than the 12 we would expect), mostly due to abbreviations. Each facet can be clicked on or edited from within the facet pane, making it very simple to change all instances of, for example, "Aug" to "August" at once.
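A text facet is essentially a tally of the distinct values in a column; in Python terms, it's what collections.Counter does (the sample values here are illustrative, not our actual facet):

```python
from collections import Counter

# Surveying a begin-month column might reveal abbreviations
# alongside full month names.
bm_column = ['May', 'August', 'Aug', 'Sept', 'May', 'September']
facet = Counter(bm_column)
# facet.most_common() lists each distinct value with its count,
# much like OpenRefine's facet pane.
```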

Once the column values contain only the 12 correct and complete spellings of months that we're after, the next step is to transform all of the alphabetical representations of months into their respective numeric representations. To do this, select "Transform" from the "Edit cells" drop down within the columns menu and enter the following replacement formula:



This is a change I make quite frequently, so I have this transformation formula saved for easy reference. Here it is as plain text for easy copying and pasting:

value.replace('January', '01').replace('February', '02').replace('March', '03').replace('April', '04').replace('May', '05').replace('June', '06').replace('July', '07').replace('August', '08').replace('September', '09').replace('October', '10').replace('November', '11').replace('December', '12')

Once we do that with our begin and end month columns, we're ready to join our months and years back together as the final normalized version. Open the drop down menu for any column and select "Add column based on this column..." from the "Edit column" drop down. Enter the name for the new column (I'll call mine "normal"), and enter the following in the "Expression" field:


Again, for easy copying and pasting: 

cells['by'].value + '-' + cells['bm'].value + '/' + cells['ey'].value + '-' + cells['em'].value

This expression merges the values of the four separate columns in the proper order, separated by the proper punctuation, to form an ArchivesSpace-acceptable normalized date.
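Putting the whole sequence together (split, month replacement, rejoin), a Python sketch of the end-to-end normalization for this form of date:

```python
# Map full month names to their zero-padded numeric forms.
MONTHS = {'January': '01', 'February': '02', 'March': '03',
          'April': '04', 'May': '05', 'June': '06', 'July': '07',
          'August': '08', 'September': '09', 'October': '10',
          'November': '11', 'December': '12'}

def normalize(expression):
    # Split "Month YYYY-Month YYYY" into its four components, then
    # rejoin them as YYYY-MM/YYYY-MM.
    begin, end = expression.split('-')
    bm, by = begin.split(' ')
    em, ey = end.split(' ')
    return '%s-%s/%s-%s' % (by, MONTHS[bm], ey, MONTHS[em])
```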

As laid out in this post, the process of normalizing dates in OpenRefine can seem somewhat tedious. However, once you get the hang of a few simple functions in OpenRefine, you can use them to isolate, clean, and normalize large chunks of data fairly quickly. We will undoubtedly talk about additional work we've done using OpenRefine in future posts, but in the meantime the OpenRefine wiki and documentation on GitHub are an excellent resource for further instruction about the powerful things the tool can do. In the coming weeks we'll also be detailing how we take our cleaned-up data from OpenRefine and reinsert it back into our EADs. Stay tuned!

2 comments:

  1. Really interesting post - thanks for writing this.

    In case it is helpful, another strategy you can adopt with dates in OpenRefine is using the 'toDate' function. This converts things into OpenRefine's Date format. Once you have a date in this format you can convert it back into a string specifying the format you want for the string. When you use 'toDate' on a value OpenRefine will do its best to recognise the starting format, but sometimes you have to prompt it by including the format in the command.

    To take a simple example, if you start with "10 May 1976" you can use:

    value.toDate("dd MMM yyyy").toString("yyyy-MM-dd")

    The 'dd MMM yyyy' tells OpenRefine the format of the date you are starting with and the 'yyyy-MM-dd' tells it the format you want to end up with - so in this case you'd get "1976-05-10".

    To work a real example - you have "May 1909-July 1968" and want "1909-05/1968-07". You can do this with:

    forEach(value.split("-"),v,v.toDate("MMM yyyy").toString("yyyy-MM")).join("/")

    This splits the original string into an array (using 'split') then for each member of the array it tries to convert it to a Date, then back to the correctly formatted string.

    Thanks again for the post

  2. Thanks for the comment! Glad you enjoyed the post.

    That is a really helpful tip. I experimented a bit with the toDate function initially, but OpenRefine did not seem to reliably and accurately recognize the starting format. If I had known that you could include the format in the command and then convert the date back into a string, it could have saved me some time and effort, at least with some of the more accurate and uniform kinds of dates.

    One of the benefits of using the somewhat more labor intensive method described in this post was that I was able to take a really close look at the dates by faceting on months, years, days, etc. to seek out incorrect data. We had a lot of dates that were invalid (e.g., "April 31"), others that were incorrectly typed (e.g., the begin date came after the end date), and others still that were not dates at all (e.g., four-digit addresses that had been improperly identified as dates by an encoding script many years ago). I'm not sure how toDate and toString would have handled those, but I will definitely be putting those functions to use for projects like this in the future.

    Thanks again for the great comment!
