Tuesday, April 28, 2015

Legacy EAD Import into ArchivesSpace

As previously detailed by Max, the Bentley Historical Library is in the process of implementing ArchivesSpace as a component of its ArchivesSpace-Archivematica-DSpace Workflow Integration Project.

One of the requirements for using ArchivesSpace to manage our accession and collection metadata going forward is to migrate all of our legacy data into the new system. Up until now, the Bentley has been managing its accessions in a FileMaker Pro database, and its finding aids are all encoded as EAD (Encoded Archival Description). This creates some challenges for migrating that legacy data into ArchivesSpace.

As ArchivesSpace was formed as a merger between Archivists' Toolkit and Archon, institutions that were previously using one of those collection management systems can make use of migration tools to move information from the old system into ArchivesSpace. Institutions that were not using Archivists' Toolkit or Archon, such as the Bentley, must migrate their data using several standard formats, including CSV for accessions and EAD or MARCXML for collections. At the Bentley, we have chosen to start our migration to ArchivesSpace by first focusing on our EAD finding aids.

Legacy EAD Import Testing

Because EAD allows considerable variety in practice while ArchivesSpace enforces more specific requirements, migrating legacy collections data from EAD into ArchivesSpace is not as simple as starting a batch import job and having all of that data import properly out of the box.

In fall 2014, I conducted a student practicum at the Bentley Historical Library investigating the importation of some of the Bentley's legacy EAD finding aids into ArchivesSpace, focusing primarily on identifying issues that exist between the stock ArchivesSpace EAD importer and the Bentley's legacy encoding and descriptive practices.

Altogether, I tested 166 finding aids, a representative sample of the Bentley's approximately 3,000 EADs. The EADs were chosen to represent as much of the variety in the Bentley's legacy encoding as possible, and some were included specifically because they seemed likely to have compatibility issues.

Of the 166 EADs that I tested, 107 imported successfully and 59 had errors on the initial import attempt, for an error rate of 35.54%. While this error rate could certainly be worse, it scales to roughly 1,000 of the Bentley's EADs not importing successfully into ArchivesSpace. A close examination of the errors that were encountered was necessary in order to identify some potential solutions.

Errors

During the testing process, several specific types of errors became apparent. It is worth noting that many of the errors are a result of ArchivesSpace, not EAD or DACS (Describing Archives: A Content Standard), requirements. The Bentley's legacy EADs are all valid EAD and conform to existing descriptive standards, and oftentimes contain the information that ArchivesSpace requires, just not necessarily in the exact place or in the exact form that ArchivesSpace expects.

The most common type of error that I encountered was that the Bentley's EADs do not always supply information that is required by ArchivesSpace in a way that the ArchivesSpace EAD importer understands. For example, the most common error (accounting for nearly half of all errors) was the result of digital objects (in the form of <dao> tags in the EAD) being imported into ArchivesSpace without titles. Digital objects in ArchivesSpace require titles, and the stock ArchivesSpace EAD importer looks for titles in the title attribute of <dao> tags. The practice at the Bentley, however, has been to indicate digital object titles in a <unittitle> tag; the title is there, it just isn't importing properly into ArchivesSpace.
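To make the mismatch concrete, here is a minimal Python sketch of one way the existing title could be copied to where the importer looks for it. It is an illustration only, not the Bentley's actual script or the importer's logic. It assumes that element names carry no XML namespace, that each <dao> sits in the same <did> as its <unittitle>, and that the attribute is literally named title; the function name add_dao_titles and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample component; real finding aids will differ.
SAMPLE = """<c01 level="file">
  <did>
    <unittitle>Annual report <unitdate>1975</unitdate></unittitle>
    <dao href="http://example.org/object/1" show="new" actuate="onrequest"/>
  </did>
</c01>"""


def add_dao_titles(root):
    """Copy a <did>'s <unittitle> text into the title attribute of its untitled <dao> children.

    Assumes (for illustration) that the <dao> and <unittitle> share a <did>.
    Returns the number of <dao> elements that were given a title.
    """
    fixed = 0
    for did in root.iter("did"):
        unittitle = did.find("unittitle")
        if unittitle is None:
            continue
        # itertext() also picks up text nested in child tags such as <unitdate>;
        # split/join collapses runs of whitespace.
        title = " ".join("".join(unittitle.itertext()).split())
        if not title:
            continue
        for dao in did.findall("dao"):
            if not dao.get("title"):
                dao.set("title", title)
                fixed += 1
    return fixed


root = ET.fromstring(SAMPLE)
print(add_dao_titles(root), "digital object(s) given a title")
print(root.find(".//dao").get("title"))
```

Existing title attributes are left untouched, so running the sketch twice does not change the result.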

Another common type of error was incompatibilities in the way some fields in the Bentley's EADs were structured or the way that some of the content within the fields was supplied. A common example of this type of error can be found in some of our extent statements. ArchivesSpace requires extent statements to be formatted as a number followed by letters, such as "2 linear feet." Some of the Bentley's extent statements, however, begin as letters followed by a number, such as "ca. 1000 linear feet." ArchivesSpace is not designed to allow these types of extent statements, so EADs that contain such statements return an error during the import process.
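A quick way to find statements like these before import is to test them against the "number followed by letters" shape. The Python sketch below does that with a regular expression. The pattern is my own approximation of the rule as described above, not the check that ArchivesSpace itself performs, and the function name split_extent is hypothetical.

```python
import re

# Approximation of "a number followed by letters": digits (optionally with a
# decimal part), whitespace, then text that starts with a letter.
EXTENT_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+([A-Za-z].*?)\s*$")


def split_extent(statement):
    """Return (number, extent_type) if the statement fits the pattern, else None."""
    match = EXTENT_PATTERN.match(statement)
    if match is None:
        return None
    return match.group(1), match.group(2)


for statement in ["2 linear feet", "ca. 1000 linear feet"]:
    print(repr(statement), "->", split_extent(statement))
```

Statements that come back as None are the ones that would need attention, whether by moving qualifiers such as "ca." elsewhere or by some other cleanup.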

A detailed breakdown of all of the errors encountered during legacy EAD import testing is as follows:
  1. Digital objects missing title attributes: 28 occurrences; 47.46% of errors
  2. Indices not formatted in accordance with ArchivesSpace requirements: 14 occurrences; 23.73% of errors
  3. Component-level descriptions missing either a title or a date: 13 occurrences; 22.03% of errors
  4. Extent types not conforming to ArchivesSpace specifications: 7 occurrences; 11.86% of errors
  5. Extent statement formatting not conforming to ArchivesSpace specifications: 5 occurrences; 8.47% of errors
  6. Container tags improperly formatted: 3 occurrences; 5.08% of errors
  7. Unidentified archival_object error: 2 occurrences; 3.39% of errors
  8. Empty <unitdate> tags: 2 occurrences; 3.39% of errors
  9. Unidentified file_version error: 1 occurrence; 1.69% of errors
  10. Character encoding (in this instance, a right curly quote instead of a straight quote): 1 occurrence; 1.69% of errors
  11. Invalid EAD doctype definition: 1 occurrence; 1.69% of errors
  12. Empty <unitid> tag: 1 occurrence; 1.69% of errors
Notice those two "unidentified" errors (numbers 7 and 9) above? One of the biggest obstacles in determining what was causing an error during the EAD import testing was parsing the ArchivesSpace error messages. The error messages, at least in ArchivesSpace version 1.0.9, did not point to a particular line in the EAD that was causing the error. Rather, they were based on where the error occurred in the conversion of the EAD to the ArchivesSpace JSONModel.
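Because the importer's messages do not point back into the EAD, it can help to check files for known problem conditions before importing them. As a minimal sketch, the Python below looks for two of the conditions from the list above, empty <unitdate> and <unitid> tags, and reports the nearby <unittitle> as context. It assumes un-namespaced element names and only inspects direct children of each <did>; the function name find_empty_tags and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample finding aid with one empty <unitid> and one empty <unitdate>.
SAMPLE = """<ead>
  <archdesc level="collection">
    <did>
      <unitid></unitid>
      <unittitle>Sample papers</unittitle>
      <unitdate>1900-1950</unitdate>
    </did>
    <dsc>
      <c01 level="file">
        <did>
          <unittitle>Correspondence</unittitle>
          <unitdate/>
        </did>
      </c01>
    </dsc>
  </archdesc>
</ead>"""


def find_empty_tags(root, tags=("unitdate", "unitid")):
    """Return (tag, unittitle text) pairs for empty tags that are direct children of a <did>."""
    problems = []
    for did in root.iter("did"):
        title_el = did.find("unittitle")
        if title_el is not None:
            context = "".join(title_el.itertext()).strip()
        else:
            context = "(no unittitle)"
        for tag in tags:
            for element in did.findall(tag):
                # An element counts as empty if it has no non-whitespace text.
                if not "".join(element.itertext()).strip():
                    problems.append((tag, context))
    return problems


for tag, context in find_empty_tags(ET.fromstring(SAMPLE)):
    print(f"empty <{tag}> near: {context}")
```

A check like this does not explain the importer's messages, but it narrows down which part of a finding aid to look at.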

Some of the error messages were fairly easy to understand, such as the following:

Error: Problem creating 'American Civil Liberties Union of Washtenaw County Records
1961-2000': id_0 That ID is already in use, ead_id Must be unique
What this says is that there is an existing resource with the same ead_id (likely as a result of the same EAD being imported previously), and that ead_ids must be unique. Simple enough.

However, other error messages are not quite so helpful, such as the error message for the "Unidentified archival_object error":
Error: Unexpected Object Type in Queue: Expected archival_object got file_version
Despite much initial confusion, however, I was eventually able to understand most of the error messages provided by the ArchivesSpace EAD importer, which provided a great deal of guidance in identifying potential strategies for moving forward with our legacy EAD migration.

Solutions

Once I had a list of all of the known compatibility issues between our EADs and the ArchivesSpace EAD importer, it was clear that there was much work to be done to make our EADs and ArchivesSpace work well together. In addition, beyond the error messages described in this post, there are numerous examples of fields in our EADs that import successfully, but not quite in the way we want the data to be in ArchivesSpace going forward (posts on those additional concerns forthcoming!). In order to migrate our EAD finding aids successfully, and with all of the data mapped as we would like, some changes are necessary to the ArchivesSpace importer and in our legacy encoded data, which will be detailed in future posts.

Ultimately, the challenges posed by migrating our legacy data into ArchivesSpace pale in comparison to the benefits and opportunities that will be afforded to us once the process is complete. The end result of the migration process will allow us to manage information about our collections in a single, standardized, community-supported tool, something about which we are very excited. We'll be sharing some detailed information about some of the solutions we've come up with to migrate our legacy data in later posts, including details about our own ArchivesSpace plugin, some custom Python scripts, and the use of OpenRefine. Until then, it is worth noting two principles that have helped guide us through this process:

1. View the ArchivesSpace migration as an unprecedented opportunity to clean up legacy metadata.
We are working with some EADs that were created years ago, and it's safe to assume that it will be a while until the opportunity arises to do metadata cleanup on this scale again.

2. Automate legacy metadata cleanup and ArchivesSpace error resolution as much as possible.
In the process of migrating our legacy EADs to ArchivesSpace, we have spent a good amount of time and effort improving our skills in programming, working with XML, and working with existing metadata cleanup tools. Improving our ability to automate some of this work has greatly enhanced our efficiency, given us the ability to quickly resolve some major issues, and increased our ability to focus on additional problems and concerns as they have arisen.
