Thursday, September 3, 2015

The ArchivesSpace API

One of the most powerful tools we've been making use of in our task to migrate our legacy EADs and accession records to ArchivesSpace is the ArchivesSpace API (Application Programming Interface). The API allows users to interact with the ArchivesSpace backend to accomplish, more efficiently and programmatically, any task that can be completed in the ArchivesSpace frontend, including creating, updating, and deleting accessions, resources, archival objects, digital objects, subjects, agents, and other records. This post includes a very brief introduction to using the ArchivesSpace API, but if you are truly a newcomer to APIs in general and to the ArchivesSpace API in particular (as we were just a few months ago!), there is no better place to start than the ArchivesSpace Developer Screencasts put together by Hudson Molonglo, in particular screencast 2, Backend Introduction. Also, while this post focuses on only a few of the available ArchivesSpace API endpoints, all of the available endpoints are detailed in the ArchivesSpace API documentation.

Using the ArchivesSpace API

A major benefit of the ArchivesSpace API is that it allows users to interact with the ArchivesSpace application without having to modify the core application code or use ArchivesSpace's programming language, Ruby. This is great for us: while we have learned enough Ruby to write some basic ArchivesSpace plug-ins, most of the programmatic work we've done for this project has been written in Python. Using the API lets us interact directly with the ArchivesSpace application in a language we are more familiar with, especially with respect to accessing and modifying our legacy data. However, while most of the scripts detailed in this post are written in Python, an easier way to get started with the ArchivesSpace API is to use curl in a Mac or Linux terminal or in a Windows Unix emulator such as Cygwin.

Once you've opened up a terminal or Cygwin with curl installed, you can send a simple request to the ArchivesSpace backend to test that the backend API is available. If you're running ArchivesSpace on a test instance on your local computer (which I highly recommend when experimenting with the API, and with the application in general), that request and the resulting response look something like this:

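Something along these lines (the response details vary by ArchivesSpace version and database, so the values below are illustrative):

    curl http://localhost:8089/

    {"databaseProductName":"Derby","databaseProductVersion":"10.9.1.0 - (1344872)","archivesSpaceVersion":"v1.4.2"}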


If you're interacting with an ArchivesSpace instance that is not running on your local machine, substitute your ArchivesSpace instance url for localhost and the port on which the backend is running for 8089.

Most of the really powerful things that can be done with the API require users to prove that they have permission to do them. So once you've confirmed that you can communicate with the ArchivesSpace API, the next step is to authenticate and start a session. To authenticate using the default administrator username and password, the request and the beginning of the response look like this:

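A sketch of that exchange (default admin password shown; the response is truncated, and the session value is made up):

    curl -s -F password="admin" "http://localhost:8089/users/admin/login"

    {"session":"9528c078d046a7718b31ad1e4869a8a04d25cf26f04592ba3de79c44a6a11c25", ...}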


This request returns a longer response than the first one we sent, including the session token that you will need to send in a header with every subsequent request. Since the session token is a really, really long string, it makes things a lot easier to store it in a variable, like so:

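In a bash-style shell, something like this (the token is made up, and much shorter than a real one):

    export SESSION="9528c078d046a7718b31ad1e4869a8a0"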

Subsequent requests sent to the API should include the session token in the header, like this:

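For example, to list the repositories in your ArchivesSpace instance:

    curl -H "X-ArchivesSpace-Session: $SESSION" "http://localhost:8089/repositories"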

This tells ArchivesSpace that you are an authenticated user who is allowed to do all sorts of really powerful and potentially dangerous things. Again, always test your code on a test instance of ArchivesSpace!

The majority of the actions that can be completed via the API take the form of either HTTP GET or POST requests. As the names may imply, GET requests return some data to the user and POST requests submit some data to the application. GET and POST requests can often be sent to the same backend endpoint, with GET requests including the particular ID of a desired record and POST requests including the data (in ArchivesSpace JSONModel format) of the record to be created. Here are some quick examples:

A GET request that returns the IDs of all resources in repository 2
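
In curl, something like this (the all_ids parameter asks for every ID in one response; without a parameter like it, this endpoint returns paginated results):

    curl -H "X-ArchivesSpace-Session: $SESSION" "http://localhost:8089/repositories/2/resources?all_ids=true"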

A GET request that returns the JSON representation of resource 3
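
For example:

    curl -H "X-ArchivesSpace-Session: $SESSION" "http://localhost:8089/repositories/2/resources/3"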

A GET request that returns the JSON representation of subject 1
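
Note that subjects, unlike resources, are not scoped to a repository:

    curl -H "X-ArchivesSpace-Session: $SESSION" "http://localhost:8089/subjects/1"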

A POST request that creates a new subject. The API returns a bit of JSON including the ID and uri of the posted subject.
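
A sketch of such a request and response (the subject's contents and the returned ID are illustrative; each term requires a term, a term_type, and a vocabulary):

    curl -H "X-ArchivesSpace-Session: $SESSION" -d '{"source": "local", "vocabulary": "/vocabularies/1", "terms": [{"term": "Ann Arbor (Mich.)", "term_type": "geographic", "vocabulary": "/vocabularies/1"}]}' "http://localhost:8089/subjects"

    {"status":"Created","id":1,"lock_version":0,"stale":true,"uri":"/subjects/1","warnings":[]}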

The new subject posted via API as seen in the ArchivesSpace staff interface

Authenticating via the API is about as far as the API documentation goes in terms of providing detailed instructions for beginners. Figuring out how to use all of the available API endpoints to your advantage can occasionally be an easy process but, if you're like me, it often involves a lot of trial and error and head scratching over exactly what request, including data and parameters, the ArchivesSpace API requires for each endpoint. We are by no means expert users of the ArchivesSpace API, but we have figured out some really awesome solutions to our particular problems by getting just far enough in our understanding of some of the endpoints. The rest of this post documents a few of the projects we've completed using the API. If you're interested in reading about other great work that archivists have done with other API endpoints, check out Maureen Callahan's post on deleting records and Hillel Arnold's post on automating exports using the API.

Disclaimer: All of the following examples are based on our very particular use cases, legacy data, and programming expertise or lack thereof. As such, the exact workflows and Python scripts shared herein will likely not be applicable to most other institutions and their data. Rather, they are intended to serve as examples of what is possible via the ArchivesSpace API and to provide some guidance on how certain endpoints can be used.

Creating Digital Objects

The idea for this script came from a conversation with a colleague about the possibility of automating the creation of digital objects in ArchivesSpace for digitized archival objects. The starting point is a spreadsheet inventory of a collection containing the ArchivesSpace ref ID for each archival object, a barcode or other sort of identifier for the digitized content, and the url for the digitized content. Such a spreadsheet could easily be created using an ArchivesSpace exported EAD, which contains the ArchivesSpace ref ID for each archival object as a component level id attribute:

An ArchivesSpace archival object. Note the Ref ID.

That same archival object in an ArchivesSpace exported EAD. Note the <c02> id attribute.

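So a row in such a spreadsheet inventory might look something like this (the headers and values are made up for illustration):

    ref_id,identifier,url
    49f979cfa006ca38c25ba0a01f6d2e4d,39015012345678,https://quod.lib.umich.edu/example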
The series of API requests to create a new digital object and link it to the existing archival object goes like this:

1. Using the archival object Ref ID, search ArchivesSpace for the archival object's uri using the search endpoint

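In curl, something like this (the ref ID is made up, and this assumes ref_id is the field name in ArchivesSpace's Solr index; note that the search endpoint requires a page parameter):

    curl -H "X-ArchivesSpace-Session: $SESSION" "http://localhost:8089/repositories/2/search?page=1&q=ref_id:49f979cfa006ca38c25ba0a01f6d2e4d"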

This returns a JSON representation of the search results. Since ref IDs should be unique, there should be only one search result, containing the bit of information that we're after:

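The relevant line of the JSON response looks something like this (the ID is made up):

    "uri": "/repositories/2/archival_objects/5678"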
The uri of the archival object that matches the searched-for ref ID
2. Using the archival object uri from the search results, get the JSON representation for that archival object using the get archival object endpoint

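Continuing the example:

    curl -H "X-ArchivesSpace-Session: $SESSION" "http://localhost:8089/repositories/2/archival_objects/5678"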

This returns the JSON representation of the archival object, which we can store, add a new instance to, and repost to update the archival object via the API.

3. Using the archival object's display string (a concatenation of its title and date) from the archival object JSON and the identifier and digital object uri from the spreadsheet, form the JSON for a new ArchivesSpace digital object and post it using the post digital object endpoint

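A sketch of that request and response (the title, identifier, and url stand in for the values from the archival object JSON and the spreadsheet; the returned ID is made up):

    curl -H "X-ArchivesSpace-Session: $SESSION" -d '{"title": "Correspondence, 1942-1945", "digital_object_id": "39015012345678", "file_versions": [{"file_uri": "https://quod.lib.umich.edu/example"}]}' "http://localhost:8089/repositories/2/digital_objects"

    {"status":"Created","id":321,"lock_version":0,"stale":true,"uri":"/repositories/2/digital_objects/321","warnings":[]}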

This returns a JSON message containing the uri for the newly created digital object.

The posted digital object

4. Using the uri for the new digital object, create a new archival object instance and add it to the archival object JSON
5. Post the updated archival object JSON to the archival object's uri to update the archival object in ArchivesSpace

I've actually only ever done those last two steps in the Python script that I wrote to automate this whole process. That script can be found here, and here are links to the lines corresponding to each of the above steps to see how it's done in Python instead of curl: [1] [2] [3] [4] [5]
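
For instance, steps 4 and 5 look roughly like this in Python with the requests library (a self-contained sketch, not the script itself; the uris are made up and error handling is omitted):

    import json
    import requests

    backend = 'http://localhost:8089'

    # Authenticate and build the session header (default admin credentials)
    session = requests.post(backend + '/users/admin/login', data={'password': 'admin'}).json()['session']
    headers = {'X-ArchivesSpace-Session': session}

    # The uris produced by steps 1-3 (the values here are made up)
    ao_uri = '/repositories/2/archival_objects/5678'
    digital_object_uri = '/repositories/2/digital_objects/321'

    # Step 4: add an instance linking the archival object to the new digital object
    archival_object = requests.get(backend + ao_uri, headers=headers).json()
    archival_object['instances'].append({
        'instance_type': 'digital_object',
        'digital_object': {'ref': digital_object_uri}
    })

    # Step 5: repost the updated JSON to the archival object's own uri to update it
    response = requests.post(backend + ao_uri, headers=headers, data=json.dumps(archival_object))
    print(response.json())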

As you may be able to tell, curl is really useful for single, simple interactions with the API and for testing endpoints to see how they work. If you have a series of API calls that need to be strung together and repeated over and over again, it's much easier to do that with a programming language like Python and its requests library.

Creating Subjects and Agents

As we've mentioned many times on this blog before, we are migrating our legacy descriptive metadata to ArchivesSpace using our EADs. As such, one of the limitations we have faced is the stock ArchivesSpace EAD importer, which works for the most part but is not exactly what we need for the data we have. Our solution to that has been our huge EAD cleanup project that we've detailed on this blog in combination with some local modifications to the EAD importer. But what happens when the issue is not due to messy data or to the ArchivesSpace EAD importer, but to the limitations of EAD itself?

EAD 2002 does not have a way of representing subdivided subjects (e.g., Ann Arbor--Dwellings) or the various components that make up agents (primary name, rest of name, dates of existence) to the same level of granularity as ArchivesSpace (EAD3 will help!). Take a look at this <geogname> in our EAD, for instance:

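Something like this (the contents are illustrative):

    <geogname source="lcsh">Ann Arbor (Mich.)--Dwellings.</geogname>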

This is imported into ArchivesSpace like this:

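That is, as a single term containing the whole string, typed by the element name; in JSONModel terms, something like:

    {"terms": [{"term": "Ann Arbor (Mich.)--Dwellings.", "term_type": "geographic", "vocabulary": "/vocabularies/1"}]}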


When really it should look like this:

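That is, with each subdivision as its own typed term; again as a JSONModel sketch:

    {"terms": [{"term": "Ann Arbor (Mich.)", "term_type": "geographic", "vocabulary": "/vocabularies/1"},
               {"term": "Dwellings", "term_type": "topical", "vocabulary": "/vocabularies/1"}]}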

We want our data to be migrated to ArchivesSpace as cleanly and correctly as possible. While subdivided subjects might seem like not-such-a-big-deal, we plan to use ArchivesSpace to export MARC XML records for our collections, we will ultimately want to take advantage of the functionality of EAD3, and new subjects created within ArchivesSpace will likely follow the example in the second ArchivesSpace image above. So now is really the best time to ensure that our legacy subjects are migrated to ArchivesSpace properly. Enter the API.

Posting subjects via the API is actually really simple (see the curl example near the top of this post). What was REALLY complicated about the process of using the API to post our subjects is that a term type is required for each individual term. Since EAD does not have the structure to support multiple terms, much less term types, the HIGHLY messy process that we used looks like this:


1. Agonize over the apparent hopelessness of the issue for a little while, until we realize that we have MARC records for all of our collections and that those MARC records have structured subdivided subjects with terms and term types
2. Get a MARC XML export of all of our archival collections from our catalog
3. Use a combination of scripts to make a csv of all of our unique EAD subjects with subdivided subjects split up into individual terms and a csv of all of our MARC subjects with each individual term and term type identified
4. Run a script that identifies the term type for all individual terms and outputs a csv with all of our unique EAD subjects with individual terms and term types included 
5. Use the API to post all of our subjects to ArchivesSpace correctly, outputting a csv with each subject and the uri of the posted subject in ArchivesSpace
6. Add the posted subject uris to our EADs as ref attributes. We know this is invalid EAD and we know it's wrong, but one of the other ways we've found of getting around the limitations of EAD for this migration is to ignore them! (We're also saving all of our "break the EADs" scripts until just before we migrate)

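The result is something like this (the subject uri is made up):

    <geogname source="lcsh" ref="/subjects/1234">Ann Arbor (Mich.)--Dwellings.</geogname>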

7. Modify the EAD importer to use a ref attribute to link to existing subjects instead of creating subjects during the import process

That's all it takes to use the API to create over 10,000 well-formed ArchivesSpace subjects! Max has done something similar to split up our agents into their component parts and post them using the API to take advantage of the more structured nature of ArchivesSpace person, corporate entity, and family name records.

Accession Migration

This one actually isn't done yet. As Max's recent post explained, we've recently started looking at migrating our legacy accession records from FileMaker Pro. ArchivesSpace has an accession csv importer but, due to the limitations of the converter, our own messy data, and some of the more complicated things we want to do with our accession imports, we'll need to make some local customizations to how the migration will be done. One option is to modify the accession csv importer as we have modified the EAD importer but, as we've been learning more and more about the API, we've realized that it will be much easier for us to come up with our own accession migration script that will use the API to migrate our accession records in the way that we want with the data that we have. We'll definitely be writing about the process along the way!
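
For reference, creating a single accession via the API looks roughly like this (a sketch with made-up values; at minimum, the accession JSONModel requires an id_0 and an accession_date):

    curl -H "X-ArchivesSpace-Session: $SESSION" -d '{"id_0": "2015", "id_1": "100", "title": "Example papers", "accession_date": "2015-09-03"}' "http://localhost:8089/repositories/2/accessions"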


3 comments:

  1. Great primer!

    Another easy way to test endpoints is to use something like Postman, which gives you a GUI for making requests (and the results are a bit more legible, IMO). Plus it will save your previous requests, making it easy to reuse/tweak things without excessively typing (and, if you're me, mistyping) URL patterns.

  2. Thanks for the recommendation! I had heard of Postman but for some reason thought it was something way more complicated. I just installed it and played around with it a bit and it's awesome. I can definitely see myself using this more than curl to test endpoints, although I'm still pretty confident in my ability to mistype repositories as respositories every so often.

  3. Thank you for the intro - have you observed any speed limitations using this API? How many threads can you hit it with before it chokes, etc.?
