Friday, July 31, 2015

Order from the chaos: Reconciling local data with LoC auth records

Arkheion and the Dragon, part II

By the end of last week's post/parable we had Library of Congress (LoC) name authority IDs for many of our person and corporation names, but had a lot of uncertainty as to whether these IDs had been matched correctly. The OpenRefine script we were using to query VIAF for LoC IDs also didn't support searching for any control access types beyond person and corp names.

We weren't quite satisfied with this, so after looking into some of our options, we decided to try a new approach: we would move from OpenRefine to Python for handling VIAF API queries and data processing, add a bit of web scraping, then use more refined fuzzy-string matching to remove false-positives from the API results. By the time we had finished, we had confirmed LoC IDs for over 6500 unique entities (along with ~2000 often hilariously wrong results) and, as an added benefit, were able to update many of our human agent records with new death-dates. All told the process took about a day.

Here's how we did it:


OCLC offers a number of programmatic access points into VIAF's data, all of which you can see and interactively explore here. Since we're essentially doing a plain-text search across the VIAF database, the "SRU search" API seemed to be what we were looking for. Here is what an SRU search query might look like: ?query=[search index]+[match type]+[search query]&sortKeys=[what to sort by]&httpAccept=[data format to return]

Or, split into its parts:
    ?query=[search index]+[match type]+[search query]
    &sortKeys=[what to sort by]
    &httpAccept=[data format to return]

There are a number of other parameters that can be assigned - this document gives a detailed overview of what exactly every field is, and what values each can hold. It's interesting to read, but to save some time here is a condensed version, using only the fields we need for the reconciliation project:

  1. Search query: how and where to find the requested data. This is itself made up of three parts:
    1. Search index: what index to search through. Relevant options for our project are:
      • local.corporateNames: corporation names
      • local.geographicNames: geographic locations
      • local.personalNames: names of people
      • local.sources: which authority source to search through. "lc" for Library of Congress.
    2. Match type: how to match the items in the search query to the indicated search index -- e.g. exact ("="), any of the terms in the query ("any"), all of the terms ("all"), etc.
    3. Search query: the text to search for, in quotes
  2. Sort keys: what to sort the results by. At the moment, VIAF can only sort by holdings count ("holdingscount").
  3. httpAccept: what data format to return the results in. We want the xml version ("application/xml")

Putting it all together, if we wanted to search for someone, say, Jane Austen, we would use the following API call:
    ?query=local.personalNames+all+"Jane Austen"+and+local.sources+=+lc
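The same call can be assembled in code. This is a minimal sketch, not the script we ran: the endpoint URL is an assumption (confirm it against OCLC's documentation), and the function names are made up for illustration. Note that it percent-encodes spaces as %20 where the hand-written call above uses "+".

```python
from urllib.parse import quote

# Assumed endpoint: treat this value as a placeholder and confirm it
# against OCLC's documentation before relying on it.
VIAF_SEARCH_URL = "https://viaf.org/viaf/search"


def build_viaf_query(index, text, source="lc"):
    """Return the search query for one VIAF index and one authority source."""
    return f'{index} all "{text}" and local.sources = {source}'


def build_viaf_url(index, text, source="lc", base_url=VIAF_SEARCH_URL):
    """Assemble the full SRU request URL, percent-encoding the query."""
    query = quote(build_viaf_query(index, text, source))
    return (
        f"{base_url}?query={query}"
        "&sortKeys=holdingscount"
        "&httpAccept=application/xml"
    )


print(build_viaf_url("local.personalNames", "Jane Austen"))
```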

The neat thing about web APIs is that you can try them out right in your browser. Check out the Jane Austen results here! It's an XML document with every relevant result, ordered by number of holdings for each entry worldwide, and including full VIAF metadata for every entity. That's a lot of data when all we're looking for is the single line with the first entry's LoC ID. This is where Python comes in.


Before we dive into the code, here is the high-level workflow we ended up settling on:

  1. Query VIAF with the given term
  2. If there's a match, grab the LoC auth id
  3. Use the LoC web address to grab the authoritative version of the entity's name.
  4. Intelligently compare the original entity string to the returned LC value. If the comparison fails, then we treat the result as a false positive.

Let's dig in!

VIAF, LC, and Python

First, we wrote an interface to the VIAF API in Python, using the built-in urllib2 library to make the web requests and lxml to parse the returned XML metadata. That code looked something like this:
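A minimal sketch of such an interface, under a few assumptions: it uses Python 3's urllib.request in place of urllib2 and the standard library's xml.etree.ElementTree in place of lxml, the endpoint URL is assumed, and the fetch parameter is a hypothetical hook so the function can be exercised without touching the network.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote
from urllib.request import urlopen

# Assumed endpoint; confirm against OCLC's documentation.
VIAF_SEARCH_URL = "https://viaf.org/viaf/search"

# Our EAD tag names mapped to the VIAF indexes listed earlier.
VIAF_INDEXES = {
    "persname": "local.personalNames",
    "corpname": "local.corporateNames",
    "geogname": "local.geographicNames",
}


def http_get(url):
    """Default fetcher: return the raw response body for a URL."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def search_viaf(tag, text, authority="lc", fetch=http_get):
    """Query VIAF for `text` and return the parsed XML root element."""
    cql = f'{VIAF_INDEXES[tag]} all "{text}" and local.sources = {authority}'
    url = (
        f"{VIAF_SEARCH_URL}?query={quote(cql)}"
        "&sortKeys=holdingscount&httpAccept=application/xml"
    )
    return ET.fromstring(fetch(url))
```

Calling search_viaf("persname", "Jane Austen") would issue a live request; passing a stub as fetch keeps experiments offline.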

You can see above that the search function takes three values: the name of the VIAF index to search in (which matches to one of our persname, corpname, or geogname tags), the text to search for, and the authority to search within (here LC, but it could be any that VIAF supports).

With the VIAF search results in hand, our script began searching through the xml metadata for the first, presumably most relevant result. All sorts of interesting stuff can be found in that data, but for our immediate purposes we were only interested in the Library of Congress ID:
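As a rough illustration of that step, the sketch below walks the parsed response and returns the first LC identifier it finds. The sample XML is invented and heavily trimmed, and the "LC|" prefix on source elements is an assumption about the response layout, so check both against a real response before reusing this.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed response used only to exercise the function.
SAMPLE = """
<searchRetrieveResponse xmlns:v="http://viaf.org/viaf/terms#">
  <records>
    <record><recordData><v:VIAFCluster>
      <v:sources>
        <v:source nsid="000000001">DNB|000000001</v:source>
        <v:source nsid="n00000001">LC|n  00000001</v:source>
      </v:sources>
    </v:VIAFCluster></recordData></record>
    <record><recordData><v:VIAFCluster>
      <v:sources><v:source nsid="n00000002">LC|n  00000002</v:source></v:sources>
    </v:VIAFCluster></recordData></record>
  </records>
</searchRetrieveResponse>
"""


def local_name(element):
    """Strip any '{namespace}' prefix from an element's tag."""
    return element.tag.rsplit("}", 1)[-1]


def first_lc_id(root):
    """Return the first LC ID in document order, or None if there is none."""
    for element in root.iter():
        if local_name(element) != "source":
            continue
        text = (element.text or "").strip()
        if text.startswith("LC|"):
            # Drop the "LC|" prefix and the padding spaces inside the ID.
            return text[3:].replace(" ", "")
    return None


print(first_lc_id(ET.fromstring(SAMPLE)))  # n00000001
```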

Now that we had the LC auth ID, we could query the Library of Congress site to grab the authoritative version of the term's name. Here we used BeautifulSoup, a Python module for extracting data from HTML:
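A sketch of the idea, with several assumptions stated up front: it swaps in the standard library's html.parser for BeautifulSoup, the URL pattern is assumed, and it guesses that the authorized heading is the first part of the page title. The sample page is invented; the real page layout may differ.

```python
from html.parser import HTMLParser

# Assumed URL pattern for an LC name authority page.
LOC_NAME_URL = "https://id.loc.gov/authorities/names/{lc_id}.html"


class TitleParser(HTMLParser):
    """Collect the text inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def authorized_name(html, separator=" - "):
    """Return the heading portion of the page title.

    Assumes the title reads '<heading> - <site name>'; the separator is a
    guess and may need adjusting for the live pages.
    """
    parser = TitleParser()
    parser.feed(html)
    return parser.title.split(separator)[0].strip()


sample_page = (
    "<html><head><title>Austen, Jane, 1775-1817 - Example Site</title>"
    "</head><body></body></html>"
)
print(authorized_name(sample_page))  # Austen, Jane, 1775-1817
```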

Now we had four data points for every term: our original term name, an unvetted LoC ID number and name, and the type of controlaccess term the item belongs to (persname, corpname, or geogname). As before, there were a number of obvious false-positives, but there were enough terms that we did not have nearly enough time to check through them individually. As Max hinted at in last week's post, this was fuzzywuzzy's time to shine.

Fuzzy Wuzzy was (not) a bear

(also not a Rudyard Kipling poem)

Max gave an overview of fuzzywuzzy, but just as a refresher: it's a Python module with a variety of methods for comparing similar strings under different lenses, all of which return a "similarity score" out of 100. Here is what a very basic comparison would look like:
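With fuzzywuzzy installed, the basic check is a call to fuzz.ratio(a, b). The sketch below is a standard-library stand-in built on difflib, so the exact scores may differ from what fuzzywuzzy reports; it is only meant to show the shape of the comparison.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity of two strings as an integer from 0 to 100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


print(ratio("Austen, Jane", "Austen, Jane"))  # identical strings score 100
print(ratio("Austen, Jane", "Jane Austen"))   # same name, different order: lower
print(ratio("Wisconsin", "Belgium"))          # unrelated names: lower still
```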

This is fine, but it's not very sophisticated. One of fuzzywuzzy's alternate comparison methods is much better suited for our purposes:

The token_sort_ratio comparison removes all non-alphanumeric characters (like punctuation), pulls out each individual word, puts them back in alphabetical order, and then runs a normal ratio check. This means that things like word order and esoteric punctuation differences are ignored for the purposes of comparison, which is exactly what we want.
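The steps just described can be sketched with the standard library alone. This is an approximation of the idea rather than fuzzywuzzy's own code (the library call is fuzz.token_sort_ratio), and the helper names are ours.

```python
import re
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity of two strings as an integer from 0 to 100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def normalize_tokens(text):
    """Lowercase, turn non-alphanumerics into spaces, and sort the words."""
    cleaned = re.sub(r"[\W_]+", " ", text.lower())
    return " ".join(sorted(cleaned.split()))


def token_sort_ratio(a, b):
    """Run a plain ratio check on the normalized, word-sorted strings."""
    return ratio(normalize_tokens(a), normalize_tokens(b))


print(normalize_tokens("Austen, Jane"))                  # austen jane
print(token_sort_ratio("Austen, Jane", "Jane Austen"))   # 100
```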

Now that we had a method for string comparisons, we could start building more sophisticated comparison code - something that returns "True" if the comparison is successful, and "False" if it isn't. We started by writing some tests, using strings that we knew we wanted to match, and strings we knew should fail. You can see our full test suite here.

As a result of our testing, we decided we would need to have unique comparison methods for each type of controlaccess term we were testing - one for geognames, one for persnames, and one for corpnames. Geognames turned out to be easiest - in that case our test criteria were met by a basic token_sort_ratio check: names were deemed correct when they had a fuzz score of 95 or higher. Both persnames and corpnames turned out to need a bit more processing before we had satisfactory results. Here is what we came up with:
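To give a sense of the shape of such a checker, here is a hedged sketch. Only the geogname rule (token sort score of 95 or higher) comes from this post; the date-stripping step and the threshold of 90 for persnames and corpnames are hypothetical stand-ins for the extra processing, and the scoring uses difflib rather than fuzzywuzzy.

```python
import re
from difflib import SequenceMatcher

GEOGNAME_THRESHOLD = 95  # from the post
NAME_THRESHOLD = 90      # hypothetical value for persnames and corpnames


def token_sort_ratio(a, b):
    """Score two strings after lowercasing, stripping punctuation, sorting words."""
    def key(text):
        return " ".join(sorted(re.sub(r"[\W_]+", " ", text.lower()).split()))
    return int(round(100 * SequenceMatcher(None, key(a), key(b)).ratio()))


def strip_dates(name):
    """Remove year spans such as '1950-' or '1958-2009' (an assumed cleanup step)."""
    return re.sub(r"\b\d{4}\s*-\s*(\d{4})?", "", name)


def is_match(local, authority, tag):
    """Return True when the authority heading plausibly matches our local term."""
    if tag == "geogname":
        return token_sort_ratio(local, authority) >= GEOGNAME_THRESHOLD
    if tag in ("persname", "corpname"):
        score = token_sort_ratio(strip_dates(local), strip_dates(authority))
        return score >= NAME_THRESHOLD
    raise ValueError(f"unsupported tag: {tag}")


print(is_match("Wisconsin", "Belgium", "geogname"))                                  # False
print(is_match("Austen, Jane", "Austen, Jane, 1775-1817", "persname"))               # True
print(is_match("Wonder, Stevie, 1950-", "Jackson, Michael, 1958-2009", "persname"))  # False
```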

With this script in hand, all we had to do was run our VIAF/LC data through it, remove all results that failed the checks, then use the resulting data to update our finding aids with the vetted LoC authority links (while removing all the links we had added pre-vetting). Turns out, we ended up with > 4200 verified unique persname IDs, ~1500 IDs for corpnames, and 800 geogname IDs, all of which we were able to merge back into our EADs using many of the methods Max described last week. We also output all the failed results, which were sometimes hilarious: apparently VIAF decided that we really meant "Michael Jackson" while searching for "Stevie Wonder". And, no, Wisconsin is not Belgium, nor is Asia Turkey.

This also gave us a great opportunity to update all of our persnames with death-dates if the LoC term had one and we did not. You can check out our final GitHub commit here - we were pretty happy with the results!


In retrospect there is a lot we could improve about the process. We played things conservatively, particularly for persnames, so our false-positive checking code itself had a number of false-positives. Our extraction of LoC codes from the VIAF search API could be a lot more sophisticated than just mindlessly grabbing the first result - our fuzzy comparisons did a lot to mitigate that particular problem, but since VIAF sorts its results by number of holdings worldwide rather than by exact match, we were left to the capricious whims of international collection policies. The web-request code is also fairly slow - since we didn't want to inadvertently DDoS any of the sites we're querying (and we'd rather not be the archive that took down the Library of Congress website), we needed to set a delay in between each request. When running checks against >10,000 items, even just a one second delay adds up. Even so, it still runs in an afternoon -- orders of magnitude faster than manual checking.
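To put a number on the delay: a one-second pause between 10,000 requests is about 2.8 hours of pure waiting before any network time is counted. A minimal sketch of the throttling pattern follows; the function names are ours, and the sleep parameter is a hypothetical hook that makes the pause easy to swap out.

```python
import time


def throttled(items, delay_seconds=1.0, sleep=time.sleep):
    """Yield items one at a time, pausing between consecutive items."""
    for position, item in enumerate(items):
        if position:
            sleep(delay_seconds)
        yield item


def total_delay_hours(item_count, delay_seconds=1.0):
    """Time spent purely on waiting between requests, in hours."""
    return max(item_count - 1, 0) * delay_seconds / 3600


print(round(total_delay_hours(10_000), 2))  # 2.78
```

A loop such as `for term in throttled(terms): ...` then spaces out whatever request is made in its body.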

We hope you've found this overview interesting! All code in the post is freely available for use and re-use, and we would love to hear if anyone else has tried or is thinking of trying anything similar. Let us know in the comments!

Friday, July 24, 2015

Arkheion and the Dragon: Archival Lore and a Homily on Using VIAF for Reconciliation of Names and Subjects

I'd like to begin today with a reading from the creation narrative of archives lore, and a homily:

1.1 In the beginning, when Arkheion created libraries and archives, archives were a formless void.

There was no data that could be used to find materials that corresponded to a user's stated search criteria. There was no retrievable data that could be used to identify an entity. There was no data that could be used to select an entity that was appropriate to a user's needs. And, finally, there was no data that could be used in order to acquire or obtain access to the entity described.

Indeed, these were dark times.

1.2 Then Arkheion destroyed Tiamat, the dragon of primeval, archival chaos. 

"Marduk Arkheion and the Dragon" by Professor Charles F. Horne. Here Arkheion channels the spirit of Melvil Dewey, wields thesauri like the Library of Congress Subject Headings (LCSH) and the Getty Research Institute's Art and Architecture Thesaurus (AAT), and destroys Tiamat, the dragon of primeval, archival chaos. [1]

In ancient times, libraries and archives began to organize their holdings. First by shape, then by content, then in catalogs. It wasn't long before bibliographic control as we know it today provided the philosophical basis for cataloging and our fore-librarians and fore-archivists began to define the rules for sufficiently describing information resources to enable users to find and select the most appropriate resource.

Eventually, cataloging, especially subject cataloging, and classification took root and universal schemes which cover all subjects (or, at least, the white male subjects) were proposed and developed.

1.3 Producers, Consumers and Managers lived in the fruitful, well-watered Garden of Arkheion.

And Arkheion saw that archives were good. 

1.4 But darkness covered the face of the deep. Producers, Consumers and Managers lived in paradise until the storied "fall" of Managers.

The "Fall of Man[agers]" by Lucas Cranach the Elder. [2]

To be human is to err. Human nature is fundamentally ambiguous, and it was only a matter of time before we strayed from the straight path of LCSH and AAT. We were like sheep without their shepherd. [3]

Why? Why! Maybe it was our haste. Maybe it was "human error." Maybe our fingers were just tired. Maybe we didn't want to keep track of name changes and death dates, OK!

For better or worse, eventually the terms we were entering in our Encoded Archival Descriptions (EADs) were not the terms laid out by the controlled vocabularies we were supposed to be using. We were forever, even ontologically, estranged from Arkheion.

1.5 We lived in darkness until a savior appeared, whom two prophets (the Deutsche Nationalbibliothek and the Library of Congress) had foretold (all the way back in 2012).

The Virtual International Authority File (VIAF) is an international service designed to provide convenient access to the world's major name authority files. Its aim is to link national authority files (such as the LCSH and the Library of Congress Name Authority File or LCNAF) to a single virtual authority file. A VIAF record receives a standard data number, contains the primary "see" and "see also" records from the original records, and refers to the original authority records. These are made available online for research, data exchange and sharing. Even Wikipedia entries are being added! Alleluia!

1.6 At long last, the Virtual International Authority File (VIAF) offered us a path to reconciliation with the thesauri used by Arkheion.

And that's what this post is about, using the VIAF Application Program Interface (API) to reconcile our <controlaccess> and <origination> terms!

1.7 Amen. So be it.

So say we all. So say we all!

Using VIAF to Reconcile Our <controlaccess> and <origination> Terms

Reconciling <controlaccess> and <origination> headings, or any headings, for that matter, to the appropriate vocabularies using VIAF is a fairly simple two-step process. Somehow, we managed to make it five steps...

As a result, this is a two-part blog post!

  1. Initial Exploration and Humble Beginnings

Reconciling our subject and name headings to the proper authorities is something we've been considering for quite some time. Since we were spending so much time cleaning and normalizing our legacy EADs and accession records to prepare them for import to ArchivesSpace as part of our Archivematica-ArchivesSpace-DSpace Workflow Integration project, we figured that we might as well spend some time on this as well.

We'd heard about initiatives like the Remixing Archival Metadata Project (RAMP) and Social Networks and Archival Context (SNAC) (and, of course, Linked Data!), but all of those seemed pretty complicated compared to what we wanted to do, which boiled down to adding an authfilenumber attribute to <controlaccess> and <origination> sub-elements (like <persname>, <famname> and <corpname>, just to name a few) so that they would populate the "Authority ID" field in ArchivesSpace.

What we wanted.

Why we wanted it (at least initially).

During our initial exploration, we ran across this GitHub repository. Using Google Refine and stable, publicly available APIs, the process described in this repository automatically searches the VIAF for matches to personal and corporate names, looks for a Library of Congress source authority record in the matching VIAF cluster, and extracts the authorized heading. The end result is a dataset, exportable from Google Refine, with the corresponding authorized LCNAF heading paired with the original name heading, along with a link to the authority record on

Cool! This was exactly what we needed! Even better, it used tools we were already familiar with (like OpenRefine)! And even better than that, it was designed by a colleague at the University of Michigan, Matt Carruthers. Go Blue!

All we needed to do was pull out the terms we wanted to reconcile. For all of you code junkies out there, here's what we used for that (at least initially--this turned out to be a very iterative process). This Python script goes through our EADs and spits out three lists, one each for de-duplicated <corpname>, <persname> and <geogname> elements:
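A minimal sketch of that kind of extraction, assuming EAD files without an XML namespace and using the standard library's xml.etree.ElementTree rather than lxml. It takes XML strings so it can run as-is; for files on disk, ET.parse(path).getroot() would supply the root instead.

```python
import xml.etree.ElementTree as ET

TAGS = ("corpname", "persname", "geogname")


def collect_terms(ead_xml_strings):
    """Return {tag: sorted list of unique, whitespace-normalized terms}."""
    found = {tag: set() for tag in TAGS}
    for xml_text in ead_xml_strings:
        root = ET.fromstring(xml_text)
        for tag in TAGS:
            for element in root.iter(tag):
                # Join nested text and collapse runs of whitespace.
                term = " ".join("".join(element.itertext()).split())
                if term:
                    found[tag].add(term)
    return {tag: sorted(terms) for tag, terms in found.items()}


sample = """<ead><archdesc><controlaccess>
  <persname>Austen, Jane</persname>
  <persname>Austen,  Jane</persname>
  <corpname>University of Michigan</corpname>
  <geogname>Ann Arbor (Mich.)</geogname>
</controlaccess></archdesc></ead>"""

print(collect_terms([sample]))
```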

As always, feel free to improve!

Then, we added those lists to OpenRefine, created column headings (make sure you read the README!), and replayed Matt's operations using the JSON in GitHub, and got this:


Simple! It didn't find a match for every term, but it was a good start! We were feeling pretty good...

It was about this time that we realized that we had forgotten a very important step, normalizing these terms in the first place! Oops!

  2. Normalizing Subjects (Better Late than Never!)

Note: For this, I've asked our intern, Walker Boyle, to describe the process he used to accomplish this.

Before we could do a proper Authority ID check, we needed to un-messify our controlled access terms. Even with our best efforts at keeping them consistent, after multiple decades there is inevitably a lot of variation, and typos are always a factor.

Normalizing them all originally seemed like something of a herculean task--we have over 100,000 individual terms throughout our EADs, so doing checks by hand would not be an option. Happily, it turns out OpenRefine has built-in functionality to do exactly this kind of task.

Given any column of entries, OpenRefine has the ability to group all similar items in that column together, allowing you to make a quick judgement as to whether the items in the group are in fact the same conceptual thing, and with one click choose which version to normalize them all into.

The first step in our process was to extract all of our control-access terms from our EADs along with their exact xml location, so that we could re-insert their normalized versions after we made the changes. This is really easy to do with Python's lxml module, and the process only takes a few seconds to run -- you can see the extraction code here. From there you just throw the output CSV into OpenRefine and start processing.

And it's super simple to use: select "Edit cells" -> "Cluster and edit" from the menu of the column you want to edit, and choose which clustering method to use (we've found "ngram-fingerprint" with an Ngram size of 2 works best for our data, but it's worth exploring the other options). Once the clusters have been calculated, you just go down the list choosing which groups to merge into a chosen authoritative version, and which to leave alone. Once you're satisfied with your decisions, click the "Merge selected and re-cluster" button, and you're done!
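For the curious, the idea behind n-gram fingerprint clustering can be sketched in a few lines. This is our simplified approximation, not OpenRefine's code: it lowercases, drops punctuation and whitespace, and keys each string by its sorted, de-duplicated character n-grams, but it skips the diacritic folding OpenRefine also applies.

```python
import re
from collections import defaultdict


def ngram_fingerprint(text, size=2):
    """Key a string by its sorted, de-duplicated character n-grams."""
    cleaned = re.sub(r"[\W_]+", "", text.lower())
    grams = {cleaned[i:i + size] for i in range(len(cleaned) - size + 1)}
    return "".join(sorted(grams))


def cluster(terms, size=2):
    """Group terms sharing a fingerprint; keep only groups with variants."""
    groups = defaultdict(set)
    for term in terms:
        groups[ngram_fingerprint(term, size)].add(term)
    return [sorted(group) for group in groups.values() if len(group) > 1]


terms = [
    "Ford Motor Company.",
    "ford motor company",
    "Ford Motor  Company",
    "General Motors",
]
print(cluster(terms))
```

Terms that differ only in case, punctuation, or spacing land in the same group, which is what makes the one-click merge possible.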

Clustering in OpenRefine

To re-insert the changed values into our EADs, we just iterated through the normalized csv data, reading the original xml path for each item and telling lxml to assign the new text to that tag (you can see our exact implementation here). One git push later, we were done. The whole process took all of an afternoon. In just a few hours, we were able to normalize the entirety of our control-access terms: some 61,000 names, corporations, genre-forms, locations, and subjects. That's pretty incredible.
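The round trip can be sketched as follows. lxml can report an element's path directly; this stand-in uses only the standard library, so it builds positional paths by hand, and the set of tag names is an assumption about which elements count as control-access terms.

```python
import xml.etree.ElementTree as ET

# Assumed set of control-access tags; adjust to the elements in your EADs.
TERM_TAGS = {"persname", "corpname", "famname", "geogname", "genreform", "subject"}


def terms_with_paths(root):
    """Yield (path, text) for every control-access term under `root`."""
    def walk(element, path):
        counts = {}
        for child in element:
            # Position among siblings with the same tag, starting at 1.
            counts[child.tag] = counts.get(child.tag, 0) + 1
            child_path = f"{path}/{child.tag}[{counts[child.tag]}]"
            if child.tag in TERM_TAGS:
                yield child_path, child.text or ""
            yield from walk(child, child_path)
    yield from walk(root, ".")


def apply_normalized(root, normalized):
    """Write each normalized value back to the element at its recorded path."""
    for path, new_text in normalized.items():
        root.find(path).text = new_text


ead = ET.fromstring(
    "<ead><archdesc><controlaccess>"
    "<persname>Austin, Jane</persname>"
    "<persname>Wonder, Stevie</persname>"
    "</controlaccess></archdesc></ead>"
)
rows = list(terms_with_paths(ead))
print(rows[0])
apply_normalized(ead, {rows[0][0]: "Austen, Jane"})
print(ET.tostring(ead, encoding="unicode"))
```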

  3. Running the Normalized Versions through the LCNAF-Named-Entity-Reconciliation Process (Should Have Been the First Step, Oh Well)

From there, it was just a matter of exporting the CSV, creating a dictionary using the Name and LC Record Link columns, like so:

And reincorporating them back into the EAD, like so:
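Both steps can be sketched together. The column names "Name" and "LC Record Link" are the ones mentioned above; the sample rows, the example.org link, and the function names are invented for illustration, and the XML handling uses the standard library rather than lxml.

```python
import csv
import io
import xml.etree.ElementTree as ET


def load_links(csv_file):
    """Map each Name to its LC Record Link, skipping rows without a link."""
    return {
        row["Name"]: row["LC Record Link"]
        for row in csv.DictReader(csv_file)
        if row["LC Record Link"]
    }


def add_authority_links(root, links, tags=("persname", "corpname")):
    """Set authfilenumber on every matching element; return how many changed."""
    changed = 0
    for tag in tags:
        for element in root.iter(tag):
            link = links.get((element.text or "").strip())
            if link:
                element.set("authfilenumber", link)
                changed += 1
    return changed


# Invented sample export; the link is a placeholder, not a real authority URL.
export = io.StringIO(
    "Name,LC Record Link\n"
    '"Austen, Jane",https://example.org/authority/1\n'
    '"Unknown, Person",\n'
)
ead = ET.fromstring(
    "<c><persname>Austen, Jane</persname><persname>Unknown, Person</persname></c>"
)
links = load_links(export)
print(add_authority_links(ead, links))  # 1
```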

Zlonk! We were reconciled! Again, we were feeling pretty good...

But again we realized that we had gotten a little ahead of ourselves (or at least I did). After some sober reflection following the high of reconciliation, we saw that there were still a couple of things wrong with our approach. First, there were a lot of "matches" returned from VIAF that weren't actual matches. Some of these were funnier than others, and we tweeted out one of our favorites:

No, VIAF. Wonder, Stevie, 1950- is NOT the same as Jackson, Michael, 1958-2009.
— UM BHL Curation (@UMBHLCuration) July 9, 2015

Long story short, we needed a way to do better matching.

  4. Enter FuzzyWuzzy

No, not the bear. FuzzyWuzzy is a Python library to which Walker introduced us. It allows you to string match "like a boss" (their words, not mine)! It was developed by SeatGeek, a company that "pulls in event tickets from every corner of the internet, showing them all on the same screen so [buyers] can compare them and get to [their] game/concert/show as quickly as possible."

The following quote would make Arkheion proud:

Of course, a big problem with most corners of the internet is labeling (sound familiar catalogers and linked data folks?). One of our most consistently frustrating issues is trying to figure out whether two ticket listings are for the same real-life event (that is, without enlisting the help of our army of interns).

That is, SeatGeek needs a way to disambiguate the many ways that tickets identify the same event (e.g., Cirque du Soleil Zarkana New York, Cirque du Soleil-Zarkana or Cirque du Soleil: Zarkanna) so that in turn buyers can find them, identify them, select them and obtain them.

We employed FuzzyWuzzy's "fuzzy" string matching method to check string similarity (returned as a ratio that's calculated by how many different keystrokes it would take to turn one string into another) between our headings and the headings returned by VIAF. Walker will talk more about this (and how he improved our efforts even more!) next week, but for now, I'll give you a little taste of what FuzzyWuzzy's "fuzz.ratio" function is all about.

FuzzyWuzzy's fuzz.ratio in action.

As you can see, the higher the ratio we require, the fewer <persname> and <corpname> elements get an authfilenumber attribute (certainly fewer than just blindly accepting what VIAF sent back in the first place!). In the end we decided that it was better to have fewer authfilenumber attributes and fewer mistakes than the opposite! You're welcome, future selves!
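The trade-off is easy to see by counting how many pairs survive at several cutoffs. This sketch uses difflib as a stand-in for fuzz.ratio, so the scores are approximations, and the thresholds and sample pairs are picked purely for illustration.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity of two strings as an integer from 0 to 100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def accepted_counts(pairs, thresholds=(60, 80, 100)):
    """For each threshold, count the (ours, returned) pairs scoring at or above it."""
    scores = [ratio(ours, returned) for ours, returned in pairs]
    return {t: sum(score >= t for score in scores) for t in thresholds}


pairs = [
    ("Austen, Jane, 1775-1817", "Austen, Jane, 1775-1817"),
    ("Austen, Jane", "Austen, Jane, 1775-1817"),
    ("Wonder, Stevie, 1950-", "Jackson, Michael, 1958-2009"),
]
print(accepted_counts(pairs))
```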

  5. The Grand Finale

Tune in next week for the grand finale by Walker Boyle!

Check out the exciting conclusion... Order from the chaos: Reconciling local data with LC auth records


This has been an exciting process, and, for what it's worth, easier than we thought it would be.

While we originally started doing this to be able to get Authority IDs into ArchivesSpace, we have been getting really excited thinking about all the cool, value-added things we may be able to do one day with EADs whose controlled access points have been normalized and improved in this way. Just off the top of my head:

  • facilitating future reconciliation projects (to keep up with changes to LCNAF);
  • searching "up" a hierarchical subject term like Engineering--Periodicals;
  • adding other bits from VIAF (like gender values, citation numbers, publication appearances);
  • interfacing with another institution's holdings;
  • helping us get a sense of what we have that's truly unique;
  • making linked open data available for our researchers; and
  • of course, adding Wikipedia intros and pictures into our finding aids! That one hasn't been approved yet...

Can you think of more? Have you gone through this process before? Let us know!

[1] "Marduk and the Dragon" by Prof. Charles F. Horne - Sacred Books of the East *Babylonia & Assyria* 1907. Licensed under Public Domain via Wikimedia Commons.
[2] "Lucas Cranach (I) - Adam and Eve-Paradise - Kunsthistorisches Museum - Detail Tree of Knowledge" by Lucas Cranach the Elder - Unknown. Licensed under Public Domain via Wikimedia Commons.
[3] I know what you're thinking. If Arkheion is omniscient, omnipotent and omnipresent (even omnibenevolent!), how could this be so? Perhaps that's a topic for a future blog post. Also, Arkheion is not real; this whole story is made up.

Tuesday, July 21, 2015

ArchivesSpace Donor Details Plugin

In a previous post about our ArchivesSpace implementation, Max detailed some of the things that ArchivesSpace is not (or, at least, is not yet), including some of the desired and/or required features for our local implementation of ArchivesSpace that are not currently present in the out-of-the-box ArchivesSpace distribution. One of the lacking features that Max mentions relates to information that we track for people and organizations who donate collections to the Bentley. 

Each of our donors has a unique donor number that we use to track accessions, organize case files, detail provenance in our finding aids and catalog records, and to conduct other fundamental archival functions. For example, the donor number of the source of a collection is indicated in the 500 field of that collection's MARC record and in the acquisition information section of that collection's finding aid. Donor numbers are essential components of our legacy data, and will continue to play a significant role in our day-to-day operations going forward. The ability to store and generate donor numbers in ArchivesSpace is necessary functionality for us that is currently not supported in the stock application.

ArchivesSpace Agents

In ArchivesSpace, each person, family, corporate entity, or software that relates to a resource or accession as a creator, source, or subject is linked to that resource or accession from an agent record. This will provide some really great functionality in that we will only need to create an agent record for a unique entity once and will thereafter simply add that agent as an agent link, saving us the hassle of looking up a name in LoC, our MARC records, or our finding aids to ensure consistent use across collections each time we add it as a source, creator, or subject. Our current practice does, despite our best efforts, occasionally lead to slight variations in spelling or forms of names and subjects across collections, or to typos in our finding aids, because we must type each subject or agent rather than select an existing one from a controlled value list (we're working on fixing those problems, though; more to come on that from Max and Walker soon!).

While having a controlled list of agents in ArchivesSpace will in many ways be an improvement over our current practice, it also means that our ArchivesSpace agent records will need to include all of the information we might need to track about an agent in one place. ArchivesSpace agent records contain many universally applicable fields that permit a wide range of users to generate agent records with links to authority records, various name forms, contact information, notes, and so on. We realize that not every institution will use donor numbers in the same way as us and, as Max notes in his blog post, "it wouldn't make sense for us to request ArchivesSpace to develop functionality that only works for one user," which means that it's up to us to find a solution that will meet our local needs.

Our first thought was to try to identify some field in the existing ArchivesSpace agent records that we could use to store donor numbers. We took a look at the options in agent name forms, contact details, and so on to determine if any of the existing fields would work.

ArchivesSpace person name form

We briefly considered adding multiple name forms for each donor -- one name form using the fields as they were intended in the proper DACS, RDA, or other fashion and another name form using a field, such as "Number" or "Qualifier" to store the donor number. We ultimately decided, however, that this would be a hacky approach, and it would be better to come up with a more sophisticated solution.

ArchivesSpace Plugins

One of the great things about ArchivesSpace that we have mentioned on this blog before, and will undoubtedly mention again, is the fact that ArchivesSpace is open source, allowing users to take a complete look at the inner workings of ArchivesSpace and modify or extend it to meet their needs. Even more significant is that the ArchivesSpace team has made it easy for users to add plugins to "customize ArchivesSpace by overriding or extending functions without changing the core codebase." Making changes to the core codebase could cause unintended issues down the road when upgrading to a new version of ArchivesSpace, but plugins are modular extensions of the application that can be quickly copied over to a new version of ArchivesSpace, easily tweaked to accommodate any changes, and readily shared with others to allow other users who desire the same functionality to drop the plugin into their local ArchivesSpace instance.

The ArchivesSpace distribution includes a few exemplar plugins that demonstrate the power of plugins and the relative ease with which they can be implemented. This plugin, for example, generates a new unique accession number, following a predefined format, each time a new accession record is created. Additionally, Hudson Molonglo makes numerous plugins available on their GitHub, including the excellent container management plugin that we intend to use in our ArchivesSpace instance. Among the plugins that they have developed is a plugin to add additional contact details to an agent record. Browsing the existing ArchivesSpace plugins, combined with some existing understanding of how to create a plugin based on our experiences adding a custom EAD importer and from reading the blog posts "Writing an ArchivesSpace Plugin" by Mark Triggs and "Customizing your ArchivesSpace EAD importers and exporters" by Chris Fitzpatrick, inspired me to attempt to create a plugin to extend the existing ArchivesSpace agent model to include donor numbers (and other necessary but lacking fields).

Donor Details Plugin

After just a few days of occasional work on the plugin, I developed a functioning prototype that adds a Donor Details option to each agent record, including a field to enter a donor number and a check box that, when selected, will automatically generate a new donor number by incrementing the current maximum donor number by one. The plugin also ensures that each donor number is unique, will only display the donor details field in an existing agent record if donor details exist for that agent, and will only allow users to select the auto-generate check box if there is not an existing donor number for the selected agent. The current code for the plugin, including slight variations for ArchivesSpace versions 1.1.2, 1.2.0, and 1.3.0, is on our GitHub. Here's how it works:

The donor details plugin currently consists of four directories: backend, frontend, migrations, and schemas, each of which follows the naming conventions and functions detailed in the ArchivesSpace plugin README.


The backend directory contains one subdirectory, model, which is where the new backend modules and classes used in the donor details plugin are defined. Within the model directory are five files and another subdirectory, mixins. 

The four files ending in _ext.rb (agent_corporate_entity_ext.rb, agent_family_ext.rb, agent_person_ext.rb, agent_software_ext.rb) are extensions of existing components of the ArchivesSpace backend model. Each file contains only one line that extends the model for each agent type by including the donor details module. The donor details module itself is defined in the mixins directory, in the file donor_details.rb.


Currently, the relationship between the donor details module and each donor detail record is one-to-many; each agent can have one array of donor details, which itself may have one or more donor detail records associated with it. This may change at some point (the relationship between agent and donor number should really be one-to-one), but I used existing components of agent records, such as contact details, that have a one-to-many relationship with agents as models for the donor details plugin, which is why it is designed as such (it also is not really a problem, since it will never matter that we can add multiple donor numbers to an agent as long as we don't). 

What this means in terms of the functionality of the plugin is that each agent record includes the donor details (plural) module, which itself contains records of the type donor detail (singular). This is an important distinction to note, as a good deal of the functionality of the plugin (not to mention the functionality of ArchivesSpace and other Ruby on Rails applications) is dependent upon using consistent naming conventions across all components. Throughout the entire plugin, including within the backend models, the database tables, the frontend displays, and the ArchivesSpace JSONModel schema definitions, the plural and singular forms of donor_detail are used consistently and deliberately following the required naming conventions. The DonorDetails module includes records corresponding to the donor_detail JSONModel, which are stored in the donor_detail table in the database, which are displayed using the donor_detail form template in the frontend, and so on.

The remaining file in the backend/model directory, donor_detail.rb, is where the backend components of each singular donor detail record are defined, including the links between the model and the relevant table in the ArchivesSpace database and the relationship between the model and the relevant JSONModel schema definition. It's also where the bits of code that automatically generate a new donor number and that validate the uniqueness of donor numbers are located.


The code in backend/model/donor_detail.rb is another really great example of how extensible ArchivesSpace is when plugins are used in conjunction with the existing ArchivesSpace codebase. Functions in this file, such as auto_generate and validates_unique, are used throughout the ArchivesSpace application to automatically generate numbers and display strings and to enforce uniqueness constraints (see both in use here, for example). I didn't actually have to write any code to define the auto_generate or validates_unique functions. All I had to do was find existing applications of those functions in the ArchivesSpace code to figure out how to put them to my own use. There are plenty of examples of this kind of reusable, modular code across the ArchivesSpace application, and it is what really makes it possible to easily add new features and functionality to ArchivesSpace (that is, provided you have the time to search and browse through the code in the ArchivesSpace GitHub repository to find those functions).


The schemas directory in the donor details plugin includes only two files, abstract_agent_ext.rb and donor_detail.rb.

The file abstract_agent_ext.rb works in much the same way as the previously discussed _ext.rb files: it extends the existing abstract_agent schema, specifically by adding a donor details array, containing one or many donor detail items, to each agent record.


The file donor_detail.rb is where the donor_detail JSONModel schema is defined:


This is where the properties of each singular donor detail record are defined. Currently, there are only a few properties: number (a string), number_auto_generate (a boolean), and dart_id (another string that Max and I just added this morning to explore linking our donors to the University's Development office donor database). "Number" and "dart_id" are fairly self-explanatory, and number_auto_generate is the property that will be set to true or false, in the ArchivesSpace staff interface, using a check box to indicate whether or not the donor number should be automatically generated by the system. Appending _auto_generate to a property is another one of those already-existing ArchivesSpace features (see it used elsewhere to indicate whether a sort name ought to be auto-generated) that can be used to later make use of yet another existing ArchivesSpace function when implementing the frontend check box. There really are a lot of useful reusable things in the ArchivesSpace code if you look for them!
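To make the shape of a donor detail record a little more concrete, here is a minimal Python sketch. It is not the plugin's Ruby code and it does not use ArchivesSpace's own auto_generate or validates_unique functions; it only mirrors the three properties named above. The "next highest integer" numbering scheme and the assign_number helper are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DonorDetail:
    """Mirrors the three properties described in the donor_detail schema."""
    number: Optional[str] = None
    number_auto_generate: bool = False
    dart_id: Optional[str] = None


def assign_number(detail, existing_numbers):
    """Fill in or check a donor number against numbers already in use.

    Hypothetical logic: when the check box is ticked and no number has been
    entered, use the next integer after the highest existing numeric donor
    number. Otherwise, reject a number that is already taken.
    """
    if detail.number_auto_generate and not detail.number:
        numeric = [int(n) for n in existing_numbers if n.isdigit()]
        detail.number = str(max(numeric, default=0) + 1)
    elif detail.number in existing_numbers:
        raise ValueError(f"Donor number {detail.number} is already in use")
    return detail
```

For example, calling assign_number(DonorDetail(number_auto_generate=True), {"1", "2"}) would give the record the number "3" under this made-up scheme, while trying to reuse "2" by hand would raise an error.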


Since we'll actually need these donor numbers to be stored somewhere in the ArchivesSpace database, we include a file, 001_donor_detail.rb, in the donor details plugin's migrations directory that creates a table with the necessary fields in the ArchivesSpace database. Here we also follow the proper naming conventions: the table we create is called donor_detail (same as the JSONModel schema and backend model) and the columns in that table are number, number_auto_generate, and dart_id (same as the names of properties in the donor_detail JSONModel schema). The table also includes columns for agent_person_id, agent_family_id, agent_corporate_entity_id, and agent_software_id, which will be populated with the unique identifier for an agent when it is created, to establish the relationship between the agent and donor detail tables.


The frontend directory contains all of the code that displays a donor detail form in the ArchivesSpace staff interface, so that staff can add donor details when creating or editing an agent and see any existing donor information when viewing an agent. The frontend directory contains three subdirectories: assets, locales, and views.

The assets directory contains only one file: donor_detail.crud.js, which includes the JavaScript that controls the check box and text field in the donor detail form. Essentially what it does is make the donor number text field read only and display the text "System generates when saved." This is another bit of code that I "borrowed" from the existing ArchivesSpace application, specifically from the bit of code that controls the check box for agent sort names.

The locales directory also contains only one file: en.yml, which is where the English language translations for each part of the donor details plugin are located. It's the file that ArchivesSpace uses to know what to display to staff using the frontend interface for things like "number," "dart_id," and so on. An en.yml file in a locales directory is another really helpful feature of ArchivesSpace (see the entire ArchivesSpace en.yml file). It allows you to centrally maintain, modify, or override the language that will be used in the ArchivesSpace public and staff interfaces and to then refer to the entry in the en.yml file using a variable name, rather than requiring you to change every instance of English language text that you or the ArchivesSpace developers would have used throughout numerous html files. 

The html files that integrate the donor details plugin into the staff interface are located in the views subdirectory, which in this plugin contains only an agents subdirectory. The views directory (and its corresponding subdirectories) is where the code that actually makes a form or a field display in the ArchivesSpace web interface is located. The views/agents directory contains three files: _donor_details.html.erb, _form.html.erb, and show.html.erb.

The only unique-to-this-plugin file in this directory is _donor_details.html.erb. The other two files override the stock ArchivesSpace agents _form.html.erb and show.html.erb files to make them display the form template defined in _donor_details.html.erb in addition to all of the already-existing agent forms.


The template in _donor_details.html.erb will display a label and text box for donor number along with an automatable check box if the donor number is blank; otherwise it will display only the donor number label and text box. As with a lot of components of this plugin, this code follows the standard form of other ArchivesSpace templates, such as the contact details template.

The change that I made to _form.html.erb can be found on line 27, and the changes made to show.html.erb on lines 93-121. The form and show files essentially replicate the function of surrounding bits of code that do things like display the templates for name forms and contact details, and use that code to display the template for donor details: a label and text field for donor number, a check box to automatically generate the number, and a label and text field for DART ID.

The final product in the staff interface looks like this:

That may look like a small achievement considering all of the bits and pieces that went into making a tiny form with two fields and a check box appear in ArchivesSpace agent records. But when you consider that our local implementation of ArchivesSpace can now include core functionality that it did not have before, that tiny form is really quite significant, and learning how to develop a plugin and spending time troubleshooting all of the errors that arose due to me being an archivist, not a programmer, was well worth the effort. Working on this plugin was also a really useful lesson in just how extensible and modifiable ArchivesSpace is by design, and it opens up a world of possibilities for us as we start to dive into migrating our legacy accession records (plenty more on that to come), develop new conventions for using ArchivesSpace, and think about how we can use the system to its fullest extent.


We've said it on this blog before, but it's worth stressing again: we are archivists, not programmers. While this plugin has been fully functional and has not had any issues yet, I would not be surprised if there are bits of the code that could have been written in a more understandable, concise, sophisticated, or otherwise programmatically correct way. I would be very appreciative of any feedback, criticism, pointers, etc. that any more experienced programmers reading this blog might have!

Friday, July 17, 2015

ArchivesSpace + BlackLight = ArcLight

We've mentioned elsewhere on this blog that as part of our Mellon-funded Archivematica-ArchivesSpace-DSpace Workflow Integration project, we are exploring other community initiatives (such as ArcLight) to identify possible synergies and integration points with our endeavors. We also mentioned in a post on what ArchivesSpace isn't that we are contributing to (and very excited about!) the ArcLight project out of Stanford. Well, we thought it was time to stop being coy and get on with it. This is a post on ArcLight--what it is, what it isn't, what's next, and how we've contributed to it so far!


Taken straight from their website, ArcLight is "an effort to build a Hydra/Blacklight-based environment to support discovery (and digital delivery) of information in archives, initiated by Stanford University Libraries."


ArcLight has three preliminary objectives:
  1. It will support discovery of physical and digital objects, including finding aids described using Encoded Archival Description (EAD) and, for the latter, presentation and delivery of digital materials. 
  2. It will be compatible with Hydra and ArchivesSpace. 
  3. It will be developed, enhanced and maintained by the Hydra/Blacklight community.

With regard to the first objective, we try to support discovery of physical and digital objects, but, as with many things, we could always do it better. We got especially excited about the fact that ArcLight hopes to support presentation and delivery of digital materials. Currently, our users have to jump back and forth between the system we use to display EADs and the system we use to provide access to our born-digital and digitized collections (and on top of that, in order to actually look at those digital objects, end users have to download them), so this would be a major improvement!

With regard to the second objective, compatibility with Hydra and ArchivesSpace is exactly what we need (although not necessarily in that order). We'll be going live with ArchivesSpace (fingers crossed!) sometime between now and March 31st, 2016, and, as part of an MLibrary-wide effort, we'll be moving to a Hydra-based implementation of Deep Blue in the "medium-term" (two to three years). In case you don't know the back story on that latter point, shortly after the grant was awarded, MLibrary decided to adopt Hydra as a repository platform. In light of this development, we hope to fulfill grant requirements by integrating DSpace into the workflow, but will take care to ensure that solutions for sharing data and metadata between systems will also be appropriate in a Hydra environment (and, in general, repository agnostic).

We're also excited about the fact that ArcLight could be integrated with ArchivesSpace. The process now to add or update EADs in our online display is rather cumbersome, involving at least three versions of a finding aid, two versions of the EAD, and up to four people (including one who doesn't even work at the Bentley!) to get them up and make them live. Integration with ArchivesSpace would cut all of those numbers down to one.

Finally, with regard to the third objective, we're interested in something that is developed by the Hydra/Blacklight community for the same reason we're interested in something that is developed by the Archivematica, ArchivesSpace or DSpace community: the community.

ArcLight Design Process

If you're interested in the ArcLight design process, you can read more about it here. In short, it consists of three stages:

Discovery >>> Information Architecture >>> Interaction & Visual Design (and Development)

The second phase of "Discovery," which kicked off their user-centered design process to produce documentation to guide development, and which we have been contributing to, is just now coming to a close.  Stanford hopes to start the actual development of ArcLight by 2016, and I for one can't wait to see what happens!

What It Isn't

Needless to say, we got very excited about all the things that ArcLight might be. But before I go any further, I need to be upfront about three very important things that ArcLight is not:

  1. It isn't the novel.
  2. It isn't the comic.
  3. It isn't the album.

Three of the things that ArcLight is not. [1] [2] [3]

In fact, I think Andrew Berger, Digital Archivist at the Computer History Museum, said it best (in reference to yet another thing that ArcLight is not):

In All Seriousness, What It Isn't

We did actually get very excited about ArcLight, which is why we began contributing to the project in the first place. We see a lot of synergy between what they're doing and what we're doing. In fact, sometimes we get so excited thinking about all the things ArcLight and, for that matter, Hydra, could be that we forget one of their more important characteristics: that they aren't actually anything, at least not yet, at least not for us in a tangible way.

And the danger about that, of course, at least for [over?]eager archivists like ourselves, is that we end up wanting these systems to be everything! As a result, our Stakeholder Interest and Goals document, which I'll discuss in further detail below, is a bit pie-in-the-sky; we can only hope that our high expectations won't set us up for disappointment!

How We've Contributed

We've contributed in two significant ways to the ArcLight project, by conducting some user interviews and by contributing the aforementioned Stakeholder Interest and Goals document. I'll spend a little bit of time talking about the user interviews, but the rest of the post will mostly be devoted to the Stakeholder Interest and Goals document.

User Interviews

First, doing these user interviews was fun! It was especially interesting to hear from the perspective of both archivists (two from Curation and one from Reference) and researchers. It was also a good reminder that, from a usability standpoint, we probably should have been doing interviews (or something like them) all along.

These interviews have been transcribed and will feature heavily in the next stage of the ArcLight design process. In fact, I just received an email this week that they need to comb through the transcripts and note common issues raised, relevant user scenarios, good quotes that indicate user goals, etc. According to Gary Geisler at Stanford, they will likely share some summary/distillation of the interviews with the broader ArcLight design collaborator group later.

Stakeholder Interest and Goals document

We also created a Stakeholder Interest and Goals document. I've embedded it below (just to be fancy), but you can also take a look at it here.

Basically, this document gives an overview of our access stack (is that a thing?), the grant project and our interest in ArcLight, as well as an acknowledgement that we are ignorant about some parts of how this whole thing will work, and that some of the functionality we're about to describe may in actuality reside in a repository rather than in ArcLight.


In case you aren't interested in reading the whole thing, here are some of my favorite archivist goals (with commentary):
Attractive, best-in-class discovery interface for archival content, flexible enough to change as quickly as best practices for web design change.
It only seems to take two years for a website to look ten years old. Weird how that happens.
Support and recognize EAD elements for search and disambiguation.  At the same time, move away from presenting EAD finding aids as static objects. 
The current way we display EADs allows for some search and disambiguation of EAD elements, but for the most part, EADs don't end up looking or feeling a whole lot different online than when printed out. We're excited about all the ways a BlackLight interface might enhance searching and browsing online.
Support and recognize PREMIS rights and conditions governing access/use so these metadata are acted upon in conducting and presenting search results.  Limit access to archival components and digital objects in a range of ways based on PREMIS statements, including full embargo of highly sensitive content (such that this content would not display in public search results).
The PREMIS rights data model is certainly a richer representation of rights information than anything machine actionable that we do at the moment. We're actually gearing up here for a big discussion of how rights information will be shared in Archivematica and ArchivesSpace later on today! I'm sure a future post on this blog will be devoted to rights and rights management as it relates to the work we're doing for this grant.
Integration with other platforms, like Aeon, Mirlyn (local catalog), HathiTrust, Archive-It, Archivematica, search engines and metadata aggregators such as ArchiveGrid and DPLA.
We want it all! Not everybody (and sometimes not even a majority of people) comes to library and archives collections through the website. We want to enable everyone to find what they need however they end up getting here in the first place.
Assessment through search logs, analytics, download reports, etc.
Again, this is something we're trying to be better about. This would probably require authentication, especially as we are more concerned than ever about determining impact of archival collections on, for example, undergraduate education outcomes and success.
Ensure that any updates, revisions, or additions to ASpace descriptive, administrative, and rights metadata are immediately reflected in the ArcLight interface.
This would help with the cumbersome process I described above.


Some of my favorite researcher goals include:
Integrate discovery and display/streaming of digitized/born-digital content, so that researchers don’t have to switch from a discovery layer to a repository.
This was mentioned above, but to elaborate, this would actually prevent researchers from having to switch from a discovery layer to multiple repositories and find their way back again, which is currently the case, since we have different repositories for different media types.

Search options:
  • Provide full-text search of indexable content (including OCR’d scans, plain text, PDFs, Word documents, and/or other relevant file formats).
  • Limit searching/browsing/faceting to particular collections 
  • Permit fuzzy searching/approximation so that researchers do not need to know the exact spelling of subjects, names, or other keywords.

A brief "day in the life" look at our server logs showed that users were having a lot of trouble searching our collections. They were getting thrown off by searches that contained typos and by searches that didn't operate Google-style. We'd like for that not to be an issue.
Communicate relationships between physical and born-digital/digitized components of collections in a usable and meaningful way.
Context is something we talk about a lot here. We don't want researchers to limit themselves to our digital collections at the expense of our physical collections just because they are easier to access. Ideally, this aspect of ArcLight would work well for those with experience with archives and those without.
For complex digital objects, display files/groups of content/archival components in a meaningful way. Allow researchers to view all the files in a folder (or files in a finding aid) without having to go back and forth to the finding aid.
We'd love to see examples of any repository that can do this in a meaningful way for hierarchical, heterogeneous collections of born-digital objects (not disk images). Feel free to leave a comment if this describes your repository!

Library Information Technology

We're currently working this out! As a stakeholder, they represent many other stakeholders, so their task is a bit more daunting!


So, as a recap, ArcLight is an exciting initiative out of Stanford that you should follow! Our grant project focuses heavily on more of the back-of-the-house aspects of the curation of digital archives, but an attractive and user friendly interface is the second half of the equation! In fact, it's often what attracts donors in the first place!

If you are interested in participating in the design process for ArcLight, please subscribe to the mailing list used for ArcLight design-related announcements and discussions by sending an email to:

arclight-design-join (at) lists (dot) stanford (dot) edu

[1] "Arc-Light-cover" by Source. Licensed under Fair use via Wikipedia
[2] "Arclight (Marvel villain)" by Source. Licensed under Fair use via Wikipedia
[3] "Arclightlau" by Source. Licensed under Fair use via Wikipedia

Monday, July 13, 2015

Separation Anxiety!

To Save or Not to Save...

My old mentor, Tom Powers, used to say that the business of archives is not just about saving things, but also throwing them away--whether to adhere to collecting policies, conserve resources, or to help researchers identify truly essential records.  Identifying 'separations' is therefore an essential part of the appraisal process here at the Bentley Historical Library.  While your institution might refer to this process as 'weeding' or 'deaccessioning', I'm willing to bet that our goals are the same: removing out-of-scope or superfluous content from accessions before they become part of our permanent collections.

Of course, with this great power comes great responsibility, which leads to what my good friend David Wallace refers to as the 'dark night of the (archival) soul': can we fully anticipate future uses? Are we getting rid of things that some researcher, somewhere, at some time, might find truly useful?

Clearly, it's possible.

At the same time, we're in this for the long haul, not a sprint; the overall sustainability of our collections (and institutions) demands the strategic use of our limited resources.  Saving everything "just in case" isn't good archival practice: it's hoarding.  We're also keenly aware that having staff review and weed content at the item-level is neither efficient nor sustainable (and certainly not the best use of staff time and salaries).  Balancing available resources and staff time with the widest possible (and practical) future uses of digital archives thus becomes the crux of the matter (and something we'll no doubt continue to wrestle with for many moons to come)...

Documenting our Policies

No matter what course of action we take, it remains important to make our thoughts and reasoning available if for no other reason than to manage stakeholder expectations.  Since revamping our processing workflows for physical and digital materials last year, we've put renewed emphasis on 'MPLP' strategies, especially the idea that any kind of weeding or separations should occur at the same level (i.e., folder or series) as the arrangement and description.  Item-level weeding is strictly avoided unless there are particularly compelling reasons to remove individual items (such as extremely large file sizes or the presence of sensitive personal information).

To ensure consistency (and transparency for donors and researchers), we apply the same criteria for separations to digital and physical items, as outlined in our processing manual.  The following categories of materials are thus typically not retained:
  • Out of scope material that was not created by or about the creator or items that fall outside of the Bentley's collecting priorities.
  • Non-unique or published material that is readily available in other libraries, another collection at the Bentley, or in a publication.  
  • Non-summary financial records such as itemized account statements, purchase orders, vouchers, old bills and receipts, cancelled checks, and other miscellaneous financial records.
  • "Housekeeping" records such as routine memos about meeting times, reminders regarding minor rules and regulations, or information about social activities.
  • Duplicate material.
During our review and appraisal of digital archives, we thus keep an eye out for entire directories that contain the above categories of materials.  Given the impracticality of looking at every file, we rely upon reviewing directory and filenames and then viewing/rendering a representative sample of content as needed (using tools mentioned in my previous post on appraisal).  The goal here is not to search for individual files that meet the above criteria, but to catch larger aggregations of content that are simply not appropriate for our permanent collections.

Automation Alley

As with other aspects of our ingest and processing workflows, we've tried to automate (or at least semi-automate) aspects of our separations/deaccessioning efforts.  Two examples of this were discussed in that aforementioned entry on appraisal: scanning for sensitive personal information with bulk_extractor (which still requires the manual review and verification of results) and the identification of duplicate content.

As I discussed our approach to separating duplicate content in that piece, I won't rehash it here (short version: we don't weed individual files, but will deaccession an entire 'backup' folder if it mirrors content in another directory.  Wait--was that a rehash? Sorry...).  I will, however, note that there have been some informative discussions on the topic of duplicates on the 'digital curation' Google group, including this thread on photo deduplication which has some great links...
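For anyone curious what the "does this backup folder mirror another directory?" check might look like in code, here is a minimal Python sketch. It is not the tool we actually use; the function names are hypothetical, and it assumes that "mirrors" means every file in the candidate folder has a byte-for-byte identical copy (by SHA-256 checksum) somewhere in the other directory, regardless of filename.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Checksum a file in chunks so large files don't have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_digests(directory):
    """Return the set of checksums for every file under a directory."""
    return {sha256_of(p) for p in Path(directory).rglob("*") if p.is_file()}


def is_mirror(candidate, original):
    """True if the candidate holds files and all of them also exist in original."""
    candidate_digests = file_digests(candidate)
    return bool(candidate_digests) and candidate_digests <= file_digests(original)
```

A folder that passes this check could then be reviewed as a whole for deaccessioning, which fits the folder-level (rather than file-level) approach described above.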

Another strategy that we've employed in our current workflows and are developing with Artefactual Systems for inclusion in Archivematica's new appraisal and arrangement tab is functionality to separate all files with a particular extension.  Our primary use case has been to remove system files (such as thumbs.db, temporary or backup files, .DS_Store and the contents of __MACOSX directories, including resource forks) that we've considered to be artifacts of the operating system rather than the outcome of functions and activities of our collections' creators.
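As a rough illustration of this kind of bulk separation (and not the Archivematica functionality itself), the following Python sketch flags system files by name, by parent directory, or by extension. The specific extensions in SEPARATE_EXTENSIONS are placeholder assumptions; an archivist would supply their own list.

```python
from pathlib import Path

# File and directory names mentioned above, compared case-insensitively.
SYSTEM_FILE_NAMES = {"thumbs.db", ".ds_store"}
SYSTEM_DIR_NAMES = {"__macosx"}
# Hypothetical examples of temporary/backup extensions to separate.
SEPARATE_EXTENSIONS = {".tmp", ".bak"}


def is_system_artifact(path):
    """Return True if a path looks like an operating system artifact."""
    path = Path(path)
    if any(part.lower() in SYSTEM_DIR_NAMES for part in path.parts):
        return True
    if path.name.lower() in SYSTEM_FILE_NAMES:
        return True
    return path.suffix.lower() in SEPARATE_EXTENSIONS


def find_separations(root):
    """List every file under root that would be flagged for separation."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and is_system_artifact(p.relative_to(root))
    )
```

The output is only a list of candidates; nothing is deleted, so an archivist can still review the list before anything is actually separated.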

On this score, some recent discussions from the Digital Curation and Archivematica Google groups have been relevant.  Jay Gattuso, Digital Preservation Analyst at the National Library of New Zealand notes that their "current approach is to not ingest the .DS_Store files, as they are not regarded by as an artefact of the collection, more as an artefact of the system the collection came from."  This represents our general line of thinking, which has also influenced our approach to migrating content off of removable media and disk imaging: we are devoting our resources to the preservation of content rather than the preservation of file systems and storage environments (except in cases where the preservation of that file system or environment is essential to maintaining the functionality and/or accessibility of important content).

Chris Adams adds some important nuances to the same thread, reporting that .DS_Store files are "only used to store custom desktop settings and the most which would happen without them is that you'd lose a custom background or sort order" and noting that "Resource forks (i.e. ._ files on non-HFS filesystems) are far more of a concern because classic Mac applications often stored important user data in them – the classic example being text documents where the regular file fork had only plain text but the resource fork contained styling, images, etc. which are critical for displaying the document as actually authored."

Knowing that there could be important information in legacy resource forks reinforces the need to discuss record creation and management practices with donors as part of the accession process.  In many cases, however, these conversations aren't convenient or even possible (as when we deal with the estate of a deceased creator).  What do we do then?  A quick Google search reveals several tools to view/extract the contents of resource forks, and it strikes me that it might be possible to put together a script that could cycle through resource forks and flag any that contain additional information and should thus be preserved.  Not having really worked with resource forks or (to my knowledge) encountered one that stored the kind of additional information Adams mentions, I don't know how feasible this would be.
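Purely as a thought experiment, such a script might start out something like the Python sketch below. It assumes the resource forks have ended up as "._" sidecar files in the AppleDouble format (magic number 0x00051607, with entry ID 2 marking the resource fork), and it simply reports how many bytes each one claims to hold. The function names and the min_bytes threshold are hypothetical, and size alone can't tell us whether a fork holds anything a researcher would care about, so this would be a first pass for human review at best.

```python
import struct
from pathlib import Path

APPLEDOUBLE_MAGIC = 0x00051607
RESOURCE_FORK_ENTRY_ID = 2
HEADER_SIZE = 26  # magic (4) + version (4) + filler (16) + entry count (2)
ENTRY_SIZE = 12   # entry ID (4) + offset (4) + length (4)


def resource_fork_length(data):
    """Return the resource fork length recorded in AppleDouble data.

    Returns None if the data is not AppleDouble, and 0 if it is AppleDouble
    but lists no resource fork entry.
    """
    if len(data) < HEADER_SIZE:
        return None
    magic, _version, _filler, count = struct.unpack(">II16sH", data[:HEADER_SIZE])
    if magic != APPLEDOUBLE_MAGIC:
        return None
    for index in range(count):
        start = HEADER_SIZE + ENTRY_SIZE * index
        entry = data[start:start + ENTRY_SIZE]
        if len(entry) < ENTRY_SIZE:
            break  # truncated entry table; stop reading
        entry_id, _offset, length = struct.unpack(">III", entry)
        if entry_id == RESOURCE_FORK_ENTRY_ID:
            return length
    return 0


def flag_resource_forks(root, min_bytes=1):
    """List (path, length) for "._" files whose resource fork meets min_bytes."""
    flagged = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.name.startswith("._"):
            length = resource_fork_length(path.read_bytes())
            if length is not None and length >= min_bytes:
                flagged.append((path, length))
    return flagged
```

Anything this flags would still need to be opened with one of those resource fork viewers to see whether it really contains styling, images, or other user data.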


So...where does this leave us?  

As part of our grant we want to be able to separate/deaccession material from within Archivematica by applying tags to specific folders/files or by doing a bulk operation based upon file format extension.  Once the deaccession is finalized, Archivematica would query the user for a description and rationale for the action and create a deaccession record in ArchivesSpace:

Beyond this development work, we're trying to inform our decision-making process for separations/deaccessions by better understanding researcher needs and expectations.  Our archivists have participated in HASTAC 2015 up at Michigan State University in addition to various digital humanities events here at the University of Michigan.  Knowing what kinds of data, tools, and procedures are gaining popularity will hopefully help us save more of the materials (and metadata) that researchers want and need.  Of course, there are still the Rumsfeldian unknown unknowns to contend with....

In any case, your input and feedback (and/or scalable solutions for what to do with those Mac resource forks) would be most gratefully appreciated: let us know what you think!

Tuesday, July 7, 2015

Git-Flow for Archival Workflows

We here at the Bentley Historical Library have been using GitHub for quite some time now. (Really, it's only been since May 19th of this year, so not even two months, but who's counting?) Since we have so much experience, we figured it was about time for a post on how we handle version control for our project to migrate all of our legacy EADs into ArchivesSpace using Git and GitHub (and no, they're not the same thing).

Git is not the same as GitHub. [1] I also just learned that "git" is English slang for "unpleasant person."

GitHub is not the same as Git. [2] It turns out that GitHub is not a center for unpleasant people.

The Problem: Version Control

The following transcript is adapted from an actual four-minute chat conversation I may or may not have had with a colleague (who may or may not be Dallas). I think it describes our frustrations better than a narrative description could.

Names have been changed to protect the innocent (and the guilty, i.e., me!). Also, I'm just back from a vacation where I spent some time at the beach, so ocean animals are on my mind.

Anonymous White-Spotted Puffer [3]
10:45 AM
so, it sounds like anonymous red lion fish's thing got added to real_masters_all.
10:46 AM
that's probably my fault. if there are any big mistakes anonymous red lionfish can just fix those, maybe using a backup
has anonymous great white shark replaced the ead masters yet?

Anonymous Atlantic Ghost Crab [4]
10:47 AM
umm, yeah i dunno

Anonymous White-Spotted Puffer
i didn't realize anonymous red lionfish had done it to real_masters_all
10:48 AM
because anonymous red lionfish was working from a copy anonymous red lionfish had made

Anonymous Atlantic Ghost Crab
anonymous great white shark has not replaced ead masters yet but anonymous goldband fusilier and i have probably made our own changes already
but maybe anonymous red lionfish could take a copy of just the things in a csv
10:49 AM
and we could fold those back into the real masters. hopefully there won't be too much that needs to be fixed.

Anonymous White-Spotted Puffer

The problem was that there were too many people trying to do too many things at once to the same version (or two, or three) of our EADs; the problem was version control!

Even though, as I mentioned, we had been using GitHub for quite some time to showcase and share our custom ArchivesSpace EAD Importer and the tools we've developed to clean or prep our legacy EAD and MARC XML for migration to ArchivesSpace, as well as to make changes to the Archivematica documentation (yes, I'm rather proud of this and this contribution--thanks again for showing us the ropes, Justin and Sarah!), we hadn't been using Git and GitHub the way they were intended to be used: to solve the problem of version control when working in teams whose members may or may not be working right next to each other every day (or in our case, even on the same computers every day).

After some discussion about the suitability of GitHub for this project (while we know a number of libraries and archives use GitHub for a variety of purposes, we're still not sure if there is any precedent for putting EADs on GitHub--maybe we're the first!), we decided to move forward with creating a "repo" for our working copy of the EADs. To fit in with the A-Team theme, we went with the name vandura, after the model of the GMC van used in the show.

We even figured out how to add a picture to our README file in Markdown:


We decided to retain the "Real_Masters_all" directory name (because that is so different from "Real_Masters" and "FindingAids/EAD/Master"--all actual directory names!) for our EADs to serve as a reminder of those dark times, in the not too distant past, when things seemed simple, and when we just made changes to our version of record as we pleased, without thought to the hard work of our colleagues that we may or may not have been overwriting (because hey, we'll never know, and there would be no way to prove it anyway!).

Wait, I've Heard of GitHub...What's Git?

Before we go on...

If you're like me (an archivist, not a programmer!) you may or may not have known that Git and GitHub are actually two different things. Git is a distributed version control system (that is, it does not work like a shared network drive does--no copy of a project directory is any better or more 'authoritative' than any other, and team members collaborate on identical copies). GitHub is a web-based Git repository hosting service (which is why it is so popular with open source software like Archivematica and ArchivesSpace), which also offers its own features (like forks and pull requests). Git is a tool that you mostly use in the terminal on your local computer, while GitHub is a service that you mostly use with a graphical user interface on the Internet.

Why Use Git and/or GitHub?

So Git is a version control system, and GitHub is used in conjunction with it for work in teams. Why use them?

  • Git and GitHub are not just for software, or for people with l337 h4x0r s|<1llz. In fact, both of these work extremely well for anything that is primarily text, whether that is your EADs in XML, your catalog records in MARC, your website in HTML or even your blog written in Markdown.
  • All the cool kids are doing it. Whether it's companies like Artefactual Systems, Inc. (Archivematica) or Lyrasis (ArchivesSpace), or any of the institutions on this list, GitHub has become the place that open source software is shared with others.
  • It's better than regular old backups. With Git, you make what are called "commits" (more on that later) with meaningful messages (e.g., "correcting spelling mistakes" or "changing id attribute to authfilenumber"). You can then go back and look at all of your commits, remember why you made a particular change, and even revert to a version of a project before a particular commit. All of that is much more useful when looking back on the work you've done than seeing a backup of your project made at an arbitrary time by a computer.
  • It is distributed. Everything is local. See comment above about difference between this process and using a shared network drive.
  • Interns have a place where they can point to the work they've done. With GitHub, since interns have their own accounts and since there is an online, public record of every change they have ever made, interns can point to a place online where they can showcase their work for potential employers.
  • You don't have to be at the Bentley or using any particular computer to do some work. That's handy.
  • Everything that happens gets recorded. Check this out. That's right, all 418 changes we've made in the 27 days we've used Git and GitHub for our EADs. It's like an audit trail. And you know we digital preservation types like our audit trails.
  • Management of the whole process is much easier. While there are many hands working on the same set of files, only a few hands get to accept and merge what are called "pull requests" (again, more on that later) into the Bentley's repository.
  • GitHub will tell you when you're going to overwrite someone else's work! That's probably my favorite benefit. While this doesn't make the process of figuring out what to do about conflicts any easier, at least we know about them!

Convinced? I am.

And the How: How We're Using Git and GitHub for Curation Workflows

While we haven't even begun to scratch the surface of all the different operations you could do with Git and GitHub, here's the handful that we've found helpful so far, broken down into three stages: 1) the initial, project, and daily setup; 2) the process for making changes; and 3) the process for merging those changes with the Bentley's version.

Say what you want about my handwriting, but I think that's a pretty good rendering of a laptop, if I do say so myself.

The Setup (with Git and GitHub)

While Git comes standard on Linux operating systems, it doesn't on Windows or Mac. We're a Windows shop, so there was some setup involved.

Once Per Lifetime

If you haven't already, join GitHub. The instructions are here. If you're using Windows like us you'll also need to download and install the latest version of GitHub for Windows.

Once Per Project

Fork the vandura (or any other) repository to your account online. This basically means make a copy of the repository on your account. Note that "repo," which you'll hear people say sometimes, is short for "repository" and is just a fancy word for folder with files or other folders in it, or a project directory. On GitHub, you can do this by navigating to the repository you want to fork and clicking Fork in the top-right corner of the page.

Create a local clone of your fork on your computer. In other words, make a copy of the repository on your local computer. You can do this by navigating to your fork of the repository on GitHub and copying the HTTPS clone URL in the right sidebar to your clipboard. Then, open the Git Shell application and type:

git clone 

Next you'll need to configure a remote for your fork (so it knows where it came from). Move into the project directory by typing:

cd vandura

Then check to see what the current remote is by typing:

git remote -v

Specify a new upstream remote repository (pointing your clone back to the original repository you forked from) by typing:

git remote add upstream

Finally, verify the new upstream remote repository by typing:

git remote -v
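
To see the once-per-project steps end to end without touching GitHub, here is a minimal sketch that runs entirely on your own machine. The two repository locations are hypothetical local stand-ins (temporary folders, not real URLs) for the original repository and your fork; in real use you would paste the HTTPS clone URLs from GitHub in their place. The file name and commit identity are also just placeholders.

```shell
set -e
work=$(mktemp -d)

# Hypothetical stand-in for the original repository on GitHub.
git init -q "$work/bentley-vandura"
cd "$work/bentley-vandura"
echo "# vandura" > README.md
git add README.md
git -c user.name="Example Archivist" -c user.email="archivist@example.org" \
    commit -q -m "adding README"

# Hypothetical stand-in for "forking": a copy of that repository under your account.
git clone -q --bare "$work/bentley-vandura" "$work/my-fork.git"

# Create a local clone of your fork and move into it.
git clone -q "$work/my-fork.git" "$work/vandura"
cd "$work/vandura"

# At this point the clone only knows about your fork, which Git calls "origin".
git remote -v

# Add a second remote, "upstream", pointing at the original repository.
git remote add upstream "$work/bentley-vandura"

# Both remotes should now be listed.
git remote -v
```

The second `git remote -v` should list both origin (your fork) and upstream (the original), which is what makes the daily syncing possible.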

Once (or Twice...) Per Shift (with Pictures!)

The rest of these instructions detail our day-to-day work, starting with syncing a local version of the files with the Bentley's master version. So here we go (with pictures--thanks, Devon! [5])...

It starts with syncing your fork, ensuring that what you have on your local computer matches what the Bentley has online (which may have been updated since you last sat down to do some work). After ensuring that you're in the appropriate directory, you do this by...

Using git fetch upstream to fetch new commits from the upstream repository.

Using git merge upstream/master to merge the changes from upstream/master into your local master branch.
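
If you'd like to try the fetch-and-merge sync somewhere safe first, the sketch below simulates it with throwaway folders. Everything here other than the two git commands at the end is an assumption made for illustration: the folder names, the file example.xml, and the commit identity are hypothetical, and a local folder stands in for the Bentley's repository on GitHub.

```shell
set -e
export GIT_AUTHOR_NAME="Example Archivist" GIT_AUTHOR_EMAIL="archivist@example.org"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
work=$(mktemp -d)

# Hypothetical stand-in for the upstream repository, holding one finding aid.
git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/master   # name the branch "master", as in the post
echo "<ead>version 1</ead>" > example.xml
git add example.xml
git commit -q -m "adding example finding aid"

# Stand-in for your local copy, with the upstream remote configured.
git clone -q "$work/upstream" "$work/local"
cd "$work/local"
git remote add upstream "$work/upstream"

# A colleague's change lands upstream after you last sat down to work.
cd "$work/upstream"
echo "<ead>version 2</ead>" > example.xml
git commit -q -a -m "correcting spelling mistakes"

# The start-of-shift sync: fetch the new commits, then merge them in.
cd "$work/local"
git fetch -q upstream
git merge -q upstream/master
cat example.xml
```

After the merge, the local example.xml should contain your colleague's version 2.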

Or, if changes were made to the upstream repository while you were making changes to your fork, you can apply those changes to your local version before applying your changes by...

Using git rebase upstream/master to "rebase" or merge the upstream repository with your fork and replay your changes on top of the upstream version before pushing your changes (I know, it's getting complicated).
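
Rebasing is easier to follow when you can watch it happen. This is a rough sketch under simplifying assumptions: the folders, file names, and commit identity are hypothetical, a local folder stands in for the upstream repository, and your change and the upstream change touch different files so that no conflict arises.

```shell
set -e
export GIT_AUTHOR_NAME="Example Archivist" GIT_AUTHOR_EMAIL="archivist@example.org"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
work=$(mktemp -d)

# Hypothetical stand-in for the upstream repository, holding two finding aids.
git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/master   # name the branch "master", as in the post
echo "<ead>a</ead>" > a.xml
echo "<ead>b</ead>" > b.xml
git add .
git commit -q -m "adding example finding aids"

# Stand-in for your local copy, with the upstream remote configured.
git clone -q "$work/upstream" "$work/local"
cd "$work/local"
git remote add upstream "$work/upstream"

# You commit a change locally...
echo "<ead>a, corrected</ead>" > a.xml
git commit -q -a -m "correcting spelling mistakes in a.xml"

# ...while someone else's change to a different file is merged upstream.
cd "$work/upstream"
echo "<ead>b, updated</ead>" > b.xml
git commit -q -a -m "adding boxes to b.xml"

# Fetch the upstream commits, then replay your commit on top of them.
cd "$work/local"
git fetch -q upstream
git rebase -q upstream/master
git --no-pager log --format=%s
```

The log should show your commit sitting on top of the upstream one, as if you had made your change after syncing. When both sides have edited the same lines, the rebase stops and asks you to resolve the conflict, which this sketch deliberately avoids.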

Making Changes (with Git)

Now it's time to make changes! This happens the same way you'd make any other change to a file on your local computer--by opening the XML editor of your choice, for example, and making a change, or running a Python program. Git only gets involved when you are ready to "snapshot" files and record these snapshots on your local machine in preparation for version control (and Git, by the way, only gets involved on your local machine). 

Note: For those with some experience with GitHub, you'll notice that we aren't using different branches (e.g., a development branch and a master branch). This is because we are already using a working copy of our EADs to make changes (not the master). No branch needed! Plus, this makes the process that much easier to teach to others.

Sometimes we make small changes (such as correcting spelling mistakes, or adding or deleting boxes from a boxlist, &c., all of which happen to a single XML file). After making changes to a single file, we snapshot that file by...

Using git add [filename] to snapshot a single file in preparation for versioning.

Sometimes we make big changes (for example, adding an Authority ID attribute to <persname> elements, which changed 1386 files and 11761 <persname> elements at once) to multiple files. You can snapshot these by...

Using git add . to snapshot all files in a directory that have changed since the last commit in preparation for versioning.

Then we get them ready for versioning by...

Using git commit -m "[meaningful message]" to record file snapshots permanently in your version history.
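
Put together, the add-then-commit cycle looks something like the sketch below, which runs in a throwaway repository so nothing real is touched. The file contents, the life dates, and the commit identity are made up for illustration; only the directory name and the git commands come from our actual workflow.

```shell
set -e
export GIT_AUTHOR_NAME="Example Archivist" GIT_AUTHOR_EMAIL="archivist@example.org"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
repo=$(mktemp -d)
cd "$repo"
git init -q

# A small change to a single file: snapshot just that file, then commit.
mkdir Real_Masters_all
echo "<ead><persname>Smith, Jane</persname></ead>" > Real_Masters_all/example.xml
git add Real_Masters_all/example.xml
git commit -q -m "adding example finding aid"

# A bigger change: snapshot everything that has changed, then commit.
echo "<ead><persname>Smith, Jane, 1900-1980</persname></ead>" > Real_Masters_all/example.xml
echo "<ead><persname>Doe, John</persname></ead>" > Real_Masters_all/another.xml
git add .
git commit -q -m "adding life dates and a second finding aid"

# Each commit appears in the history with its message.
git --no-pager log --oneline
```

The log should list two commits, each with the message you gave it, which is exactly where those meaningful messages pay off.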

Note: These steps for making changes can be repeated ad nauseam. You make commits as often as you think you make a meaningful change (that you may want to go back to later). Also, those messages are important! "updates" is not nearly as helpful as "separated boxes for use with aeon".

The Finish (with GitHub)

Now it's time to get GitHub involved, both your associated personal account and our team or institutional account.

For a Team Member

Upload all local commits to your account on GitHub in order to be able to merge them with the Bentley's account by...

Using git push to "push" those commits to your online account.
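
For the curious, here is a small sketch of what a push does, using a local bare repository as a hypothetical stand-in for your fork on GitHub (the folder names, file contents, and commit identity are placeholders). With a real fork the command is the same; Git simply sends the commits over the network instead.

```shell
set -e
export GIT_AUTHOR_NAME="Example Archivist" GIT_AUTHOR_EMAIL="archivist@example.org"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
work=$(mktemp -d)

# Build a hypothetical stand-in for your fork, already holding one commit.
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/master   # name the branch "master", as in the post
echo "<ead>boxes 1-10</ead>" > example.xml
git add example.xml
git commit -q -m "adding example finding aid"
git clone -q --bare "$work/seed" "$work/my-fork.git"

# Your local clone of the fork, where you make and commit a change.
git clone -q "$work/my-fork.git" "$work/local"
cd "$work/local"
echo "<ead>boxes 1-5, boxes 6-10</ead>" > example.xml
git commit -q -a -m "separated boxes for use with aeon"

# Upload the local commit to the fork.
git push -q

# The fork's history now includes the new commit.
git --no-pager --git-dir="$work/my-fork.git" log --format=%s master
```

Until you push, your commits exist only on your own computer, so nobody else can see them or merge them.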

Finally, merge your account's version with the Bentley's version online by...

Making a pull request using GitHub.

For the Team

One of the administrators for the Bentley account will then get a notification that a pull request has been made (so called because Devon, for example, as an intern, does not have the ability to push to the main Bentley account and instead requests that an administrator pull his changes). One of the administrators compares the changes that need to be made...

Comparing the changes that need to be made. This is incredibly helpful.

Based on that comparison, they either accept the changes or, if there is some sort of conflict, give him instructions (again, all online out in the open) to, for example, rebase to get the latest version of the EADs before making his pull request, and then accept...

Devon's changes have been merged with the Bentley's account. Notice that we're told that the latest change was Dallas merging Devon's pull request, and his meaningful commit message is shown next to the Real_Masters_all folder.

Kapow! Version controlled.

So Far, So Good

While there is a bit of a learning curve to using Git and GitHub (thanks again, Justin and Sarah, as well as Greg and Fiona, the Software Carpentry folks at HASTAC who taught Dallas and me Version Control with Git!) and teaching it to others, implementing a version control system has been great! We are now able to see every change that has been made. We know who did it and when (and, ideally, why!). We even know when we're about to overwrite someone else's changes. Life is good!

All that being said, we've experienced a few hiccups along the way and we're still working out our Git-flow. We'd love to hear what you're doing for version control or your experience with Git and/or GitHub. Let us know by leaving a comment or getting in touch via email or Twitter!

[1] "Git-logo" by Jason Long. Licensed under CC BY 3.0 via Wikimedia Commons.
[2] "GitHub logo 2013" by GitHub. Licensed under Public Domain via Wikimedia Commons.
[3] "Puffer Fish DSC01257" by Brocken Inaglory. Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons.
[4] "Ocypode quadrata (Martinique)" by Free On Line Photos. Licensed under No restrictions via Wikimedia Commons.
[5] Since these screenshots were done as Devon worked, they sometimes get a bit out of order...