Monday, October 5, 2015

Tools for the programming archivist: ead manipulation with python and lxml

LXML is an awesome python tool for reading and editing xml files, and we've been using it extensively during the grant period to do programmatic cleanup to our legacy EAD files. To give an example of just how powerful the library is, late last week we ran a script to make tens of thousands of edits to all of our ~2800 EAD files, and it took all of 2 minutes to complete. This would have been an impossible task to complete manually, but lxml made it easy.

We want to share the love, so in this post we'll be walking through how we use the tool to make basic xml edits, with some exploration of the pitfalls and caveats we've encountered along the way.

Setup

Assuming you already have a version of python on your system, you'll first need to install the lxml library.

In an ideal world, that should be as easy as running "pip install lxml" from a command-line. If that doesn't work, you have a few options based on what your OS is:

  1. If you're on a mac, try "sudo pip install lxml", and type in your password when prompted.
  2. If you're on windows, you may need to run a special installer. Try this official version. You may be required to first install these two dependencies: libxml2 and libxslt.

We'll also need an ead file (or directory of files) to work on. For this demo we'll be using the ead for the UMich gargoyle collection (raw data here).

Basic usage


Parsing the ead file

First, we need to point lxml to the input ead.
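A minimal sketch of the parsing step. The file name and the sample content here are hypothetical stand-ins, written to a temporary directory so the example runs on its own; in practice you'd pass the path to your own ead file.

```python
import os
import tempfile
from lxml import etree

# Hypothetical stand-in for a real ead file, so the example is self-contained
sample_ead = """<ead>
  <archdesc level="collection">
    <did>
      <unittitle>Gargoyle records</unittitle>
      <physdesc><extent>2 linear feet</extent></physdesc>
    </did>
  </archdesc>
</ead>"""

ead_path = os.path.join(tempfile.mkdtemp(), "sample_ead.xml")
with open(ead_path, "w", encoding="utf-8") as f:
    f.write(sample_ead)

# etree.parse() reads the file and returns an element tree object
tree = etree.parse(ead_path)
root = tree.getroot()
print(root.tag)  # ead
```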

Now we have an lxml "etree" or "element tree" object to work with, and we can do all sorts of things with it. From this parent tree, we can now select individual tags or groups of tags in the ead document to perform actions on, based on just about any criteria we can come up with. To do this we'll first need to use an "xpath" search:
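A small sketch of a search for every extent tag. The sample document is made up for illustration and parsed from memory rather than from a file.

```python
import io
from lxml import etree

# Hypothetical sample data standing in for a parsed ead file
sample_ead = b"""<ead><archdesc><did>
<physdesc><extent>2 linear feet</extent><extent>1 oversize folder</extent></physdesc>
</did></archdesc></ead>"""
tree = etree.parse(io.BytesIO(sample_ead))

# Find every <extent> tag, wherever it appears in the document
extents = tree.xpath("//extent")
print(len(extents))
```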


Using xpaths

There are a few things to know about lxml's xpath function:

First, it takes input in the xpath language standard, which is a standardized way to designate exact locations within an xml file. For example, the above search returns a list of every extent tag appearing in the ead file -- the double slashes at the beginning mean that it should look for those tags anywhere in the document. If I wanted to be more specific, I could use an exact search, which would be something like "/ead/archdesc/did/physdesc/extent". We will only be going into basic xpath usage here, but the language is ridiculously powerful - if you're curious as to more advanced things you can do with it, check out this tutorial.

Second, an xpath search for tags always returns a list, even if only one result is found (or none at all). It's really easy to forget this while writing a quick script, so if you're getting errors talking about your code finding a list when it didn't expect one, that's probably the reason.

A few more xpath examples:
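The searches below are a sketch run against a made-up sample document; the tag names are standard EAD, but the content and the variable names are ours for illustration only.

```python
import io
from lxml import etree

# Hypothetical sample data
sample_ead = b"""<ead>
  <archdesc level="collection">
    <did>
      <unittitle>Gargoyle records</unittitle>
      <physdesc><extent>2 linear feet</extent></physdesc>
    </did>
    <dsc>
      <c01 level="series"><did><unittitle>Correspondence</unittitle></did></c01>
      <c01 level="file"><did><unittitle>Photographs</unittitle></did></c01>
    </dsc>
  </archdesc>
</ead>"""
tree = etree.parse(io.BytesIO(sample_ead))

# An exact path from the top of the document
exact = tree.xpath("/ead/archdesc/did/physdesc/extent")

# Every <unittitle>, at any depth
all_titles = tree.xpath("//unittitle")

# Only the <c01> tags whose level attribute is "series"
series = tree.xpath("//c01[@level='series']")

# The text inside each component-level <unittitle>
component_titles = tree.xpath("//c01/did/unittitle/text()")

# The value of the level attribute on each <c01>
levels = tree.xpath("//c01/@level")
```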


Accessing individual tags and their data

The xpath search will give us a list of results, but to look at or edit any individual tag we'll need to grab it out of the search results. Once we have an individual element (lxml's representation of the tag) we can start to access some of its data:
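A short sketch of pulling one element out of the results and looking at its data, again using hypothetical sample content.

```python
import io
from lxml import etree

# Hypothetical sample data
sample_ead = b"""<ead><archdesc><did>
<physdesc><extent>2 linear feet</extent></physdesc>
</did></archdesc></ead>"""
tree = etree.parse(io.BytesIO(sample_ead))

# xpath returns a list, so grab the first (and here only) result
extent = tree.xpath("//extent")[0]

print(extent.tag)              # the tag name
print(extent.text)             # the text inside the tag
print(extent.getparent().tag)  # the tag that contains it
```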


Tag manipulation

Ok! Now that we know how to get at subsections of the ead file, we can start doing some programmatic edits. In our experience, our edits fall into one of just a few categories of possible changes:

  • editing tag text
  • editing tag types
  • moving tags around
  • creating new tags
  • deleting old tags
  • editing attributes

We'll go through each of these and give some examples and practical tips from our own experience working with EADs at the Bentley.


Editing tag text

This is usually a fairly straightforward task, though there is one big exception when dealing with groups of inline tags. A simple example:
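A sketch of a plain text edit, assuming a hypothetical cleanup where an abbreviated unit is spelled out:

```python
from lxml import etree

# Hypothetical sample data
root = etree.fromstring(
    "<physdesc><extent>2 linear ft.</extent></physdesc>"
)

for extent in root.xpath("//extent"):
    # .text is an ordinary python string, so normal string methods apply
    extent.text = extent.text.replace("linear ft.", "linear feet")

print(etree.tostring(root, encoding="unicode"))
```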

This gets more complicated when you're dealing with a tag like the following:

<unittitle>Ann Arbor Township records, <unitdate>1991-2002</unitdate>, inclusive</unittitle>

Trying to access unittitle.text here will only return "Ann Arbor Township records, " and ignore everything afterwards (lxml stores the trailing ", inclusive" on the child element, as unitdate.tail). Editing text that is split up this way is awkward through lxml itself, so in these cases we've found it easiest to convert the whole element to a string using the etree.tostring() method, do some normal python string manipulation on the result, then convert it back into an element using etree.fromstring() and insert it back into the ead file. That looks a little like this:

Don't worry if some of that didn't make sense -- we'll be going over more of the creating, inserting, and moving elements later on.


Editing tag types

The most straight-forward of edits. Here's an example:
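A sketch of renaming tags, assuming a hypothetical cleanup where date tags inside a unittitle should become unitdate tags:

```python
from lxml import etree

# Hypothetical sample data
root = etree.fromstring(
    "<did><unittitle>Minutes, <date>1926</date></unittitle></did>"
)

for date in root.xpath("//unittitle/date"):
    # Assigning to .tag renames the element; text, attributes, and children stay put
    date.tag = "unitdate"

print(etree.tostring(root, encoding="unicode"))
```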


Editing tag attributes

Attributes are accessed by calling .attrib on the element, which returns a dictionary-like object that works much like a normal python dictionary, containing a set of keys (the attribute names) and their respective values:
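A quick sketch of reading attributes from a made-up container tag:

```python
from lxml import etree

# Hypothetical sample data
container = etree.fromstring('<container type="box" label="Box">1</container>')

# dict() gives a plain copy of the attribute names and values
attributes = dict(container.attrib)
print(attributes)

# Individual values can be read by key, or with .get() to avoid a KeyError
box_type = container.attrib["type"]
missing = container.get("altrender")  # None, since the attribute isn't there
```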

Editing the attributes is a pretty straightforward task, largely using python's various dictionary access methods:
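A sketch of the usual dictionary-style edits. The attribute values are invented for the example.

```python
from lxml import etree

# Hypothetical sample data
container = etree.fromstring('<container type="box" label="Box">1</container>')

# Change an existing attribute
container.attrib["type"] = "folder"

# Add a new attribute
container.attrib["id"] = "c1"

# Delete an attribute, checking first that it exists
if "label" in container.attrib:
    del container.attrib["label"]

print(etree.tostring(container, encoding="unicode"))
```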


Deleting tags

Here you will need to access the parent tag of the tag to be deleted using the element's .getparent() method:
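A sketch of a deletion, assuming we want to drop every note tag from a made-up did:

```python
from lxml import etree

# Hypothetical sample data
root = etree.fromstring(
    "<did><unittitle>Minutes</unittitle><note><p>Delete me</p></note></did>"
)

for note in root.xpath("//note"):
    # An element can't delete itself; its parent has to remove it
    note.getparent().remove(note)

print(etree.tostring(root, encoding="unicode"))
```

One thing to watch for: as far as we can tell, .remove() also takes along any text that directly follows the removed tag (its .tail), so check for trailing text before deleting inline tags.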


Creating tags

There are two primary ways of going about this - one long and verbose, and the other a kind of short-hand built in to lxml. We'll do the long way first:
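A sketch of the verbose approach, building a physdesc with one extent inside it and attaching it to a made-up did:

```python
from lxml import etree

# Hypothetical sample data
did = etree.fromstring("<did><unittitle>Minutes</unittitle></did>")

# Create a standalone element
physdesc = etree.Element("physdesc")

# Create a child element attached to it, then fill in its text
extent = etree.SubElement(physdesc, "extent")
extent.text = "1 folder"

# Attach the new tags to the existing document
did.append(physdesc)

print(etree.tostring(did, encoding="unicode"))
```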

The alternate method is to use lxml's element builder tool. This is what that would look like:
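A sketch using the E factory from lxml.builder. The altrender value is invented for the example.

```python
from lxml import etree
from lxml.builder import E

# Nested calls mirror the nesting of the tags; positional arguments become
# text or children, and keyword arguments become attributes
physdesc = E.physdesc(
    E.extent("1 folder", altrender="carrier")
)

print(etree.tostring(physdesc, encoding="unicode"))
```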


Moving tags around

The easiest way to do this is to treat the element objects as if they were a python list. Just like python's normal list methods, etree elements can use .insert, .append, .index, or .remove. The only gotcha to keep in mind is that lxml never copies elements when they are moved -- the singular element itself is removed from where it was and placed somewhere else. Here's a move in action:
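A sketch of a move, assuming a made-up document where a physdesc sits outside the did and needs to go inside it, right after the unittitle:

```python
from lxml import etree

# Hypothetical sample data
root = etree.fromstring(
    "<archdesc><did><unittitle>Minutes</unittitle></did>"
    "<physdesc><extent>1 folder</extent></physdesc></archdesc>"
)

did = root.xpath("//did")[0]
unittitle = did.xpath("unittitle")[0]
physdesc = root.xpath("/archdesc/physdesc")[0]

# Inserting an element that already lives in the tree moves it:
# it disappears from its old spot and shows up in the new one
did.insert(did.index(unittitle) + 1, physdesc)

print(etree.tostring(root, encoding="unicode"))
```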


Saving the results

Once you've made all the edits you want, you'll need to write the new ead data to a file. The easiest way we've found to do this is using the etree.tostring() method, but there are a few important caveats to note. .tostring() takes a few optional arguments you will want to be sure to include: to keep your original xml declaration you'll need to set xml_declaration=True, and to keep a text encoding statement, you'll need encoding="utf-8" (or whatever encoding you're working with):
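A minimal sketch of writing the results out. The output file name is hypothetical and goes to a temporary directory here so the example runs on its own.

```python
import os
import tempfile
from lxml import etree

# Hypothetical sample data standing in for an edited ead tree
root = etree.fromstring(
    "<ead><archdesc><did><unittitle>Minutes</unittitle></did></archdesc></ead>"
)
tree = etree.ElementTree(root)

# With an encoding given, tostring() returns bytes, declaration included
xml_bytes = etree.tostring(tree, xml_declaration=True, encoding="utf-8")

# Write in binary mode, since the data is already encoded
out_path = os.path.join(tempfile.mkdtemp(), "output_ead.xml")
with open(out_path, "wb") as f:
    f.write(xml_bytes)
```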

We can also pretty-print the results, which will help ensure the ead file has well-formed indentation, and is generally not an incomprehensible mess of tags. Because of some oddities in the way lxml handles tag spacing, to get pretty-print to work you'll need to add one extra step to the input file parsing process:
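A sketch of that extra step: a parser created with remove_blank_text=True, passed to etree.parse(), followed by pretty_print=True when writing out. The sample data is made up and parsed from memory rather than from a file.

```python
import io
from lxml import etree

# Hypothetical sample data with messy spacing between tags
messy_ead = b"<ead>\n<archdesc><did>\n\n<unittitle>Minutes</unittitle></did></archdesc></ead>"

# The parser drops whitespace-only text between tags, which lets
# pretty_print lay out the indentation from scratch
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(io.BytesIO(messy_ead), parser)

pretty = etree.tostring(
    tree, pretty_print=True, xml_declaration=True, encoding="utf-8"
)
print(pretty.decode("utf-8"))
```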

Note that the new parser will effectively remove all whitespace (spaces and newlines) between tags, which can cause problems if you have any complicated tag structure. We had some major issues with this early on, and ended up writing our own custom pretty-printing code on top of what is already in lxml. It ensures that inline tags keep proper spacing (as in, <date>1926,</date> <date>1965</date> doesn't become <date>1926,</date><date>1965</date>), and prevents other special cases like xml lists from collapsing into big blocks of tags. Anyone is welcome to use or adapt what we've written - check it out here!


Thanks for reading along! We've found lxml to be indispensable in our cleanup work here at the Bentley, and we hope you'll find it useful as well. And if you have any thoughts or use any other tools in your own workflows we'd love to hear about them -- let us know in the comments below!

1 comment:

  1. Super grateful for this post right now. I need to build a CSV with the filenames, titles, originators, and scopecontents of maybe 3,000 EAD files. I'm good at writing python scripts to make/do stuff but hadn't with XML before. I know lxml could help me with this but I was feeling a little overwhelmed...so these samples are great.
