Monday, October 5, 2015

Tools for the programming archivist: ead manipulation with python and lxml

LXML is an awesome python tool for reading and editing xml files, and we've been using it extensively during the grant period to do programmatic cleanup to our legacy EAD files. To give an example of just how powerful the library is, late last week we ran a script to make tens of thousands of edits to all of our ~2800 EAD files, and it took all of 2 minutes to complete. This would have been an impossible task to complete manually, but lxml made it easy.

We want to share the love, so in this post we'll be walking through how we use the tool to make basic xml edits, with some exploration of the pitfalls and caveats we've encountered along the way.

Setup

Assuming you already have a version of python on your system, you'll first need to install the lxml library.

In an ideal world, that should be as easy as running "pip install lxml" from a command-line. If that doesn't work, you have a few options based on what your OS is:

  1. If you're on a mac, try "sudo pip install lxml", and type in your password when prompted.
  2. If you're on windows, you may need to run a special installer. Try this official version. You may be required to first install these two dependencies: libxml2 and libxslt.

We'll also need an ead file (or directory of files) to work on. For this demo we'll be using the ead for the UMich gargoyle collection (raw data here).

Basic usage


Parsing the ead file

First, we need to point lxml to the input ead.

# we're only going to use the etree module (short for "element tree")
from lxml import etree
tree = etree.parse("path/to/gargoyle.xml") # replace the path text with your own filesystem path

Now we have an lxml "etree" or "element tree" object to work with, and we can do all sorts of things with it. From this parent tree, we can now select individual tags or groups of tags in the ead document to perform actions on, based on just about any criteria we can come up with. To do this we'll first need to use an "xpath" search:

extents = tree.xpath("//extent")

Using xpaths

There are a few things to know about lxml's xpath function:

First, it takes input in the xpath language standard, which is a standardized way to designate exact locations within an xml file. For example, the above search returns a list of every extent tag appearing in the ead file -- the double slashes at the beginning mean that it should look for those tags anywhere in the document. If I wanted to be more specific, I could use an exact search, which would be something like "/ead/archdesc/did/physdesc/extent". We will only be going into basic xpath usage here, but the language is ridiculously powerful - if you're curious as to more advanced things you can do with it, check out this tutorial.

Second, an xpath search always returns a list, even if only one result is found. It's really easy to forget this while writing a quick script, so if you're getting errors talking about your code finding a list when it didn't expect one, that's probably the reason.
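
That list can also be empty, so a small guard saves a lot of IndexError headaches. Here is a minimal sketch, using a made-up inline EAD fragment rather than the gargoyle file; the first_or_none helper is our own illustration and not part of lxml:

```python
from lxml import etree

# a tiny made-up EAD fragment, just for illustration
sample = etree.fromstring(
    "<ead><archdesc><did><physdesc>"
    "<extent>3 linear feet</extent>"
    "</physdesc></did></archdesc></ead>"
)

def first_or_none(element, path):
    """Return the first xpath match, or None when nothing matches."""
    results = element.xpath(path)
    return results[0] if results else None

extent = first_or_none(sample, "//extent")
missing = first_or_none(sample, "//unitdate")

# compare with None explicitly: an element with no child tags is "falsy" in lxml
extent_text = extent.text if extent is not None else ""
missing_text = missing.text if missing is not None else ""
```

Here extent_text holds the extent's text, while missing_text falls back to an empty string because the fragment has no unitdate.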

A few more xpath examples:

# to find all unitid elements whose parent is a did tag:
tree.xpath("//did/unitid")
# using an absolute path to find exact locations:
tree.xpath("/ead/archdesc/did/physdesc/extent")
## if there are multiple "extent" tags in the parent physdesc,
## you can find specific tags by designating an index
## unlike python lists, xpath indexes start at 1, not zero
# returns a list containing the first extent in the parent physdesc
tree.xpath("/ead/archdesc/did/physdesc/extent[1]")
## You can also facet by tag attributes:
# finding all extent tags with an encodinganalog attribute
tree.xpath("//extent[@encodinganalog]")
# finding all container tags with a type attribute whose value is "box"
tree.xpath("//container[@type='box']")

Accessing individual tags and their data

The xpath search will give us a list of results, but to look at or edit any individual tag we'll need to grab it out of the search results. Once we have an individual element (lxml's representation of the tag) we can start to access some of its data:

# grab a single extent. Remember xpath returns a list, so we have to specify an individual element
>>> extent = tree.xpath("//extent")[0]
# now we have an "element" object, in this case an extent tag
# we can access all sorts of stuff from here.
# to get the text contained in the tag:
>>> extent.text
'3 linear feet and 1 outsize box'
# to get the tag type:
>>> extent.tag
'extent'
# to get a dictionary containing all the attributes:
>>> extent.attrib
{'encodinganalog': '300'}
# to get just one attribute from the extent
>>> extent.attrib["encodinganalog"]
'300'
# the previous example will give an error if the tag doesn't have the specified attribute
# for a more resilient option, use the dictionary .get method instead
>>> extent.attrib.get("encodinganalog", "")
'300'
# to get the parent element of the extent:
>>> parent = extent.getparent()

Tag manipulation

Ok! Now that we know how to get at subsections of the ead file, we can start doing some programmatic edits. In our experience, our edits fall into one of just a few categories of possible changes:

  • editing tag text
  • editing tag types
  • moving tags around
  • creating new tags
  • deleting old tags
  • editing attributes

We'll go through each of these and give some examples and practical tips from our own experience working with EADs at the Bentley.


Editing tag text

This is usually a fairly straightforward task, though there is one big exception when dealing with groups of inline tags. A simple straightforward example:

>>> extent.text
'3 linear feet and 1 outsize box'
# we don't use "outsize", so we'll make it "oversize" instead
>>> extent.text = extent.text.replace(" outsize ", " oversize ")
>>> extent.text
'3 linear feet and 1 oversize box'

This gets more complicated when you're dealing with a tag like the following:

<unittitle>Ann Arbor Township records, <unitdate>1991-2002</unitdate>, inclusive</unittitle>

Trying to access unittitle.text here will only return "Ann Arbor Township records, " and ignore everything afterwards: lxml stores the text that follows an inline tag (here ", inclusive") on that child tag's .tail attribute rather than on the parent. That makes edits spanning the whole element awkward through lxml itself, so in these cases we've found it easiest to just convert the whole element to a string using the etree.tostring() method, do some normal python string manipulation on that result, then convert it back into an element using etree.fromstring() and insert it back into the ead file. That looks a little like this:

# create a text representation of the element
# etree.tostring() spits out all text, including tags and subtags
# encoding="unicode" returns a regular string rather than bytes,
# and with_tail=False leaves out any text trailing the closing tag
text = etree.tostring(element_with_complex_text, encoding="unicode", with_tail=False)
# do the manipulation
text = text.replace("Prince", "Artist formerly known as Prince")
# transform that text back into an lxml element
new_element = etree.fromstring(text)
# carry over any text that trailed the old element
new_element.tail = element_with_complex_text.tail
# get the index of the old element from its parent
parent = element_with_complex_text.getparent()
index = parent.index(element_with_complex_text)
# remove the old element
parent.remove(element_with_complex_text)
# insert the new one
parent.insert(index, new_element)

Don't worry if some of that didn't make sense -- we'll be going over more of the creating, inserting, and moving elements later on.
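
For simpler cases it can help to know where lxml keeps the rest of the text. The sketch below reuses the unittitle from above as an inline string; swapping "inclusive" for "bulk" is just a made-up edit for illustration:

```python
from lxml import etree

unittitle = etree.fromstring(
    "<unittitle>Ann Arbor Township records, "
    "<unitdate>1991-2002</unitdate>, inclusive</unittitle>"
)

# .text only holds the text before the first child tag
before = unittitle.text

# the text after </unitdate> is stored on the unitdate element as its "tail"
unitdate = unittitle[0]
after = unitdate.tail

# itertext() walks every piece of text in document order
full_text = "".join(unittitle.itertext())

# tails can be edited in place, just like .text
unitdate.tail = unitdate.tail.replace("inclusive", "bulk")
edited = etree.tostring(unittitle, encoding="unicode")
```

This works well when you know which piece of text you're after; the string round-trip above is still handy when a replacement might span several inline tags.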


Editing tag types

The most straight-forward of edits. Here's an example:

>>> extent.tag
'extent'
>>> extent.tag = "physfacet"
>>> extent.tag
'physfacet'

Editing tag attributes

Attributes are accessed by calling .attrib on the element, which returns a python dictionary containing a set of keys (the attribute names) and their respective values:

>>> container = tree.xpath("//container")[0]
>>> container.attrib
{'type': 'box', 'label': 'Box'}

Editing the attributes is a pretty straightforward task, largely using python's various dictionary access methods:

# accessing attribute values:
>>> container.attrib.get("type", "")
"box"
>>> container.attrib.get("location", "")
"" # since the "location" attribute does not exist, the .get() function returns an empty string
# changing a current value or creating a new attribute
>>> container.attrib["type"] = "folder"
# creating a new attribute only if the attribute does not already exist:
>>> container.attrib["label"] = container.attrib.get("label", "Folder")
# deleting an attribute
>>> del container.attrib["label"]
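
The same dictionary methods work inside a loop, which is how most of our bulk cleanup goes. A minimal sketch on a made-up fragment; the lowercase rule and the "unknown" fallback label are placeholders of ours, not a recommendation:

```python
from lxml import etree

# made-up container tags with inconsistent attributes
sample = etree.fromstring(
    "<dsc>"
    '<container type="Box" label="Box">1</container>'
    '<container type="folder">2</container>'
    "<container>3</container>"
    "</dsc>"
)

for container in sample.xpath("//container"):
    # lowercase existing type values; skip tags with no type attribute
    if "type" in container.attrib:
        container.attrib["type"] = container.attrib["type"].lower()
    # fill in a label only where one is missing
    if "label" not in container.attrib:
        container.attrib["label"] = container.attrib.get("type", "unknown").title()

types = [c.get("type") for c in sample.xpath("//container")]
labels = [c.get("label") for c in sample.xpath("//container")]
```

Note that element.get() is a shortcut for element.attrib.get(), and returns None when the attribute is missing.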

Deleting tags

Here you will need to access the parent tag of the tag to be deleted using the element's .getparent() method:

>>> parent = extent.getparent()
>>> print(etree.tostring(parent)) # printing out the parent just for comparison purposes
'''
<physdesc altrender="whole">
<extent encodinganalog="300">3 linear feet and 1 oversize box</extent>
</physdesc>
'''
>>> parent.remove(extent)
>>> print(etree.tostring(parent))
'''
<physdesc altrender="whole">
</physdesc>
'''
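
One thing to watch for with inline tags: lxml treats the text following a tag (its tail) as part of that tag, so parent.remove() takes that text along with it. If you want to keep it, you have to hand it off first. A sketch of one way to do that; remove_keep_tail is a hypothetical helper of ours, not an lxml function:

```python
from lxml import etree

def remove_keep_tail(element):
    """Remove element from its parent but keep any text that followed it."""
    parent = element.getparent()
    tail = element.tail or ""
    previous = element.getprevious()
    if previous is not None:
        # hand the trailing text to the previous sibling
        previous.tail = (previous.tail or "") + tail
    else:
        # no previous sibling: the text belongs to the parent
        parent.text = (parent.text or "") + tail
    parent.remove(element)

unittitle = etree.fromstring(
    "<unittitle>Records, <unitdate>1991-2002</unitdate>inclusive</unittitle>"
)
remove_keep_tail(unittitle.xpath("//unitdate")[0])
result = etree.tostring(unittitle, encoding="unicode")
```

After the call, the unitdate is gone but the word "inclusive" stays in the unittitle.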

Creating tags

There are two primary ways of going about this - one long and verbose, and the other a kind of short-hand built in to lxml. We'll do the long way first:

# make a new, empty tag
>>> new_tag = etree.Element("extent")
# add some text
>>> new_tag.text = "25 embarrassing photos"
# add an attribute if you want any
>>> new_tag.attrib["encodinganalog"] = "300"
# insert the new tag into the master ead tree
# we'll use the parent element from last time
>>> parent.append(new_tag)
>>> print(etree.tostring(parent))
'''
<physdesc altrender="whole">
<extent encodinganalog="300">25 embarrassing photos</extent>
</physdesc>
'''

The alternate method is to use lxml's element builder tool. This is what that would look like:

# import the part of the library you'll need
>>> from lxml.builder import E
# the basic format for using E is:
# E.[name of tag]([tag text], [attribute name]=[attribute value], [anything else that comes inside the tag])
# A single-tag example:
>>> new_element = E.extent("25 photographs", encodinganalog="300")
# printing to see the results
>>> print(etree.tostring(new_element))
'<extent encodinganalog="300">25 photographs</extent>'
# Complex tag-building with the E tool:
>>> element = E.did(
...     E.unittitle("Ann Arbor Superheroes"),
...     E.container("Box 1", type="box"),
...     E.physdesc(
...         E.extent("25 photographs", encodinganalog="300"),
...         altrender="whole"
...     )
... )
# the results
>>> print(etree.tostring(element, pretty_print=True))
'''
<did>
  <unittitle>Ann Arbor Superheroes</unittitle>
  <container type="box">Box 1</container>
  <physdesc altrender="whole">
    <extent encodinganalog="300">25 photographs</extent>
  </physdesc>
</did>
'''

Moving tags around

The easiest way to do this is to treat the element objects as if they were a python list. Just like python's normal list methods, etree elements can use .insert, .append, .index, or .remove. The only gotcha to keep in mind is that lxml never copies elements when they are moved -- the singular element itself is removed from where it was and placed somewhere else. Here's a move in action:

# say we have a c01 element that looks like this:
'''
<c01>
[...skipping <did> tag for brevity]
<c02>
...
<note>This collection is haunted</note>
</c02>
</c01>
'''
# if we wanted to move the note up to the parent c0x-level element,
# this is how we'd do it
# grab the note element
# (we're assuming we already have the c01 element assigned to a variable "c01")
# the leading "." keeps the search inside c01; a bare "//note" would search the whole document
>>> note = c01.xpath(".//note")[0]
# just appending the note to the end of the c01 element will do the trick:
>>> c01.append(note)
>>> print(etree.tostring(c01))
'''
<c01>
...
<c02>
...
</c02>
<note>This collection is haunted</note>
</c01>
'''
# if you want more location options than just the end of the element,
# use insert() with the index of the location you want to move things to:
>>> c01.insert(0, note)
>>> print(etree.tostring(c01))
'''
<c01>
<note>This collection is haunted</note>
...
<c02>
...
</c02>
</c01>
'''
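
If you actually do want the same tag in two places, you'll need to copy it yourself. A minimal sketch using python's standard copy module on a made-up c01; the two-c02 structure is just for illustration:

```python
import copy
from lxml import etree

c01 = etree.fromstring(
    "<c01><c02><note>This collection is haunted</note></c02><c02/></c01>"
)
first_c02, second_c02 = c01.xpath("./c02")
note = first_c02.xpath("./note")[0]

# appending a deep copy leaves the original note where it was
second_c02.append(copy.deepcopy(note))
count_after_copy = len(c01.xpath(".//note"))

# appending the element itself moves it out of the first c02
second_c02.append(note)
notes_in_first = len(first_c02.xpath("./note"))
notes_in_second = len(second_c02.xpath("./note"))
```

After the copy there are two notes in the c01; after the move, both of them sit in the second c02 and the first has none.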

Saving the results

Once you've made all the edits you want, you'll need to write the new ead data to a file. The easiest way we've found to do this is using the etree.tostring() method, but there are a few important caveats to note. .tostring() takes a few optional arguments you will want to be sure to include: to keep your original xml declaration you'll need to set xml_declaration=True, and to keep a text encoding statement, you'll need encoding="utf-8" (or whatever encoding you're working with):

import os
# turn the root etree into a string
# (with an encoding given, tostring() returns bytes)
ead_text = etree.tostring(tree, xml_declaration=True, encoding="utf-8")
# write the string to your new EAD file
# the directory path and filename can be whatever you'd like
# mode="wb" because we're writing bytes, not text
with open(os.path.join("path/to/ead/directory", "new_ead.xml"), mode="wb") as f:
    f.write(ead_text)

We can also pretty-print the results, which will help ensure the ead file has well-formed indentation, and is generally not an incomprehensible mess of tags. Because of some oddities in the way lxml handles tag spacing, to get pretty-print to work you'll need to add one extra step to the input file parsing process:

import os
# create a new parser object
parser = etree.XMLParser(remove_blank_text=True)
# when we parse the input file, add a parser argument
tree = etree.parse("path/to/ead.xml", parser=parser)
# [... making edits ...]
# when making the output string, add a new pretty_print argument
ead_text = etree.tostring(tree, pretty_print=True, xml_declaration=True, encoding="utf-8")
# write the file (mode="wb" because tostring() returned bytes)
with open(os.path.join("path/to/ead/directory", "ead_filename.xml"), mode="wb") as f:
    f.write(ead_text)

Note that the new parser will effectively remove all whitespace (spaces and newlines) between tags, which can cause problems if you have any complicated tag structure. We had some major issues with this early on, and ended up writing our own custom pretty-printing code on top of what is already in lxml. It ensures that inline tags keep proper spacing (as in, <date>1926,</date> <date>1965</date> doesn't become <date>1926,</date><date>1965</date>) and prevents other special cases like xml lists from collapsing into big blocks of tags. Anyone is welcome to use or adapt what we've written - check it out here!
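
Putting parsing, editing, and saving together, a batch run over a directory of files can be sketched roughly like this. The function names and the "outsize" fix are illustrative stand-ins, not the script we actually ran, and the sketch uses tree.write(), which serializes and saves in one call:

```python
import os
from lxml import etree

def fix_outsize(tree):
    """Example edit: replace "outsize" with "oversize" in every extent."""
    for extent in tree.xpath("//extent"):
        if extent.text:
            extent.text = extent.text.replace("outsize", "oversize")

def clean_directory(input_dir, output_dir):
    """Parse, edit, and re-save every .xml file found in input_dir."""
    for filename in sorted(os.listdir(input_dir)):
        if not filename.endswith(".xml"):
            continue
        tree = etree.parse(os.path.join(input_dir, filename))
        fix_outsize(tree)
        # tree.write() takes the same declaration and encoding options as tostring()
        tree.write(
            os.path.join(output_dir, filename),
            xml_declaration=True,
            encoding="utf-8",
        )
```

Writing to a separate output directory, as assumed here, keeps the originals untouched so you can compare before and after.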


Thanks for reading along! We've found lxml to be indispensable in our cleanup work here at the Bentley, and we hope you'll find it useful as well. And if you have any thoughts or use any other tools in your own workflows we'd love to hear about them -- let us know in the comments below!

1 comment:

  1. Super grateful for this post right now. I need to build a CSV with the filenames, titles, originators, and scopecontents of maybe 3,000 EAD files. I'm good at writing python scripts to make/do stuff but hadn't with XML before. I know lxml could help me with this but I was feeling a little overwhelmed...so these samples are great.
