Thursday, June 25, 2015

Digital Preservation Cribs, BHL Edition

Last week, we "sat down" with Abbey Potter at the Library of Congress to kick off a new series of interviews run by the National Digital Stewardship Alliance (NDSA) Infrastructure Working Group. Modeled in part after MTV Cribs, a show that featured (actually, still features!) tours of the houses and mansions of celebrities, the "Digital Preservation Infrastructure Tours" series asks individuals to answer questions about their organization and the technologies and tools they use (i.e., their houses and mansions) to serve as case studies in digital preservation systems.

"Here comes the rooster digital preservation system!"

So... welcome! The post went up yesterday on the Signal!

Friday, June 19, 2015

Normalizing Dates with OpenRefine

A previous post discussed some of the work we've been doing to normalize the hundreds of thousands of dates in our EADs in preparation for ArchivesSpace migration. Specifically, the post detailed why such normalization is desirable and how we wrote a Python script to automate the process of normalizing 75% of our dates. But what about the remaining 25% that did not conform to the standard and easily-normalizable forms of dates that we identified?

Enter OpenRefine

We've mentioned the prominent role that OpenRefine has taken in our legacy metadata cleanup on the blog before, but have not yet gone into detail about how we've integrated it into our legacy EAD cleanup project. Before we get into the specifics of using OpenRefine, we'll first need to get our dates into a format that is easy to work with in the tool.

OpenRefine supports TSV, CSV, *SV, Excel (.xls and .xlsx), JSON, XML, RDF as XML, and Google Data documents. Although EADs are XML documents, we did not want to have to normalize dates in each XML document individually in OpenRefine. Instead, we opted to output all of our non-normalized dates to a single CSV using Python, so that we could clean up our dates in OpenRefine in spreadsheet format.

The following Python script is what we've been using to output the non-normalized dates to a CSV:

This script outputs the following information for each non-normalized date to each row in the CSV: the filename of the EAD, the XPath (the unique position in the XML) of the element, and the text of the element. This is sufficient information for us to normalize the date and to then insert the normalized version back into the EAD as a 'normal' attribute.
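A minimal sketch of such a script (assuming, hypothetically, that the EADs live in a single directory and that non-normalized dates are `unitdate` elements lacking a `normal` attribute) might look like this:

```python
# Sketch: write filename, XPath, and text for every unitdate lacking @normal.
import csv
import os
import xml.etree.ElementTree as ET

def xpath_of(root, target):
    """Build a positional XPath (e.g. /ead/unitdate[2]) for a target element."""
    parents = {child: parent for parent in root.iter() for child in parent}
    parts = []
    node = target
    while node is not root:
        parent = parents[node]
        siblings = [c for c in parent if c.tag == node.tag]
        parts.append('%s[%d]' % (node.tag, siblings.index(node) + 1))
        node = parent
    parts.append(root.tag)
    return '/' + '/'.join(reversed(parts))

def write_unnormalized_dates(ead_dir, csv_path):
    """Output one CSV row per non-normalized date across all EADs."""
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        for filename in sorted(os.listdir(ead_dir)):
            if not filename.endswith('.xml'):
                continue
            root = ET.parse(os.path.join(ead_dir, filename)).getroot()
            for unitdate in root.iter('unitdate'):
                if 'normal' not in unitdate.attrib:
                    writer.writerow([filename, xpath_of(root, unitdate),
                                     (unitdate.text or '').strip()])
```

Here `xpath_of` builds positional XPaths using only the standard library; lxml's `getpath()` provides the same thing out of the box.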

Once the dates are in a CSV, a project can be created in OpenRefine, and the date normalization can begin.

Cleaning Data in OpenRefine: First Steps

The first step I like to take when working on a spreadsheet in OpenRefine is to create a working column for the data that I'll be working with, so that I can reference what the data looked like when it entered OpenRefine and compare that to the changes that I have made. Since I'll be working with the data in Column 3, all I have to do is click the triangle next to the Column 3 header, and select "Add column based on this column..." from the "Edit column" drop down menu. By default, OpenRefine will prompt you to create a new column with the same values as the original. Enter a name for the new column (I'll be calling mine 'expression' since that's what the plain text form of the date is called in ArchivesSpace) and click OK.

One of the first things I noticed about the contents of the CSV was that there were forms of dates (such as 1820s-1896) that should have been normalized with the Python script that we used to automate the normalization process. Upon further inspection, the reason these dates had not been normalized was that there was some leading whitespace between the tag and the date text. OpenRefine makes getting rid of leading (and trailing) whitespace easy: click on the triangle next to the column name and select "Trim leading and trailing whitespace" from the "Common transforms" drop down under the "Edit cells" menu.

This transformation removed leading and trailing whitespace on 8,088 cells, which will make working with them much simpler.

Another thing we came to find out after working with our data for a while is that some fields in our EADs have random extra spaces or line breaks within them. It's safe to assume (at least in the case of dates) that those extra spaces and/or line breaks are accidental. To remove them, and make the data even easier to work with, choose "Transform" from the "Edit cells" menu and enter the following:
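A GREL expression along these lines collapses runs of whitespace into single spaces (a sketch; GREL's `replace()` accepts Java-style regular-expression literals):

```
value.replace(/\s+/, ' ')
```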

This will replace all instances of multiple spaces with a single space, making splitting columns, faceting, filtering, and so on much more manageable going forward. In our case, this transformation removed unnecessary duplicate spacing in about 1,000 cells.

Normalizing Dates

The best way we've found to work with cleaning data in OpenRefine is to break the data down into smaller, uniform groups, rather than overwhelm ourselves by trying to normalize all 85,000 dates at once. OpenRefine's faceting and filtering capabilities allow us to identify all instances of a very particular form of date and then normalize all instances of that form. However, narrowing down to particular forms of dates requires first surveying the data that we have. Taking a quick look through some pages of our OpenRefine project shows that we have dates of the following forms:

May 1909-July 1968
January 24, 1960
ca. 1966
October 24-28, 1976
September-November 1986
l908 [yes, that's the letter L]
½0/2002 [yes, that's a fraction]
Photographs [no, that is not actually a date]

And on and on and on. There is no way for us to automate the normalization of all of these various forms of dates, but with OpenRefine we can at least break them down into manageable and uniform chunks.

For example, let's say we want to normalize all dates like "May 1909-July 1968," or of the form "Month YYYY-Month YYYY." First we click the triangle next to the column name 'expression', and select "Text filter" from the drop down menu. This gives us a text box to enter either exact text or a regular expression to match on. Almost all of the filtering we do is with regular expressions; it's worth becoming familiar with them. The form of date we're after (at least for these purposes) can be expressed in a regular expression as follows:
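A pattern of roughly this shape will match (a sketch: anchored to exclude longer strings, and not accounting for abbreviated or misspelled month names, which get their own passes):

```
^[A-Za-z]+ \d{4}-[A-Za-z]+ \d{4}$
```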

Filtering on that regular expression gives us 5,027 matching rows to work with. 

Ultimately, what we'll want to end up with for this particular form of date is a normalized version of the form YYYY-MM/YYYY-MM (our example from earlier, for instance, should end up as 1909-05/1968-07). To make this a little more manageable, we can use OpenRefine to split our existing column into four separate columns (begin month, begin year, end month, end year), transform each of those values into the desired format, and then rejoin the columns in the proper order and with the correct punctuation. To split the column into four columns, do the following:

  1. Click the triangle next to the column name and select "Split into several columns..." from the "Edit column" drop down.
  2. In the pop up box, enter "-" (a single dash, no quotes) as the separator and uncheck the boxes for "Guess cell type" and "Remove this column." The column should now be split into a beginning Month YYYY and an ending Month YYYY (keeping up with our example, we should now have one column for May 1909 and another for July 1968).
  3. Split each of these new columns into several columns, this time entering " " (a single space, no quotes) as the separator and leaving the "Remove this column" box checked. 
  4. Rename the resulting columns something short and easily identifiable. I've renamed mine "bm," "by," "em," and "ey" (for begin month, begin year, end month, and end year respectively).

Our spreadsheet now looks like this:

The next step is to make sure that some of the assumptions we've made about our data so far are correct, in particular the assumption that all of the strings of letters we captured with our regular expression signify months. To find this out, again click the triangle next to the column names for months (in my case, 'bm' and 'em') and select "Text facet" from the "Facet" drop down. A list of all values and the number of times they appear in that column will now display on the left of the project. In this example, the text facet has revealed that our column containing months has 19 values (i.e., 7 more than we would expect), mostly due to abbreviations. Each facet can be clicked on or edited from within the facet pane, making it very simple to change all instances of, for example, "Aug" to "August" at once.

Once the column values contain only the 12 correct and complete spellings of months that we're after, the next step is to transform all of the alphabetical representations of months into their respective numeric representations. To do this, select "Transform" from the "Edit cells" drop down within the columns menu and enter the following replacement formula:

This is a change I make quite frequently, so I have this transformation formula saved for easy reference. Here it is as plain text for easy copying and pasting:

value.replace('January', '01').replace('February','02').replace('March', '03').replace('April', '04').replace('May', '05').replace('June', '06').replace('July','07').replace('August', '08').replace('September','09').replace('October', '10').replace('November','11').replace('December', '12')

Once we do that with our begin and end month columns, we're ready to join our months and years back together as the final normalized version. Open the drop down menu for any column and select "Add column based on this column..." from the "Edit column" drop down. Enter the name for the new column (I'll call mine "normal"), and enter the following in the "Expression" field:

Again, for easy copying and pasting: 

cells['by'].value + '-' + cells['bm'].value + '/' + cells['ey'].value + '-' + cells['em'].value

This string merges the values of the four separate columns in the proper order and separated by the proper punctuation to form an ArchivesSpace acceptable normalized date.
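For illustration, the same split/replace/join logic can be sketched outside OpenRefine in a few lines of Python (hypothetical, not part of our actual workflow):

```python
# Illustrative Python version of the OpenRefine steps:
# "May 1909-July 1968" -> "1909-05/1968-07"
MONTHS = {'January': '01', 'February': '02', 'March': '03', 'April': '04',
          'May': '05', 'June': '06', 'July': '07', 'August': '08',
          'September': '09', 'October': '10', 'November': '11',
          'December': '12'}

def normalize(expression):
    """Normalize a 'Month YYYY-Month YYYY' date to 'YYYY-MM/YYYY-MM'."""
    begin, end = expression.split('-')   # "May 1909", "July 1968"
    bm, by = begin.split(' ')            # begin month, begin year
    em, ey = end.split(' ')              # end month, end year
    return '%s-%s/%s-%s' % (by, MONTHS[bm], ey, MONTHS[em])
```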

As laid out in this post, the process of normalizing dates in OpenRefine can seem somewhat tedious. However, once you get the hang of a few simple functions, you can use them to isolate, clean, and normalize large chunks of data fairly quickly. We will undoubtedly talk about additional work we've done using OpenRefine in future posts, but in the meantime the OpenRefine wiki and documentation on GitHub are an excellent resource for further instruction about the powerful things the tool can do. In the coming weeks we'll also be detailing how we take our cleaned-up data from OpenRefine and reinsert it back into our EADs. Stay tuned!

Monday, June 15, 2015

The Work of Appraisal in the Age of Digital Reproduction

With apologies to Walter Benjamin, I would like to reflect on some of the challenges and strategies associated with the appraisal of digital archives that we've faced here at the Bentley Historical Library.  The following discussion will highlight current digital archives appraisal techniques employed by the Bentley, many of which we are hoping to integrate into the forthcoming Archivematica Appraisal and Arrangement tab.

Foundations and Principles

In working with digital archives, the Bentley seeks to apply the same archival principles that inform our handling of physical collections, with added steps to ensure the authenticity, integrity, and security of content.

By and large, appraisal tends to be an iterative process as we seek to understand the intellectual content and scope of materials to determine if they should be retained as part of our permanent collections.  If we're really lucky, curation staff and/or field archivists might be able to review content (or a sample thereof) prior to its acquisition and accession, a process that helps us pinpoint the materials we are interested in and avoid the transfer of content that we have identified as out of scope or superfluous.

This pre-accession appraisal may not be possible for various reasons (technical issues, geographic distance, scheduling conflicts, etc.), but in the vast majority of cases, we have some level of understanding about the nature of digital content and its relationship to our collecting policy by the time it's received, whether from a high-level overview or an item-level description in a spreadsheet.

Whatever the case, appraisal is a crucial part of our ingest workflow, as it helps us to:
  • Establish basic intellectual control of the content, directory structure, and/or original storage environment to facilitate the arrangement and description of content.
  • Identify content that should be included in our permanent collections as well as superfluous or out-of-scope materials that will be separated (deaccessioned).
  • Determine potential preservation issues posed by unique file formats, content dependencies, or other hardware/software issues.
  • Address copyright or other intellectual property issues by applying appropriate access/use restrictions.
  • Discover and verify the presence of sensitive personally identifiable information such as Social Security and credit card numbers.
As we strive to employ More Product, Less Process (MPLP) strategies to the greatest extent possible, it is important to use tools and strategies that avoid inefficiencies and ensure that appraisal occurs at an appropriate level of granularity.  I should also note that we take a nimble and common-sense approach to appraisal: not all procedures will be required for all accessions, and in cases where donors provide detailed descriptive information for fairly homogeneous content, appraisal may be fairly minimal.

Characterizing Content

One of the first steps we take with a new digital accession is to get a high-level understanding of the volume, diversity, and nature of files.  We currently glean much of this information from TreeSize Professional, a proprietary hard disk space manager from JAM Software (similar open-source applications include WinDirStat and KDirStat).

Directory Structure

The tree command line utility, available in both Windows and Linux/OS X shells, provides a simple graphical representation of the directory structure in an accession.

For very large or complex folder hierarchies, it may be difficult to keep track of the parent/child relationships within the tree output.  In these cases, it may be easier to review the output of dir or ls (in the Windows CMD.EXE shell, dir /S /B /A:D [folder] will provide a recursive listing of all directories within a folder) or to review the directory structure in a file manager or other application.
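On the Unix side, a sketch of the equivalents looks like this (the sample hierarchy is hypothetical):

```shell
# Build a small sample hierarchy to illustrate (paths are hypothetical)
mkdir -p accession/correspondence/1990s accession/photographs

# `tree accession` draws the hierarchy graphically (if tree is installed);
# a flat recursive listing of directories, the Unix analogue of `dir /S /B /A:D`:
find accession -type d
```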

Relative Size of Directories

Knowing where the largest number or volume of files are located in a directory structure can be helpful in identifying areas of the accession that might require additional work or where more extensive content review will be required.  TreeSize Professional produces various visualizations of the relative size of directories in pie charts, bar graphs, and tree maps and permits archivists to toggle between views of the number of files, size, and allocated space on disk:

Clicking on an element of a graph or chart will permit you to view a representation of the next level, a process that may be repeated until you drill down to the files themselves.

Age of Files

Determining the 'age' of files requires analysis of filesystem MAC times (Modification, Access, and Creation times), which can be a little dicey, especially if content has been migrated from one type of file system to another (the specifics of which I won't try to get into...).  TreeSize permits archivists to create custom intervals to define the age of files and will create visualizations based on any of the MAC times (we generally use last modified, as it often coincides with creation dates and indicates when the content was last actively used).  Clicking on any of the segments in the graph will produce a list of all files associated with that interval:

While this information may not be useful if the donor has accidentally altered the timestamps during the transfer process, knowing that there are especially old files in an accession can help guide our review of content and prepare us for any additional preservation steps that might be required.  For instance, knowing that a collection includes word processing files in a proprietary file format from the 1990s might lead us to explore additional file format migration pathways if the content is of sufficient value.
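For a quick look at these timestamps without TreeSize, Python's standard library suffices; the following is a sketch (note that `st_ctime` is metadata-change time on Unix but creation time on Windows):

```python
# Inspect filesystem MAC times and flag files older than a cutoff.
import datetime
import os

def mac_times(path):
    """Return the modification, access, and change times of a file."""
    st = os.stat(path)
    return {'modified': datetime.datetime.fromtimestamp(st.st_mtime),
            'accessed': datetime.datetime.fromtimestamp(st.st_atime),
            'changed': datetime.datetime.fromtimestamp(st.st_ctime)}

def files_older_than(root, cutoff):
    """Yield files whose last-modified time predates a cutoff datetime."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if mac_times(full)['modified'] < cutoff:
                yield full
```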

File Format Information

We've also found it helpful to see information about the breakdown of file formats in an accession to better understand the range of materials and assist with preservation planning, in the event that high-value content is in a unique file format or is part of a complex digital object that requires additional preservation actions.  TreeSize presents a table of file format information (also available for download as a delimited spreadsheet or Excel file) that arranges content into file format types defined by the archivist ('video files', 'image files', etc.) and which includes the number and relative size of files associated with a given format.  A bar chart also provides a visual representation of the file format distribution; right-clicking on any format will give an option to see a complete listing (with full file paths) of associated content:

It's important to note that TreeSize Professional only reviews file extensions in producing these reports (as we're only looking for a general characterization of an accession, we can live with this potential ambiguity).  The 'Miscellaneous' or 'Unknown file types' in the first line of the above screen shot thus include files with extensions that have not been identified in the TreeSize user interface or operating system default applications for files.  If more accurate information is needed, we can run a file format identification utility such as DROID, fido, or Siegfried (all of which consult the PRONOM file format registry).  We actually have an additional step in our workflow that cycles through all files identified as having an 'extension' mismatch and uses the PRONOM registry and the TrID file identification tool to suggest more accurate extensions.
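An extension-based tally of the kind TreeSize produces can be approximated in a few lines of Python (a sketch that shares TreeSize's limitation of trusting extensions):

```python
# Tally the number of files per extension under a directory tree.
import collections
import os

def extension_counts(root):
    """Count files by lowercased extension; extensionless files get a bucket."""
    counts = collections.Counter()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or '(no extension)'
            counts[ext] += 1
    return counts
```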

Duplicate Content

TreeSize Professional also has a default search that will identify duplicate content within a search location based upon MD5 checksum collisions, with results available in a table (and also via spreadsheet export).

I've long felt that managing duplicate content in a digital accession can be tricky business due to the amount of work required to make informed decisions.  From an MPLP approach, it doesn't make sense to weed out individual duplicate files, especially when it can be difficult (if not impossible) to determine which version of a file may be the record version.  In addition, we often find that 'duplicate' files may actually exist in more than one location for good reason.  For instance, a report may have been created and stored in one part of a directory structure and then stored again alongside materials that were collected for an important executive committee meeting.  We've therefore resigned ourselves to having some level of redundancy in collections and primarily use duplicate detection to identify entire folders or directory trees that are backups and suitable candidates for separation or deaccessioning.
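Checksum-based duplicate detection of the kind TreeSize performs can be sketched with Python's hashlib:

```python
# Group files by MD5 checksum; any group with more than one member
# is a set of byte-identical duplicates.
import collections
import hashlib
import os

def duplicate_groups(root):
    """Return lists of file paths whose contents are byte-identical."""
    by_md5 = collections.defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            md5 = hashlib.md5()
            with open(full, 'rb') as f:
                # Read in chunks so large files don't exhaust memory
                for chunk in iter(lambda: f.read(8192), b''):
                    md5.update(chunk)
            by_md5[md5.hexdigest()].append(full)
    return [paths for paths in by_md5.values() if len(paths) > 1]
```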


Reviewing Content

While the steps described above help identify potential issues and information based on broad characterizations, we also manually review files to verify the potential presence of personally identifiable information and better understand the intellectual content of an accession.  In keeping with our MPLP approach, we only seek to review a representative sample of content (much as we do with physical items) and browse/skim documents to understand the nature and basic function of records.  In-depth review is reserved for particularly thorny arrangement/description challenges or high-value collections that require a more granular approach.

Identification of Personally Identifiable Information

We conduct a scan and review of personally identifiable information (such as Social Security numbers and credit card numbers) as a discrete workflow step, but it still constitutes an important aspect of appraisal.  At this point, we are using bulk_extractor (and primarily the 'accounts' scanner) with the following command (the '-x' options prevent additional scanners from running):

 bulk_extractor.exe -o [output\directory] -x aes -x base64 -x elf -x email -x exif -x gps -x gzip -x hiberfile -x httplogs -x json -x kml -x net -x rar -x sqlite -x vcard -x windirs -x winlnk -x winpe -x winprefetch -R [target\directory]  

We then launch Bulk Extractor Viewer, which allows us to review the potential sensitive information in context to verify if it represents a potential issue.

Based upon this review, we may delete nonessential content or use BEViewer's 'bookmark' feature to track content that will need to be embargoed with an appropriate access restriction.

Quick View Plus

The proprietary Windows application Quick View Plus (QVP) is our go-to tool when we need to review content.  In addition to being able to view more than 300 different file formats, QVP will not change the 'last accessed' time stored in the file system metadata.

The QVP interface is divided into three main parts in addition to the navigation menu and ribbon at the top of the application window. The right portion of the interface holds the Viewing Environment while the left-hand side is divided between the Folder Pane (which can also be used to review the directory structure) on the top and the File Pane on the bottom.

After QVP opens, the right and left arrows may be used to expand/collapse subfolders and navigate to the appropriate location in the Folder Pane. Once a folder has been selected, a list of its contents (both subfolders and files) will be displayed in the File Pane; after a file is selected, it will appear in the Viewing Environment.  While we've noticed some issues with the display of PDF files, QVP meets the vast majority of our content review needs.  Moving to the browser-based (and open source) environment of Archivematica, it will be interesting to see how well we are able to view/render content using standard browser plugins.  We'll keep you posted...

Image Viewers: IrfanView and Inkscape

While QVP can handle pretty much every raster image we throw at it, the 'thumbnail' interface of IrfanView is pretty handy when we need to browse through folders that primarily contain images: 

QVP is not able to render vector graphic files and so we employ Inkscape when we encounter such content.  

It's open source and freely available (and can also be used via the command line for file format conversion)--if you haven't checked it out, have at it!

Sound Recordings and Moving Images

When it comes to reviewing sound recordings and moving images, VLC Media Player is our preferred app.  An open-source project that supports a ton of audio and video codecs, VLC permits you to load an entire directory of audio/video as a playlist that you can then advance through.

Play controls are located at the bottom of the Media Player window; in addition to Play, Pause, and Stop buttons, the archivist may fast forward or reverse progress by adjusting the slider on the progress bar.

So...What's in Your Wallet?

At the end of the day, appraisal is about making informed choices concerning what we will include in our final collections and how that material will be arranged, described, and made accessible.  While some aspects can be automated (and there's clearly a lot more potential work that enterprising archivists/techies could explore via natural language processing, topic modeling, facial recognition software, automated transcription), there is also a need for human intelligence to decide what to deaccession, what to keep, and how it will be described.  Or at least that's our take--please feel free to share what tools and strategies you employ at your institution!

Tuesday, June 9, 2015

ArchivesSpace and Aeon Integration

In keeping with Max's recent post about digital preservation, digital humanities, and HASTAC 2015, the focus of this post will be about another topic that, while not directly related to our Archivematica-ArchivesSpace-DSpace Workflow Integration project, is an important component of our operations here at the Bentley, and one which is being taken into account as we are implementing new systems, particularly ArchivesSpace. 


The Bentley Historical Library has been using Atlas Systems' archives and special collections request and workflow management software, Aeon, since the beginning of 2015. Aeon has greatly improved our ability to meet the needs of our patrons, allowing them to register online, request materials in advance of their visit, track the status of their requests, and view their request history. It has also enhanced our ability to manage our collections, giving us the ability to know exactly which materials need to be retrieved from or returned to offsite storage, which materials are on hold for patrons, and which materials are in use by staff, whether they are in conservation, processing, duplication, or elsewhere. 

The degree to which Aeon has been integrated into daily operations at the Bentley means that any collection management and/or access systems we implement going forward will need to be able to communicate with Aeon. As briefly mentioned on this blog (and as discussed elsewhere), ArchivesSpace and Aeon are not currently integrated[2]. This is not an immediate deal-breaker, as it is likely that we will still be using DLXS to provide access to our finding aids for some time after we have migrated to ArchivesSpace. Our patrons are already able to request materials from our finding aids in DLXS (check out the checkboxes) or through the University's catalog, Mirlyn. However, ArchivesSpace and Aeon integration would give us the ability to manage our collections with Aeon much more efficiently by, for example, sending the exact shelf location for materials to Aeon, providing detailed information about access restrictions, letting us know which materials are already checked out or on hold when a request is submitted (hooray for unique identifiers!), among other exciting possibilities.

So, while the bad news is that ArchivesSpace and Aeon integration has not happened yet, the good news is that it's definitely coming.

Northeast Aeon Users' Group Meeting

Last week I had the opportunity, along with my colleague Matt Adair, to attend the Northeast Aeon Users' Group Meeting at Yale University. The meeting included attendees from numerous institutions that either are currently or will soon be using both ArchivesSpace and Aeon, and the discussions and presentations reflected that fact.

The importance of future integration between the two systems was such an important concern for many of the meeting's attendees that the evening before the official meeting was dedicated solely to a discussion of potential integration between ArchivesSpace and Aeon. Representatives from Lyrasis, Atlas Systems, and numerous archives and special collections got together to discuss their status with ArchivesSpace and Aeon implementations and their desired features for future integration. Among the most discussed integration features were:

Locations Locations Locations

As a result of the recently released ArchivesSpace container management plugin, archival objects in ArchivesSpace can be linked to their 'top containers' in a much more meaningful way than before, and container locations, barcodes, profiles, and other characteristics can be managed separately from, while still remaining linked to, components of archival description. One of the intriguing possibilities of this is that patrons might not need to know or care about which box a particular component of archival description is in; as long as ArchivesSpace knows that an archival object is linked to a particular physical container, and that that physical container is located on a particular shelf in the stacks, patrons could request materials based on the information that they want/need to know, and in the background ArchivesSpace could send Aeon the information that reference staff need to know to retrieve the physical containers. This alone would be a boon to our use of Aeon at the Bentley, as currently reference staff must look up the location for each request in our FileMaker Pro database, which only contains locations for ranges of containers in a collection, not specific stacks locations for individual containers. Obviously, getting something like this to work for us would require further metadata cleanup and enrichment on our end (we don't have exact locations for specific containers or barcoded containers, for example), but the degree to which it would improve our ability to manage our collections in ArchivesSpace and in Aeon makes it a compelling possibility.


Access Restrictions

Currently, when a patron at the Bentley places a request for restricted materials, all Aeon is able to tell us is that there is some sort of restriction on the material, and the restriction information needs to be reviewed. In order to find out more, reference staff must look in the collection's finding aid, or in physical books containing detailed restriction information, to learn more about the details of the restriction. Are the materials closed for another 30 years? Are they open to patrons who have approval from the records' or papers' creator? Did the restriction actually expire 3 years ago? The answers to all of these questions can be found in our archival description, which means that ArchivesSpace could pass along that information to Aeon in a more sophisticated manner than just saying "Hey, there might be some sort of restriction here. Sort it out." This is another possibility afforded by ArchivesSpace and Aeon integration that would be a big improvement over our current practice.

Live Status Updates

Right now, requests for materials in Aeon (at least for us) are atomic: each request is a separate transaction, the requests don't know anything about one another, and we have no way of knowing, without performing a citation search in Aeon or going to a location in the stacks, whether a container is currently checked out, on hold, or otherwise unavailable or associated with an existing active request. This is another pain point that could be mitigated by ArchivesSpace and Aeon integration. Since all archival objects in ArchivesSpace get unique identifiers, and each archival object can be meaningfully linked to its top container, ArchivesSpace and Aeon could theoretically communicate with one another to provide status updates for materials. ArchivesSpace could, for example, display whether a container is on the shelf, checked out by another patron, on loan, or in processing, giving patrons a realistic understanding of how readily available certain materials are, and giving reference staff a heads up that containers are on hold, in conservation, or in another temporary location before they make the trek to the permanent stacks location, only to discover that the container is elsewhere.

Shopping Carts

There was a lot of discussion at the ArchivesSpace-Aeon integration meeting about future integration incorporating some sort of shopping cart functionality, allowing patrons to build a list of requests across separate collections and submit those requests to Aeon all at once. Check out the Rockefeller Archive Center for an excellent example of this kind of functionality already in operation, and get excited about the possibility!

Now, all of this is not to say that these features will definitely be incorporated into future ArchivesSpace-Aeon integration. The exact functional requirements for integration still need to be worked out, and both the ArchivesSpace and Aeon development teams have other development priorities to attend to. The discussion at the Aeon Users' Group Meeting, however, was so overwhelmingly positive and enthusiastic about future integration, and the discussed features were so functional and broadly applicable, that I am confident that good things are coming. Keep an eye on the ArchivesSpace and Aeon email lists for updates!

[2] This is not 100% true. Check out the code and take a look at basic integration in action at USC.

Friday, June 5, 2015

Digital Preservation is for People: An Archivist's Take on the Digital Humanities

This past week, Dallas and I, along with a number of folks from the Bentley Historical Library, attended the Humanities, Arts, Science, and Technology Alliance and Collaboratory (HASTAC) 2015 Conference in Lansing, MI. HASTAC is one of the premier digital humanities conferences around.


While not directly related to our Archivematica-ArchivesSpace-DSpace Workflow Integration project, HASTAC served as a good reminder of why it is we're doing what we're doing.

**Spoiler Alert!** 
It's not so that it's easier to manage or preserve digital archives (although we're very excited about that, thank you very much).

Digital Preservation is for People

The most important reason we're doing what we're doing is for present and future people [1]. People just like the ones we heard from at HASTAC. People who use our digital archives. People who get frustrated with them (and there were plenty of those). And yes, even those people who create their own "digital archives" (as much as we like to look down on their efforts with disdain--and throw what they do in quotes!--and think to ourselves: "That's not really a digital archive," or "That's not how I use the word curate.").

If you were to ask me, the most important finding of the 2015 National Agenda for Digital Stewardship is that we have to do a better job of connecting to researchers (they also found that we have to do a better job of connecting to the creator community, which I'll touch on a bit in this post).

That is, we have to start thinking outside of a well-known box...

OAIS Reference Box Model [2]

...and start thinking about those people on the margins for whom we do what we do: Consumers (and Producers). Attending HASTAC was one of the ways we're trying to do just that.

The People

Whether you call them Consumers (with a capital "C") or consumers, researchers, users, end users, or just plain people, we certainly heard from them at HASTAC. A lot of them were from one particular Designated Community (to borrow another term from OAIS), that is, digital humanists (certainly not the consumer I think most archivists have in mind when they create digital archives), and some were librarians and archivists (the consumer I think most archivists have in mind when they create digital archives). Here's what I found out...

People Use Our Stuff in Interesting Ways

One of the most exciting things I learned at HASTAC was that people actually use the digital archives we create. And not just to look at pretty pictures. They use them in new and exciting ways that push the boundaries of knowledge:

A traveler puts his head under the edge of the firmament [a metaphorical illustration of either the scientific or the mystical quests for knowledge] in the original (1888) printing of the Flammarion engraving. [3] [4]

On this point I'd like to quote from the National Agenda for Digital Stewardship:

Researchers increasingly seek not only access but enhanced use options and tools for engaging with digital content... Models for access continue to evolve as methods for analyzing and studying contemporary born-digital and historic digitized materials are available. 

One of the most exciting examples of this is the Australian Federal Election Speeches site, which allows you to explore and visualize speeches by Australian politicians in exciting ways. Built by a "Roman historian turned digital humanist," Fiona Tweedie, the site draws the textual data underlying these visualizations from the Museum of Australian Democracy. In fact, Dallas and I attended a Software Carpentry workshop she co-taught at HASTAC where we used the Natural Language Toolkit to do some basic frequency analysis on, for example, a collection of inaugural addresses by American presidents (turns out the words "fear" and "terrorism" are much more common than they used to be)--the building blocks of a site like that one.
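To give a flavor of what we did in that workshop, here is a minimal standard-library sketch of that kind of word-frequency analysis. The workshop itself used NLTK (whose FreqDist class plays the role of Counter below, run against its corpus of inaugural addresses); the snippet of text here is a made-up stand-in, not a real address.

```python
import re
from collections import Counter

# A stand-in snippet; the workshop counted words across the full text
# of every inaugural address in NLTK's corpus.
address = """
We have nothing to fear but fear itself. Freedom and fear are at war,
and freedom is winning.
"""

def word_frequencies(text):
    """Lowercase the text, tokenize on letters/apostrophes, and count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

freqs = word_frequencies(address)
print(freqs["fear"])          # → 3
print(freqs.most_common(3))   # the three most frequent words, with counts
```

Run this per-address and plot the counts over time, and you have the basic building block of a visualization like the election speeches site.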

I know what you're thinking: Cool! So what? Well, these new uses of digital archives have big implications for how we process them and make them accessible to people. "PDF is where DHers go to die!" was one of my favorite quotes from the conference. While it's true that PDF/A meets archival standards, and that page-turning applications are cool, sometimes things like this actually inhibit the work that scholars want to do. To do this type of analysis, digital humanists would prefer some way to download all or a portion of the plain text of a digital archive. It's also nice when metadata is clean (more on that later), structured, and available for download in a similar way.

People Get Frustrated When They Use Our Stuff

Another very common theme at HASTAC was that when people use our stuff, they often get frustrated. Here are a couple of things I heard, in no particular order (all archives have been anonymized to protect their identities):

  • Some archives are "not functional." Stated by Owen Fenton in a presentation on using a newspaper archive to trace the development of Northern Irish identity. 
  • Inconsistent metadata "shattered my dreams." Exclaimed (truly exclaimed!) by Frederico Pagello, describing the moment he realized that he would not be able to analyze all European crime fiction using the fairly comprehensive but very dirty bibliographic records he had been collecting. 
  • Digital archives are "stressful." This one I heard through the grapevine, but I think it had something to do with analyzing Enron emails.

I think these observations can break down into two categories: usability of our access mechanisms and metadata. On usability, I'd like to quote again from the 2015 National Agenda for Digital Stewardship: "Usability is increasingly a fundamental driver of support for preservation, particularly for ongoing monetary support." Read: People give us money for digital archives when they like what they see on the other end (online). We can get all nerdy about file formats and storage configurations, but I promise you that we're the only ones that get excited about these things. We will also never be able to hang our hats on the fact that we saved some bits if nobody ever uses them, or starts to but quits because they aren't usable.

So what do we do? First, we have to acknowledge that access, use, and re-use are as important as preservation in digital curation (so yes, the website is part of your job). Then, work to make things better, little-by-little. What do you make better? Here are a couple of ideas:

On metadata, I think we all already understand the issue. But how do we fix it? First, we have to get over the fact that we have dirty metadata. We can blame it on our predecessors all we want (and I'm as guilty about this as anyone else), but that doesn't actually help. Describing collections is the most time-consuming and labor-intensive part of the process because it has to be done by humans, and humans make mistakes. Deal with it.

After we get over it, we then have to commit to finding ways to make things better. I'd suggest that an important second step to addressing the metadata issue is to have a single system of record, wherever that may be. Having three places where you record descriptive information actually makes cleanup harder. Finally, start cleaning. Whether that means having a system in place to correct mistakes as you find them, little-by-little, or whether you're migrating to ArchivesSpace and, as part of your legacy EAD import, decide that you have an unprecedented opportunity to systematically clean your metadata, get it done!

I should also note here that digital humanists get frustrated with all of the errors in OCR'd text. For better or worse, however, this didn't make me feel very compelled to change the way we process collections (except for maybe, I'll concede, very important--and very small!--collections, or by investing in OCR research and development). Transcription is time-consuming and expensive! MPLP all the way!

People Create Their Own Digital Archives

Finally, I met a lot of non-archivist people who create their own digital archives, and mean many different things when they say digital archive. There were many, many examples of this. Here are just a few:

  • Fortepan Iowa, an initiative by a historian and a communications professor (and a programmer) to make digitized images of Iowans through the years available online;
  • Nashville's New Faces, a not-quite-ready-for-production effort by a digital humanist to let immigrants from all over the world tell their own stories;
  • a map of installations along the Way of Santiago de Compostela;
  • Hoccleve Archive, an archive of resources, built by a historian, for scholars, teachers, and students interested in Thomas Hoccleve, his works, and their textual history; and
  • citizen archivists! There was a lot of talk about these people. I had to throw them in somewhere.

In addition to creating their own digital archives, people like to contribute to existing digital archives, and see this as a way to overcome the mistrust that exists between people who have been marginalized and institutions like archives that often represent "official" history. Another of my favorite quotes from HASTAC was that "Crowdsourcing is counter-hegemonic." (As it turns out, I'm all about crowdsourcing, but I won't get into it here because it's slightly out of scope.)

So what does all this mean? How can we support these folks? I have to admit, I'm a bit at a loss here. I know there is something to be said about the role of a digital archivist, and how sometimes it's educational/consultative and not practical/hands-on. I also know there's something to be said about personal digital archiving initiatives. I'd be curious to know in the comments, though, if anyone has any specific ideas about how to support researchers who want to build digital archives for their own research, and the digital archives that they create.

Summing Up

By way of conclusion, HASTAC served as a good reminder that the reason (or at least one of the reasons) we do digital preservation is for the people who use the digital archives that we preserve and provide access to. They're important, and we can learn a lot from listening to them!

One last thing I'll mention. The "myth of immateriality" seems to be widespread in the digital humanities world (maybe for those who are humanists before digital humanists). While a couple of speakers acknowledged that the binary of analog and digital was a false binary (yes!), there were a number of speakers who seemed to imply that "the analog" is more real than "the digital." Time for some evangelism!

[1] Websites are for People, too. Thanks, Matt! I hope you don't mind that I borrowed your title.
[2] "OAIS-" by Poppen - Own work. Licensed under Public Domain via Wikimedia Commons.
[3] "Flammarion" by Anonymous - Camille Flammarion, L'Atmosphere: Météorologie Populaire (Paris, 1888), p. 163. Licensed under Public Domain via Wikimedia Commons.
[4] This image appeared in Scott B. Weingart's opening keynote, Connecting the Dots. I highly recommend it.

P.S. I'm very proud of myself for not making even one joke about a needle in a HASTAC.