Tuesday, August 9, 2016

Archivematica Users Group @ SAA

Greetings, all! The Bentley's Mellon grant team had a busy and exciting time last week in Atlanta during the annual meeting of the Society of American Archivists (SAA).  One of the highlights was Dallas's and my demonstration of current functionality in our ArchivesSpace-Archivematica-DSpace Workflow Integration project during the Archivematica Users Group meeting (hosted by the ever-gracious Dan Gillean) on Wednesday, August 3.
We've given a lot of demos over the past year, at conferences as well as to individual institutions and groups (including the Digital Preservation Coalition), but this presentation really stood out for us. We always get a lot of great questions from folks, but this time several individuals also suggested new and exciting functionality that could be added to the Appraisal Tab in future releases.

Thinking Bigger (and Better!)

First, Seth Shaw, Assistant Professor of Archival Studies at Clayton State University (and developer of the Data Accessioner), pointed out that a tree map visualization would be a helpful addition to the 'Analysis Pane' in the Appraisal Tab.

As it now exists, the Analysis Pane includes a tabular report on file format distributions in a given transfer as well as pie charts that depict this distribution by (a) number of files per format and (b) total volume of the respective formats:

Analyze this!

A tree map would give archivists an alternative means of visualizing the relative size of directories and files and give insight into where given file types are located. This information could be very helpful in terms of comprehending directory structure, understanding the nature of content in a transfer, and identifying content that might require additional resources during ingest (such as large video files).
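To give a rough sense of what such a view involves (outside of Archivematica entirely), here's a minimal Python sketch that draws a treemap of top-level directory sizes in a transfer. It assumes the third-party squarify and matplotlib libraries, and the transfer path is made up; a real Appraisal Tab treemap would of course be interactive and tied to the transfer backlog, so this just illustrates the idea:

```python
import os
import matplotlib.pyplot as plt
import squarify  # third-party treemap helper; pip install squarify

transfer = "/path/to/transfer"  # hypothetical transfer directory

# Total up the size of each top-level directory in the transfer
sizes = {}
for entry in os.scandir(transfer):
    if entry.is_dir():
        total = 0
        for root, _, files in os.walk(entry.path):
            total += sum(os.path.getsize(os.path.join(root, f)) for f in files)
        sizes[entry.name] = total

# Draw one rectangle per directory, scaled by its total size
labels = [f"{name}\n{size / 2**20:.1f} MB" for name, size in sizes.items()]
squarify.plot(sizes=list(sizes.values()), label=labels, alpha=0.8)
plt.axis("off")
plt.show()
```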

It's also important to note that not all tree maps are created equal, as different instantiations have different affordances.  For instance, TreeSize Professional yields a visualization that includes labels of directories and file formats and uses color coding to show the relative depth of materials in the folder structure of a transfer, but doesn't represent individual files:

Whose size? TreeSize!

WinDirStat, on the other hand, color codes individual file formats, represents individual files in the tree map, and highlights directories or file format types based upon the user's selection from its directory tree or file format list:

The Colors, Children!

Next, Susan Malsbury, Digital Archivist at NYPL, asked about the potential of including Brunnhilde in the Analysis Pane.  For those of you who are not in the know (which very recently included me!), "Brunnhilde" is (to quote its developer, Digital Archivist Tim Walsh)
a Python-based reporting tool for born-digital files that builds on Richard Lehane's Siegfried. Brunnhilde runs Siegfried against a specified directory, loads the results into a sqlite3 database, and queries the database to generate CSV reports to aid in triage, arrangement, and description of digital archives. Reports include:
  • Sorted file format list with count
  • Sorted file format and version list with count
  • Sorted mimetype list with count
  • All files with Siegfried errors
  • All files with Siegfried warnings
  • All unidentified files
  • All duplicates (based on a Siegfried-generated md5 hash)
Walsh's tool could provide much more granular information about the contents of a transfer; combined with visualizations, it would offer additional and highly interesting ways to review and appraise digital archives.
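For the curious, the Siegfried-to-SQLite pattern Walsh describes boils down to something like the sketch below. To be clear, this is not Brunnhilde's actual code; it assumes Siegfried's sf command is on your PATH, a made-up target directory, and column names from a typical single-identifier CSV export (check your own output's header):

```python
import csv
import io
import sqlite3
import subprocess

target = "/path/to/accession"  # hypothetical directory to scan

# Run Siegfried with CSV output (requires sf on the PATH)
sf_csv = subprocess.run(["sf", "-csv", target], capture_output=True,
                        text=True, check=True).stdout

# Load the rows into an in-memory SQLite database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (filename TEXT, format TEXT, version TEXT, "
             "mime TEXT, basis TEXT, warning TEXT, errors TEXT)")
for row in csv.DictReader(io.StringIO(sf_csv)):
    # Column names assume a single-identifier (PRONOM) CSV export;
    # adjust to match the header Siegfried actually produces for you.
    conn.execute("INSERT INTO files VALUES (?, ?, ?, ?, ?, ?, ?)",
                 (row.get("filename"), row.get("format"), row.get("version"),
                  row.get("mime"), row.get("basis"), row.get("warning"),
                  row.get("errors")))

# One Brunnhilde-style report: sorted file format list with counts
for fmt, count in conn.execute(
        "SELECT format, COUNT(*) AS n FROM files GROUP BY format ORDER BY n DESC"):
    print(f"{count}\t{fmt or '(unidentified)'}")
```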

Malsbury also raised the question of how the Appraisal Tab's new functionality could accommodate disk images.  While Archivematica does have some support for transfers composed of disk images, our use cases for the grant project did not specifically address this content type.  As Gillean noted, this question calls for additional cross-platform workflow integration.  Since BitCurator is designed to handle disk images, it makes sense for members of the open source digital archives community to explore how it can work in conjunction with Archivematica rather than replicate its functionality in the latter platform.  (A sidebar conversation Max Eckard and I had with Sam Meister from Educopia and the BitCurator Consortium confirmed that this is an important area of inquiry...)

Next Steps...

We're in the final stretch of our grant project and—sad as it is to say—have come to realize that all the awesome ideas we've had for the Appraisal Tab aren't going to make it into the final product.  We will, however, have achieved all the major goals and deliverables that we established at the outset:
  • Introduce functionality into Archivematica that will permit users to review, appraise, deaccession, and arrange content in a new "Appraisal and Arrangement" tab in the system dashboard.
  • Load (and create) ASpace archival object records in the Archivematica "Appraisal and Arrangement" tab and then drag and drop content onto the appropriate archival objects to define Submission Information Packages (SIPs) that will in turn be described as 'digital objects' in ASpace and deposited as discrete 'items' in DSpace.   
  • Create new archival object and digital object records in ASpace and associate the latter with DSpace handles to provide URIs/'href' values for <dao> elements in exported EADs.
All the same, we're thrilled by the realization that the Appraisal Tab as it will exist in the upcoming version 1.6 of Archivematica is really just a beginning.  By developing the Appraisal Tab and introducing basic appraisal functionality (file format characterization, sensitive data review, file preview, etc.), we've dramatically lowered the barrier for other institutions that want to integrate new tools or introduce new features.  (And yes, I did borrow liberally from Dan Gillean for that last thought!)

We're really excited to see where other institutions and developers take the Appraisal Tab. I, for one, would love to see textual analysis and named entity recognition tools like those in ePADD (or the other projects identified by Josh Schneider and Peter Chen in this great post from the SAA Electronic Records Section blog).

What features or functionality would you like to see in the Appraisal Tab?  What questions do you have about our current processes? Please reach out to us via the comments section or email.

Thanks for reading; live long and innovate!

Friday, July 29, 2016

The Archival Integration Team at SAA2016

The Bentley's ArchivesSpace-Archivematica-DSpace Workflow Integration project team will be out in full force at the Society of American Archivists' Annual Meeting in Atlanta next week, where we'll be talking about some of the topics we've written about in detail on this blog, including our ArchivesSpace/Archivematica integration work, implementing new systems, preparing legacy description for migration to ArchivesSpace, and appraising digital content.

Be sure to add some of the following to your sched and stop by to get up-to-date information about our project or just to say hello!

Tuesday
ArchivesSpace Member Forum - During the "ArchivesSpace Integrations - A Status Report and Look Ahead" session, I will be presenting on ArchivesSpace and Archivematica integration, focusing in particular on the ArchivesSpace pane in the new Archivematica Appraisal and Arrangement tab, some of the decisions we've made about creating and editing ArchivesSpace archival objects and structuring ArchivesSpace digital objects in Archivematica, and planned future enhancements (such as modifications to the ArchivesSpace Rights Statements module to facilitate mapping Archivematica PREMIS Rights to ArchivesSpace).
Wednesday
Archivematica Users Group - Mike and I will be giving a brief presentation about and demonstration of the current Archivematica Appraisal and Arrangement tab.
Thursday
Acquisitions and Appraisal Section - Mike will be discussing the appraisal of born-digital content as part of a "discussion with several panelists who respond to an appraisal- and acquisitions-related scenario."
Friday
Graduate Student Poster Presentations - Devon will be presenting a poster on his contributions to our (recently completed!) project cleaning, reconciling, and ultimately "Preparing Legacy Finding Aids for Ingest into ArchivesSpace."
Session 506: You Are Not Alone! Navigating the Implementation of New Archival Systems - Max will be talking about our ArchivesSpace implementation (there will also be presentations about Archivematica implementation during this session).
Reference, Access, and Outreach Section - Max will be showcasing (along with our colleagues Cinda and Melissa) some of the ways in which the Bentley provides access to digital content in the RAO's Marketplace of Ideas.

Wednesday, June 8, 2016

Born-Digital Data: What Does It *Really* Look Like? (Research Data Redux)

This is a follow-up to Jenny Mitcham's recent Research data - what does it *really* look like post, and in particular, these questions she posed:
I'd be interested to know whether for other collections of born digital data (not research data) a higher success rate would be expected? Is identification of 37% of files a particularly bad result or is it similar to what others have experienced?

Background

Extracting technical metadata with the file profiling tool DROID has been part of our digital processing procedures for born-digital accessions since the beginning, so to speak, about 2012. Right before deposit into DeepBlue and our dark archive, a CSV export of DROID's output gets included in a metadata folder in our Archival Information Packages (AIPs). Kudos to Nancy Deromedi and Mike Shallcross for their foresight and for their insistence on standardizing our AIPs. It made my job today easy!

At first I was thinking that I'd write a Python script that would recursively "walk" the directories in our dark archive looking for files that began with "DROID_" (our standard prefix for these files) and ended with ".csv". That would have worked, but I'm a bit paranoid about pointing Python at anything super important (read: pointing my code at anything super important), and making a 1.97 TB working copy wasn't feasible. So, I took the easy way out...
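(For the record, that scripted approach would have amounted to something like the following sketch; the dark archive path here is made up, and you'd want to point it at a copy anyway.)

```python
import os

dark_archive = r"\\path\to\dark-archive"  # hypothetical; don't point this at the real thing!

droid_csvs = []
for root, dirs, files in os.walk(dark_archive):
    for name in files:
        # Our standard naming convention: DROID_<something>.csv
        if name.startswith("DROID_") and name.lower().endswith(".csv"):
            droid_csvs.append(os.path.join(root, name))

print(f"Found {len(droid_csvs)} DROID reports")
```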

First, I did a simple search (DROID_*.csv) in Windows Explorer...


...made my working copy of individual DROID outputs (using TeraCopy!)...



...and wrote a short script to make one big (~215 MB) DROID output file.



These are not the DROIDs you're looking for.


Note that I had the script skip over folders (because we're only interested in files here), packaged files like ZIPs (because DROID looks in [most of] these anyway), and any normalized versions of files, which I could identify because they get a "_bhl-[CRC-8]" suffix. Kudos again to Nancy and Mike for making this easy.
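Here's a rough sketch of that kind of combine-and-filter step (not the actual script). It assumes the individual reports share a header, the TYPE and NAME column names come from a standard DROID CSV export (yours may vary by DROID version), the paths are made up, and the suffix pattern is deliberately loose:

```python
import csv
import glob
import os
import re

working_copy = r"C:\working\droid-reports"       # hypothetical working-copy location
combined_path = r"C:\working\droid_combined.csv"  # hypothetical output file

# Normalized derivatives get a "_bhl-<CRC>" suffix before the extension
normalized = re.compile(r"_bhl-[0-9A-Fa-f]+\.")

with open(combined_path, "w", newline="", encoding="utf-8") as out:
    writer = None
    for report in glob.glob(os.path.join(working_copy, "DROID_*.csv")):
        with open(report, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                # Skip folders and packaged files (DROID profiles most archive
                # contents anyway), plus our normalized derivatives.
                if row.get("TYPE") in ("Folder", "Container"):
                    continue
                if normalized.search(row.get("NAME", "")):
                    continue
                if writer is None:
                    writer = csv.DictWriter(out, fieldnames=list(row.keys()))
                    writer.writeheader()
                writer.writerow(row)
```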

All of the data (about 3/4 million individual files!) in this sample represents just about anything and everything born-digital that we've processed since 2012... basically anything related to our two collecting areas: the University of Michigan and the state of Michigan. I'd guess that much of it is office documents and websites (and recently, some Twitter archives). The vast majority of the data was last modified in the past 15 years, and our peaks are in 2006 and 2008. The distribution of dates is illustrated below...


Here are some of the findings of this exercise (a rough sketch of how tallies like these can be pulled from the combined CSV follows the lists below):

Summary Statistics

  • DROID reported that 731,949 individual files were present
  • 658,520 (89.9%) were given a file format identification by DROID
  • 657,808 (99.9%) of those identified files were given just one possible identification. 610 files were given two different identifications, 1 file was given three different identifications, 3 files were given five different identifications, 13 files were given six different identifications, 45 files were given seven different identifications, 28 files were given eight different identifications, and a further 12 were given nine different identifications. Across these multiply-identified files, the identification for 331 files was done by signature and the identification for 380 files was done by extension.

Files that Were Identified

  • Of the 658,520 files that were identified:
    • 580,310 (88.1%) were identified by signature (which, as Jenny suggests, is a fairly accurate identification)
    • 13,478 (2%) were identified by extension alone (which implies a less accurate identification)
    • 64,732 (9.8%) were identified by container. Like Jenny said, these were mostly Microsoft Office files, which are types of container files (and container identification still suggests a high level of accuracy)
    • Lots of these were HTML and XML files, although there were some Microsoft Office files as well
  • 180 different file formats were identified within the collection of born-digital data
  • Of the identified files, 152,626 (23.2%) were HTML files. This was by far the most common file format identified within the born-digital dataset. The top 10 identified formats are as follows:
    • Hypertext Markup Language - 152,626
    • JPEG File Interchange Format - 142,161
    • Extensible Hypertext Markup Language - 62,039
    • JP2 (JPEG 2000 part 1) - 56,986
    • Graphics Interchange Format - 48,317
    • Microsoft Word Document - 38,459
    • Exchangeable Image File Format (Compressed) - 18,826
    • Microsoft Word for Windows Document - 18,140
    • Acrobat PDF 1.4 - Portable Document Format - 17,840
    • Acrobat PDF 1.3 - Portable Document Format - 10,875

Files that Weren't Identified

  • Of the 73,421 files that weren't identified by DROID, 851 different file extensions were represented
  • 1,888 (2.6%) of the unidentified files had no file extension at all
  • The most common file extensions for the files that were not identified are as follows:
    • emlx - 21,987
    • h - 8,545
    • cpp - 8,501
    • htm - 8,032
    • pdf - 5,216
    • png - 4,250
    • gif - 2,085
    • dat - 1,419
    • xml - 1,379
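As promised above, here's a rough sketch of how tallies like these can be pulled from the combined CSV with pandas. It assumes a one-row-per-file DROID export (so the PUID, METHOD, FORMAT_COUNT, and FORMAT_NAME columns are present) and a made-up file name; rows with multiple identifications will need a little extra handling:

```python
import pandas as pd

droid = pd.read_csv("droid_combined.csv")  # hypothetical combined report

total = len(droid)
identified = droid[droid["PUID"].notna()]  # rows with at least one format match
print(f"{total} files, {len(identified)} identified ({len(identified) / total:.1%})")

# Breakdown by identification method (Signature, Extension, Container)
print(identified["METHOD"].value_counts())

# How many possible identifications each file received
print(identified["FORMAT_COUNT"].value_counts().sort_index())

# Top 10 identified formats (rough: multi-identification rows count once)
print(identified["FORMAT_NAME"].value_counts().head(10))
```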

Some Thoughts

  • Like Jenny, we do have a long tail of file formats, but perhaps not quite as long as the long tail of research data. I actually expected it to be longer (10.1% unidentified seems pretty good... I think?), since at times it feels like, as a repository for born-digital archives, we get everything and the kitchen sink from our donors (we don't, for example, require them to deposit certain types of formats), and because we are often working with older (again, relatively speaking) material.
  • We too had some pretty common extensions (many, in fact) that did not get identified (including the .dat files that Jenny reported on). Enough that I feel like I'm missing something here...
  • In thinking about how the community could continue to explore the problem, perhaps a good start would be defining what information is useful to report out on (I simply copied the format from Jenny's post) and hearing from other institutions. It seems like it should be easy enough to anonymize and share this information.
  • What other questions should we be asking? I think Jenny's questions seem focused on their goal of feeding information back to PRONOM. That's a great goal, but I also think there are ways we can use this information to identify risks and issues in our collections and ensure that our or our patrons' technical environments support them, as well as to advocate in our own institutions for more resources.
And, if you haven't yet, be sure to check out the original post and subscribe to that Digital Archiving at the University of York blog! Also be sure to check out the University of York's and University of Hull's exciting, Jisc-funded work to enhance Archivematica to better handle research data management.

[1] I think Jenny's only interested in original files, but an interesting follow-up question might be what percentage of files we were able to normalize...

Monday, May 16, 2016

Introduction to Free and/or Open Source Tools for Digital Preservation


Over the weekend, Mike, Dallas and I gave a workshop entitled "Introduction to Free and/or Open Source Tools for Digital Preservation" as part of the Personal Digital Archiving 2016 conference. This hands-on workshop introduced participants to a mix of open source and/or free software (and some relatively ubiquitous proprietary software) that can be used via the command line or graphical user interfaces to characterize and review personal digital archives and also perform important preservation actions on content to ensure its long-term authenticity, integrity, accessibility, and security. It was awesome!

After introductions, we discussed:
  • Digital Preservation 101
    • Definitions
    • Challenges
    • Models
  • Tools and Strategies (the hands-on part!)
    • Characterizing and reviewing content
      • WinDirStat
      • DROID
      • bulk_extractor
    • File format transformations (I discussed this a bit in a recent blog post on the theory behind file format migrations)
      • Still Images
        • IrfanView
        • ImageMagick
      • Text(ual) Content 
        • Adobe Acrobat Pro
        • Ghostscript
      • Audio and Video
        • ffmpeg
        • HandBrake
    • Metadata for Digital Preservation
      • Descriptive Metadata
        • Microsoft Word
        • Adobe Acrobat
      • Technical Metadata
        • ExifTool
      • Preservation Metadata
        • MD5summer (see the quick sketch after this list)
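To give a flavor of that last exercise, what a checksum tool like MD5summer does boils down to a few lines of Python: walk a folder and write an MD5 manifest. The folder name below is made up, and this is a bare-bones stand-in, not a replacement for the real tool:

```python
import hashlib
import os

archive = "my_personal_archive"  # hypothetical folder to fixity-check

with open("checksums.md5", "w", encoding="utf-8") as manifest:
    for root, dirs, files in os.walk(archive):
        for name in files:
            path = os.path.join(root, name)
            md5 = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    md5.update(chunk)
            # One "<hash>  <relative path>" line per file, like an .md5 manifest
            manifest.write(f"{md5.hexdigest()}  {os.path.relpath(path, archive)}\n")
```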

In case you're interested, we thought we'd make the slides...



...and exercises....



available to a wider audience! Enjoy!

Monday, May 9, 2016

Grant Update: Extension through Oct. 2016

Greetings, all; while things have been a little quiet on our blog as of late, we've been as busy as ever on our Mellon-funded ArchivesSpace-Archivematica-DSpace Workflow Integration project.

Amidst the general hustle and bustle here in Ann Arbor, we neglected to mention that the Mellon Foundation approved an extension of our project through October 31, 2016.  While things were on course to be completed by the original deadline of April 30, we decided that an extension was necessary so that our consultants at Artefactual Systems could further refine the interface of Archivematica's new Appraisal Tab and thoroughly identify and fix bugs without rushing.  The extended period of time will also give archivists at the Bentley an opportunity to gain expertise with the new functionality and thereby document workflows that may be shared with the archives and digital preservation communities.

Current and upcoming work on the project includes:
  • Refactoring the Archivematica workflow to support the new packaging functionality (in both the Archivematica ‘Ingest’ pipeline as well as the platform’s ‘Storage Service,’ which is used to track and recompile completed AIPs).
  • Verifying that packaging steps are recorded accurately in Storage Service pointer files.
  • Evaluating PREMIS 2 vs. PREMIS 3 to decide how to best implement the preservation metadata for packaging (and implementing PREMIS 3 support, as needed).
  • Implementing user interface changes to support the new workflow (and also allow users to adhere to existing Archivematica workflows and AIP packaging procedures).
  • Establishing (and then verifying) a workflow and protocols to automate the transfer of data and metadata from Archivematica to DSpace.
  • User interface changes in the Storage Service.
We'll look to provide highlights of these processes in the coming months...so stay tuned!

Thursday, May 5, 2016

Migrating Files to Preservation Formats: Theory

Hello! It's been a while! On behalf of the Mellon crew, I'd like to first acknowledge that yes, it's been nearly eight weeks since our last post...

...but we're back! Mike, Dallas and I have been busy preparing for a workshop next week at Personal Digital Archiving (PDA) 2016 on free and/or open source tools for digital preservation. My part's on the theory and practice of migrating files to preservation formats, including tutorials for different file types (with single and batch conversion examples using both GUIs and the command line). I thought I'd share it here.

For your convenience, here's a little navigation bar that will be updated as new posts come out.
  • Theory
  • Practice:
    • Still Images
    • Text(ual) Content
    • Audio and Video

First up, some theory. 

I'll say right off the bat that, yes, even though migration is a standard part of our current (and future) ingest procedures, it is probably less important than good organization, description and redundant storage, especially for personal (i.e., lay person-al) digital archives. That being said, technology changes all the time. At some point you'll need a strategy to deal with the fact that you can't read your parents' extension-less Word 2.0 file from 1991 (which, by the way, you just know proves that you're the favorite child) on your fancy new computer with your fancy new version of Word, especially because it was saved on a 3.5-inch floppy disk and you don't have a drive for it. You've also probably never heard of DROID or anything like it, so you won't even know in the first place that it's a Word file.

The Performance Model and the Fundamental Nature of Digital Records


I believe I was first introduced to the National Library of Australia's Performance Model, detailed in their article entitled "An Approach to the Preservation of Digital Records," at the DigCCurr Professional Institute back in 2012. As is the case with just about any other model, it's overly simple and, mostly for that reason, it has its issues. It's also a bit dated (2002) and in places this shows. However, I think it does a good job of framing a discussion about migrating files to preservation formats, so we're going to use it!

First, let's think about the world of physical records. You might say it looks something like this:


In this world, a researcher can have a "direct experience" of a record, with just their eyes. No mediation required.

Now, let's think about the world of digital records which, you might say, looks something like this:


In this world, a researcher can't really have a direct experience of a record. If they did, they'd be looking at a bunch of meaningless (to human eyes, at least) 1s and 0s. They might also have escaped from the Matrix. Instead, some sort of process has to be performed on the source (no longer, as we'll see, considered the record), and the thing the researcher interacts with is a type of performance.

Put a little more concretely:


A source is basically a data file, like the Word 2.0 document provided as an example above or anything else you can think of that's sitting on your hard drive. The data file is formatted in a particular way, and gets processed by some combination of hardware and software that can understand it. Usually this combination needs to be fairly specific. To quote the article, "a Word source requires the correct version of the Word application, using a Windows operating system, which is installed on a suitable Intel computer" (p. 9). That's not entirely accurate, but I think you get the point. The performance, then, becomes the way that the data file, through the hardware and software, gets rendered on the screen. This performance, the authors argue, is what the researcher is after--not the original data file or the hardware and software.

Make sense?

Migration


Migration, in this context, is just a fancy word for converting a digital object from one data format to another, for example, from a Word 2.0 file to a Portable Document Format (PDF) file. We'll get lots of practice with this when we look later at migrating still images and text(ual) content as well as audio and video.

It's a strategy (I'll talk briefly about others later) that gets employed to handle the digital preservation scenario outlined above. In performance model terms, migration converts a source object from an obsolete or proprietary format into a current or open format so that a current process (hopefully a somewhat less specific combination of hardware and software) can render it. 
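To make that concrete, a single migration like the Word-to-PDF example can be scripted in a few lines. The sketch below assumes LibreOffice (its soffice executable on the PATH) as the converter and uses a made-up file name; it's one option among many, not a recommendation:

```python
import subprocess

# Convert a legacy word processing file to PDF with LibreOffice in headless mode.
# Older formats like Word 2.0 may or may not convert cleanly, which is exactly
# why we check the results afterwards (more on that below).
subprocess.run(
    ["soffice", "--headless", "--convert-to", "pdf",
     "--outdir", "migrated", "letter_1991.doc"],
    check=True,
)
```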

Consider this analogous (get it?) scenario from the audiovisual world:


Nitrate film, which requires a projector and screen to produce a moving image, is an unstable source. In this scenario, it gets migrated to video tape, recognized (at least in 2002) as a more stable source. This new source requires a new process to produce a moving image, namely, a VCR and TV. In both cases, though, and this is the important part, it's the performance (i.e., the moving image) that becomes the record that the researcher is interested in. The same is true for migrations like the ones we'll be doing.

The Concept of Essence, or Significant Characteristics


That all sounds fine and dandy until you start to think that changing files sounds (and, in fact, is) pretty risky! How do we ensure that the moving image from the videotape on a VCR and TV is the same moving image as the earlier one from the nitrate film on a projector and screen? I'll quote the article at length here:
The performance model demonstrates that digital records are not stable art[i]facts; instead they are a series of performances across time. Each performance is a combination of characteristics, some of which are incidental and some of which are essential to the meaning of the performance. The essential characteristics are what we call the ‘essence’ of a record.
These essential characteristics (also known as significant characteristics or, sometimes, significant properties, although this usage seems to be falling out of favor) are what's really important about a record; they provide "a formal mechanism for determining the characteristics that must be preserved for the record to maintain its meaning over time" (p. 13).

Consider our Word 2.0 file. It's a type of word processing document. The essential characteristics may include:
  • its message:
    • the textual content; and
  • the message's qualifiers:
    • formatting such as bolded text;
    • font type and size;
    • layout;
    • bullets;
    • color; and
    • embedded graphics.
These were, at least, the characteristics the author deployed to get the message across to a reader or to aid comprehension.

The lesson here is that migration is a great strategy for overcoming the challenges of digital preservation. If you have a way to check that significant characteristics for source files match on either side of a migration (we'll get some practice with this too in upcoming posts), it's an even better strategy. [1][2]
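To illustrate what such a check might look like for, say, still images: pull out a couple of characteristics you've decided are essential and compare them on either side of the migration. The sketch below uses the Pillow library (JPEG 2000 support requires OpenJPEG) and made-up file names; a real check would cover whatever characteristics you've actually defined for the format.

```python
from PIL import Image  # Pillow; pip install Pillow

def essential_characteristics(path):
    """Pull out the characteristics we've (hypothetically) deemed essential."""
    with Image.open(path) as img:
        return {"width": img.width, "height": img.height, "mode": img.mode}

before = essential_characteristics("scan_0001.tif")  # source file
after = essential_characteristics("scan_0001.jp2")   # migrated file

if before == after:
    print("Essential characteristics match.")
else:
    print(f"Mismatch! {before} vs {after}")
```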

Alright, enough theory.

The Elephant in the Room: Emulation


Since I'm talking to a bunch of archivists, I'll add that migration isn't the only strategy for overcoming the challenges of digital preservation. It's often contrasted with emulation, an approach that, using the Performance Model, "keeps the source digital object in its original data format but recreates some or all of the processes (for instance, the hardware configuration or software applications such as operating systems), enabling the performance to be recreated on current computers” (p. 12). There are convincing arguments to be made in favor of both approaches, and it seems like they all come down to what one considers to be the true essence of a digital record!

Now, migration definitely has its disadvantages. It's costly and time-consuming (but to be fair, so is emulation) and the actual process, in a production environment, is error-prone.[3] That being said, it's what's available for many institutions, at least in the US (Archivematica, Preservica and Rosetta, for example, all employ this strategy). I'll also reiterate here that migration and emulation are both probably secondary to good organization, description and redundant storage of digital archives, as well as migrating content off of legacy storage media. 

When I first learned about digital curation and preservation, I was taught that emulation was a kind of "will be nice after more research and development, maybe" strategy. I think this perception is still fairly common (we keep our original files around, even the weird ones, for example, just in case that research and development ever happens). However, I'm not so convinced anymore. Emulation as a Service seems particularly exciting for enabling emulation at scale, even at smaller institutions.

I won't pretend to be an expert on emulation, so if you want more background, there's a thread on the Signal (make sure you read the comments) that's worth a read.

In the end, migration doesn't have to be your exclusive, or even primary, digital preservation strategy. It is a trusted strategy, though, for many libraries and archives. If you'd like to explore it in a little more depth, stay tuned for upcoming posts!

---

[1] Of course, this is the kind of thing that sounds great in theory, but in practice I find it's really hard to define significant characteristics, especially the way we currently try to do it according to file format or type. Pagination of a word processing document, for example, could be an essential or incidental characteristic depending on how that document gets used, or if it needs to be cited. You'd also probably ruffle some feathers if you said explicitly (as the article says implicitly) that file format is an incidental characteristic.

[2] This kind of significant characteristics check (currently, at least) isn't part of Archivematica, unless you count checking that a migrated file's size isn't 0. That's not to call them out or anything, just to say that we should cover making this type of customization at Archivematica Camp 2016!

[3] We've done migrations on old word processing documents that resulted in 15-20 pages of random characters on either side of their one meaningful page--the essential characteristics were there, in other words, they were just mixed in with so many incidental characteristics they were hard to find. 

Monday, March 14, 2016

BHL @ code{4}Lib

Greetings, all; I had the great good fortune last week to attend and present at code{4}lib 2016 in Philadelphia. This was my first c4l conference and it was awesome! Good people, inspiring presentations, and an incredible location (the Liberty Bell was just a hop, skip and a jump away!).

I gave a very brief overview of our ArchivesSpace-Archivematica-DSpace Workflow Integration project—if you'd like to hear me speak a mile-a-minute (especially at the end!), please check out this video from the conference livestream:



As the video is kind of grainy, here are the slides themselves:



I'd like to expand upon what's 'On Deck' for us (especially since I ran out of time at the conference), but will save that for another day...  Thanks for stopping by!