Monday, December 12, 2016

October 13-14, 2016 Mid-Michigan Digital Practitioners Organization Meeting

It is known by many names. OAISers know it as the "Monitor Technology" and "Monitor Designated Community" functions of the Preservation Planning Functional Entity of the OAIS Reference Model, and PLANETS folks know it as "Preservation Watch." You may even know it as reconnoitering. Whatever you call it, it's the process of gathering information to stay current on trends, best practices and cutting-edge implementations in the digital preservation world so that you can bring them back to your institution.


Here in Michigan, we have an NDSA Innovation Award winning (that's right, NDSA Innovation Award winning!) organization that does just that, recognized at Digital Preservation 2016 for its "highly original and successful organizational model in fostering innovation sharing and knowledge exchange": the Mid-Michigan Digital Practitioners.


This is such a wonderful group of people. Founded in 2013, the mission of the group is to provide an open and local (and free!) forum for conversation, collaboration and networking for professionals working with digital collections in Michigan. Their biannual meetings, which rotate between institutions in the region, are designed democratically via pre- and post-conference surveys, typically draw 40 to 70 participants, and have attracted student groups, practicing professionals, vendors and even the general public.

October 13-14, 2016 Meeting


Mike, Dallas and I attended the most recent meeting on October 13-14, 2016 at the MSU Libraries in East Lansing.

The first day consisted of two workshops:
  • Policies and Strategies for Managing Digital Assets, in which attendees had the opportunity to:
    • fine-tune and define the scope of digital assets for their own institutions;
    • analyze needs and solutions for their digital asset management; and 
    • practice drafting policy or updating existing policy for digital assets.
  •  XML/XSLT, which covered:
    • basic syntax of XML/XSLT;
    • tools for working with XML/XSLT;
    • using MarcEdit to transform MARC data to MODS, Dublin Core, EAD, etc.;
    • editing XSLT stylesheets; and
    • real world examples of XSLT in action. 

As you can see, this was a nice mix of policy and technology.

The second day was full of excellent presentations.
  • Digital Detroit: Getting Started with Video, although billed as a "getting started" presentation on video digitization, was a deep dive into file formats, codecs, containers and compression, as well as capture devices, setup and more.
  • Merging Traffic: Accessing Archival Collections and Museum Artifacts Through a Common Interface discussed the Henry Ford's new Digital Collections portal (and all the behind the scenes magic that went into it).
  • APIs in the Library: Selected Projects that Expand the Local Information Environment was a super informative introduction to Application Programming Interfaces which I highly recommend to anyone who's ever wanted to get started with APIs (e.g., the ArchivesSpace API!).
  • Cataloging Archival Collections: Grouping Collections to Aid Retrieval gave an overview of one strategy for collocating collections in a discovery interface based on collection/research guides that the archivists have created to aid researchers.
  • In Access Update for GVSU, librarians there gave an update on where they were, where they are and where they're going with regard to providing access to digital collections at GVSU.
  • Using Open Refine for Data-driven Decision Making covered the way that one archivist used OpenRefine, a free, open source and powerful tool for working with messy data, to cut processing time by approximately 99.8%.
  • Turning Your Smartphone Into a Scanner for Cumbersome Projects: Did you know that you can use your smartphone to digitize hard-to-capture paper and photographic items and quickly get them out to researchers? (Hint: You can.)
  • Digital Scanning Course Materials discussed the excellent, progressive work of the MSU Course Materials Program, a one-stop shop at the MSU Library for faculty seeking to use third-party copyrighted content in their courses.
  • Legacy Computer Challenge gave an overview of the challenges encountered when working with two legacy computers, a Mac Classic II (1991) and a Power Mac 6500 (1997), which was a catalyst for discussing a MMDP Best Practices and Tech Exchange.

Finally, during an open mic session we gave an update on our ArchivesSpace-Archivematica-DSpace Workflow Integration project, which wrapped up shortly after that meeting.

All in all, a busy and informative day!

Want to Learn More?

You can learn more about this "regional collective of librarians, archivists, museum curators, conservators, historians, scholars and more engaged in creating and curating digital collections in Mid-Michigan and the surrounding region" on the Mid Michigan Digital Practitioners website or catch up with them on Twitter. You should also feel free to join the listserv.

If you're in the area, the next meeting is March 23-24, 2017 in Detroit... we'll see you there!

Friday, November 4, 2016

The End is Just a New Beginning!

Greetings, all; as hard as it is to believe, the Bentley Historical Library's ArchivesSpace-Archivematica-DSpace Workflow Integration project has come to a close.

October 31 marked the end of two and a half years of intense planning, development, and testing but it also signaled the beginning of a new phase as we here at the University of Michigan begin to implement the project outcomes in a full production environment.

While this development work won't be available until version 1.6 of Archivematica (release date forthcoming), we wanted to take this opportunity to glance backward and also look ahead...

Project Outcomes

In our inaugural blog post back on April 8, 2015, we identified three major development objectives for the project:
1. Introduce functionality into Archivematica that will permit users to review, appraise, deaccession, and arrange content in a new "Appraisal and Arrangement" tab in the system dashboard.
2. Load (and create) ASpace archival object records in the Archivematica "Appraisal and Arrangement" tab and then drag and drop content onto the appropriate archival objects to define Submission Information Packages (SIPs) that will in turn be described as 'digital objects' in ASpace and deposited as discrete 'items' in DSpace.  This work will build upon the SIP Arrangement panel developed for Simon Fraser University and the Rockefeller Archives Center's Archivematica-Archivists' Toolkit integration (as demonstrated around the 12 minute point of the first video here).
3. Create new archival object and digital object records in ASpace and associate the latter with DSpace handles to provide URIs/'href' values for <dao> elements in exported EADs.

 I am extremely pleased to announce that we have achieved each of these outcomes in the development work that concluded on October 31.  More specifically, the project has resulted in:

1. The creation of a new Appraisal and Arrangement tab in Archivematica that will permit users to characterize, review, arrange, and describe digital archives with such features as:

  • Browsing the folder hierarchies of transfers in the "Backlog" pane.

  • Identifying file format distributions (in both tables and pie charts) and sensitive personal information (Social Security and Credit Card numbers) in the "Analysis" pane.

  • Displaying items in a "File List" pane, with contents updated based upon selections in the Backlog and Analysis panes.

  • Previewing content (using available web browser plugins) within the Analysis pane (with the ability to download and locally render other file types).

  • Tagging content (to aid in archival description, the identification of sensitive information, deaccession decisions, etc.) with the added ability to facet by tags in both the Backlog and File List panes.

Archivematica Appraisal and Arrangement Tab
What's in your transfer?


2. The integration of Archivematica and ArchivesSpace, so that users can:

  • Review, create, and edit archival description from ArchivesSpace directly within Archivematica (with information being written back to ASpace via its API) without having to switch between applications/browser windows.

  • Drag and drop content from the Backlog pane onto archival description in the ASpace pane, thereby associating data with metadata (and also establishing a Submission Information Package ready to undergo Archivematica's Ingest procedures).
  • Elect to use the ArchivesSpace functionality (or simply arrange content into Submission Information Packages without employing ASpace) whether or not they have a DSpace repository.
Associating digital content with archival description
Hey, you got your digital content in my archival description!

3. The integration of Archivematica and DSpace so that:

  • Users select a DSpace collection from available Storage Service locations during the 'Store AIP' microservice.

  • The Archivematica Storage Service splits the AIP into two archive files (one for the digital content, which will be publicly accessible by default, and the other for administrative metadata and log files, which will be restricted from public access by default) and automatically deposits them as a new item to the selected DSpace collection.

  • Upon successful deposit, a new digital object record (with the unique DSpace handle URL for the item) will be created in ASpace and associated with the appropriate archival object.
Content in DSpace
My repository has a first name, it's D-S-P-A-C-E...


4. Documentation related to the use of the Appraisal tab.

Max Eckard has produced a manual for use by the Bentley's archivists and student employees and he's looking to contribute to the documentation of the Appraisal tab in the Archivematica version 1.6 user manual.

Bentley Historical Library digital processing manual
It's just a jump to the left, and then a step to the right...

Because a (moving) image is worth a thousand words, I invite you to feast your eyes (and ears) on this rather brisk demo performed by Max:



Next Steps

As mentioned above, with the grant's conclusion we're moving forward with getting these new features implemented in a production environment.  This work is going to proceed on several fronts:

1. Configuring Archivematica to work with our local, highly customized version of DSpace ("Deep Blue").  

The developers at Artefactual Systems worked with an out-of-the-box copy of DSpace (version 5.5); UM's instance (around since 2006) has had a fair number of bells and whistles added to it over the years.  As a result, Max is spending a lot of quality time on Slack with colleagues in Michigan's Library Information Technology group.

2. Customizing metadata fields to be used in Deep Blue to accommodate our qualified Dublin Core (e.g., dc.contributor.author as opposed to dc.creator).


3. Establishing workflows to streamline the deposit of restricted content to the repository.

While our default workflow involves content that will be publicly accessible via DSpace, the Bentley also encounters a decent amount of material that must be restricted from the general public (due to sensitive information, regulations such as FERPA or HIPAA, donor requests, etc.) or that can only be accessed in our reading room (due to copyright issues).  We've established some semi-automated strategies for dealing with these materials and will look at trying to streamline this process.

4.  Identifying and addressing bugs in advance of going live (and the release of Archivematica 1.6).

As we've been testing the Appraisal tab, we've reported a number of bugs to Artefactual Systems and also identified some enhancements related to local practice that we will contract with Artefactual Systems to address independently of our grant project.

5.  Training additional staff so that all of our processing archivists and graduate students are using Archivematica to arrange and describe digital archives in addition to their work with physical and analog materials.


Thank you!!!

Finally, we'd like to thank all the following organizations and individuals for their steadfast support and ample contributions to this project!
  • The Andrew W. Mellon Foundation
    • Donald J. Waters, Senior Program Officer
    • Kristen C. Ratanatharathorn, Senior Program Associate
  • The Bentley Historical Library
    • Terrence J. McDonald, Director
    • Nancy Bartlett, Associate Director
    • Angela Clark, Business Administrator
    • Kellie Carpenter, Administrative Assistant
  • The University of Michigan Library
    • John Weise, Associate Director of Library IT and Head, Digital Library Platform & Services
    • Aaron Elkiss, Systems Programmer/Analyst Senior
    • Jose Blanco, Applications Programmer/Analyst Senior
  • Everyone at Artefactual Systems (especially Evelyn, Justin, Sarah, Nick, Holly, Dan, and Radda as well as Misty and Courtney)
  • The readers of this blog and everyone who reached out to us through comments, emails, tweets, and professional meetings. Thank you!!!!  Your questions, comments, and overall interest in the project were profoundly valuable!

We plan to continue blogging about our engagement with innovative archival practice and technology—so please continue to stop by to see what's new here on Beal Avenue.  Until next time, keep on keeping on!

Tuesday, October 18, 2016

Customizing Archivematica's Format Migration Strategies with the Format Policy Registry (FPR)

Over the past couple weeks we've been exploring the ways in which our current normalization strategies (to read them for yourself, see our Format Conversion Strategies for Long-Term Preservation) compare to those in Archivematica. Below you'll find a brief introduction to Archivematica's Format Policy Registry (FPR), an overview of the process we went through to compare our format policies to Archivematica's and a couple of approaches we've taken to reconcile the differences between the two.

Hope you enjoy it!

Format Policy Registry (FPR)

Located in the "Preservation planning" tab, the Archivematica Format Policy Registry (FPR) is a database which allows users to define format policies for handling file formats.

Formats 


In the FPR, "a 'format' is a record representing one or more related format versions, which are records representing a specific file format" (Format Policy Registry [FPR] documentation). As you can see from the example above, the "Graphics Interchange Format" format is made up of 3 specific versions, "1987a," "1989a," and "Generic gif."

Formats themselves are described this way:
  • Description: Text describing the format, like a name.
  • Version: The version number for that specific format.
  • PRONOM ID: The specific format version’s unique identifier in PRONOM, the UK National Archives’s format registry.
  • Access format? and Preservation format?: This is where you indicate whether something is suitable for access or preservation purposes, or both or neither.
Formats also have UUIDs, are enabled or disabled, and have a number of associated actions (which we'll talk about later). They also have a group, "a convenient grouping of related file formats which share common properties" (ibid.), e.g., "Video." All of this is customizable.

Format Policies

In Archivematica, format policies act on formats. Format policies are made up of:
  • Tools: Tools are things like 7Zip, ImageMagick's convert command, ffmpeg, FFprobe, FITS, Ghostscript, Tesseract, MediaInfo, etc., which come packaged with Archivematica.
  • Commands: These are actions that you can take with a tool, e.g., "Transcoding to jpg with convert" or "Transcoding to mp3 with ffmpeg." Commands can be used in one of the following ways:
    • Identification: The process of trying to identify and report the specific file format and version of a digital file.
    • Characterization: The process of collecting information (especially technical information) about a digital file.
    • Normalization: Migrating/transcoding a digital file from an original format to a new file format (for access or preservation purposes).
    • Extraction: The process of extracting digital files from a package format such as ZIP files or disk images.
    • Transcription: The process of performing Optical Character Recognition (OCR) on images of textual material.
    • Verification: The process of validating a digital file produced by another command. Right now these are pretty simple, e.g., check that it isn't 0 bytes.
  • Rules: This is where you put it all together and apply a specific command to a specific format, saying something like: "Use the command 'transcoding to tif with convert' on the 'Windows Bitmap' format for 'Preservation' purposes." When browsing the FPR you can actually see how well these policies are working out. In our case, this particular policy has been successful for 2 out of 2 digital files we attempted it on.

The first time a new Archivematica installation is set up, it will register the Archivematica install with the FPR server [1], and pull down the current set of format policies. FPR rules can be updated at any time from within the Preservation Planning tab in Archivematica (and these changes will persist through future upgrades). You also have the option of refreshing your version with the centralized Archivematica FPR server version, if you so choose.

Customizing Archivematica's Format Migration Strategies

What follows is our initial foray into customizing Archivematica's format migration strategies. For a more detailed look at this as well as customizing other aspects of the FPR, you should definitely check out the documentation.

What We Do Now

For some context, we've been normalizing files for quite some time. Because we must contend with thousands of potential file formats, a number of years ago we adopted a three-tier approach to facilitate the preservation and conversion of digital content:
  • Tier 1: Materials produced in sustainable formats will be maintained in their original version.
  • Tier 2: Common “at-risk” formats will be converted to preservation-quality file types to retain important features and functionalities.
  • Tier 3: All other content will receive basic bit-level preservation.

These, by the way, are being incorporated into a more comprehensive Digital Preservation Policy which we hope to share with others in the near future...

Comparing Our Format Migration Strategies to Archivematica's

We decided to make some customizations to Archivematica's FPR because some of our existing policies didn't quite match up with Archivematica's. We discovered this by doing an initial comparison of the FPR with our existing Format Conversion Strategies for Long-Term Preservation.

For a detailed list of all of our findings, please see this spreadsheet. Basically, however, here's how things broke down for the 62 formats in Tiers 1 and 2 that I examined in depth:
  • Formats we recognized as preservation formats, and are an Archivematica preservation format.

Examples: Microsoft Office Open XML formats, OpenDocument formats, TXT, CSV and XML files, WAV files, PNG and JPEG2000 files, etc.

  • Formats we recognized as preservation formats, but aren't an Archivematica preservation format. These have a normalization pathway.

Examples: AIFF and MP3 files, and also lots of video: AVI, MOV, MP4.

  • Formats we recognized as preservation formats, but aren't an Archivematica preservation format. These have no normalization pathway.

These were the most varied, including files belonging to the PDF, Word Processing, Text, Audio, Video, Image, Email and Database groups. Examples: PDF/A files, RTF and TSV files, FLAC and OGG files, TIFF files and SIARD files.

  • Formats we didn't recognize as preservation formats, but are an Archivematica preservation format.

These were mostly older Microsoft Office formats. Examples: DOC, PPT and XLS files.

These, by the way, are our most common Tier 2 formats based on an analysis of our already processed digital archives I did for Code4Lib Midwest this year:

As you can see, all but one of the top five Tier 2 formats is one of those older Microsoft Office formats. What can I say? We get a lot of this kind of record!

  • Formats we didn't recognize as preservation formats, and aren't an Archivematica preservation format. For these, Archivematica's normalization pathway is the same as ours.

Lots of raster images here. Examples: BMP, PCT and TGA files.

  • Formats we didn't recognize as preservation formats, and aren't an Archivematica preservation format. For these, Archivematica's normalization pathway is not the same as ours.

These all stemmed from a difference in preferred preservation target for normalized video formats. We typically converted these to MP4 files with H.264 encoding, while Archivematica prefers the MKV format. Examples: SWF, FLV and WMV files.

  • Formats we didn't recognize as preservation formats, and aren't an Archivematica preservation format. For these, Archivematica does not even have a normalization pathway.

Essentially, these were files that we had a normalization pathway for, but Archivematica doesn't. Examples: Real Audio files, FlashPix Bitmap and Kodak Photo CD Image files, and PostScript and Encapsulated PostScript files.

  • Finally, formats we didn't recognize as preservation formats and that Archivematica doesn't recognize at all.

Examples: EML files and other plain text email formats.

Approaches

To be honest, I was a bit surprised by just how different our local practice was from Archivematica's, considering we both look to the same authorities on this type of thing! This diversity led to a number of different approaches to customizing Archivematica's Format Migration Strategies, which I'll briefly detail here.

Do Nothing

For those formats where we and Archivematica agreed, i.e., we both considered them preservation formats, or we both considered them non-preservation formats and shared the same normalization pathway, we didn't do anything! Easy peasy lemon squeezy.

Disable a Normalization Rule, Replace a Format

This we did for formats we recognized as preservation formats that aren't an Archivematica preservation format but that do have a normalization pathway in Archivematica. Basically, we disagreed with the out-of-the-box FPR and weren't interested in having Archivematica do any normalization on these. After we checked the Library of Congress Sustainability of Digital Formats site to ensure that we weren't totally off...

...we went to the FPR and disabled the normalization rule...

...and verified that we'd done it correctly...

...then searched for the format itself...

...clicked "Replace"...

...and set the format to a Preservation format.




You can also easily verify that Archivematica got the message...
 


Replace a Format

A somewhat simpler approach, this we did when there were formats we recognized as preservation formats, but that aren't an Archivematica preservation format and have no normalization pathway. Since Archivematica didn't really have a better alternative, we stuck to our existing policies.

This was as simple as finding the appropriate format, clicking "Replace"...

...and setting it as a Preservation format.

Create a Command, Edit a Normalization Rule 

This started to get a bit more complicated. We did this for formats we didn't recognize as preservation formats, and neither did Archivematica, but Archivematica's normalization pathway is not the same as ours. Again, these all stemmed from a difference in preferred preservation target for normalized video formats.

For these, Archivematica didn't have an existing command that worked for our purposes (it did have a tool, ffmpeg, that would). We had to write a little something up (which was inspired by other Archivematica commands) [2]...
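For the curious, the heart of the new command is just an ffmpeg invocation along these lines. This is a hedged sketch rather than the exact command we registered: in the real FPR command Archivematica fills in the input and output paths with its own substitution variables, and the encoding flags we settled on may differ.

#!/bin/bash
# Sketch of a normalization command: transcode the input video to MP4 with
# H.264 video and AAC audio. Paths are placeholders; the FPR supplies the
# real input file and output location at runtime.
input="$1"
output="$2"

# -c:v libx264         : H.264 video encoding
# -pix_fmt yuv420p     : a widely compatible pixel format
# -c:a aac -b:a 192k   : AAC audio at 192 kbps
# -movflags +faststart : put the index at the front of the file
ffmpeg -i "$input" -c:v libx264 -pix_fmt yuv420p -c:a aac -b:a 192k \
  -movflags +faststart "$output.mp4"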


...create a new normalization command...

...add in the information Archivematica needs for  the new command...

...then go in and replace the rule for the appropriate format(s)...

...select the appropriate command (our new one!)...

...and finally verify that it had been changed.

Create a Normalization Rule

This we did for formats we didn't recognize as preservation formats, and neither did Archivematica, but for which Archivematica does not even have a normalization pathway (and we did). For these, we wanted to have Archivematica use our existing normalization pathway.

To create a new rule, we selected the "Create New Rule" option...

...and entered the new information (purpose, original format and command to use) for the file format for which we were creating a new policy.


Manual Normalization and Other Thoughts...

That leaves us with a couple of outstanding issues, namely, legacy Microsoft Office documents and EML and other email formats (which Archivematica doesn't recognize at all--because the tools Archivematica uses for file format identification don't recognize them or they aren't registered in PRONOM).

The "ubiquity" argument aside, we'd really love to do something about older Microsoft Office documents, especially since currently these are the most common formats that we normalize. At the moment we use LibreOffice's Document Converter to handle conversion to a more sustainable format, i.e., Microsoft Office Open XML. However, Archivematica has looked into LibreOffice with the following results:
  • LibreOffice normalization led to significant losses in formatting information.
  • LibreOffice sometimes hangs, causing any future LibreOffice jobs to fail until an administrator manually kills the service.
  • LibreOffice sometimes reports that it succeeded despite not actually succeeding, making it difficult to determine whether or not the job really succeeded.

There may also be options here for converting to PDF, at least for documents. In the interim, we're still examining our options. At the very least we can change the FPR so that these formats are not recognized as preservation formats; we'll be looking into alternative approaches and plan to report back when appropriate.
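In the meantime, for those curious what that LibreOffice conversion looks like outside of Archivematica, here's a minimal sketch using LibreOffice's headless command-line mode rather than the Document Converter wizard itself (the paths are placeholders, not our actual setup):

# Batch-convert legacy Word documents to Office Open XML (.docx) with LibreOffice
# in headless mode; swap "docx" for "xlsx" or "pptx" for spreadsheets and slides.
soffice --headless --convert-to docx --outdir /path/to/converted /path/to/backlog/*.doc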


As for the email formats, we currently use a tool called aid4mail to convert these to MBOX files. This is a proprietary program, though, and only works on Windows, so we're looking into ways that we might manually normalize these files outside Archivematica (and associate different versions of files with one another inside Archivematica). This can be done, but we're looking into ways of doing it efficiently in batch; again, we plan to report back when we've got something figured out.

To the FPR and Beyond!

Alright! That's about it for customizing the FPR; I think we've covered (in at least a basic way) all the different angles (with the exception, perhaps, of introducing a new tool to Archivematica!).

By the way, one of the most exciting things about the FPR is that since ours (and yours!) is actually registered with the Archivematica server, one day we all might be able to share this information in a more efficient fashion!

Have you customized the FPR? Are you too excited about the possibility of sharing FPR format policies via linked data? Let us know in the comments!

[1] Format policies are maintained by Artefactual, Inc., who provide a freely-available FPR server hosted at fpr.archivematica.org. This server stores structured information about normalization format policies for preservation and access.
[2] This could also have been written in Python.

Wednesday, October 5, 2016

Transferring Legacy Content into Archivematica's Backlog: `automation-tools` to the Rescue!

First off, these are exciting times at the Bentley. In just about every sense of the word, we are live with ArchivesSpace and, soon, real soon, we'll be live with Archivematica. In fact, I believe at least one of our processors may be in denial about the fact that he'll probably never use AutoPro again:


Remember, Devon, lament is the antidote to denial.

Alright, back to our regularly scheduled program. For us, part of "going live" with Archivematica is ensuring that our current backlog of unprocessed, digital accessions makes its way safely and efficiently into Archivematica's backlog... so that those accessions can be appraised and arranged sometime soon with some new functionality that will be out in version 1.6! Today's post details some of the analysis we did on our current backlog, some of the "pre-transfer" manipulation we've done to the digital accessions in it, the Archivematica environment we're using (at least for now) and the automation-tools setup we're using to automate the process as well as some of the errors we've encountered so far.

Our Current Backlog

As tempted as we were to just dive in, we knew we needed to do a bit of analysis on our current backlog, just to know what we were up against. Here's some basic stats we gathered that informed our subsequent decision-making processes.

Overall:
  • Total deposits: 209
  • Total size: 3.6 TB
  • Total number of files: 5,187,920 

Averages:
  • Average size: 17.2 GB
  • Average number of files: 5683.8

Top ten deposits by size:
  1. 806.6 GB (21.6 %)
  2. 797.1 GB (21.4 %)
  3. 705.6 GB (18.9 %)
  4. 180.4 GB (4.8 %)
  5. 154.1 GB (4.1 %)
  6. 133.2 GB (3.6%)
  7. 128.2 GB (3.4%)
  8. 126.5 GB (3.4 %)
  9. 121.1 GB (3.2 %)
  10. 67.8 GB (1.8%)

For reference, only 6.7% of our deposits make up 99% of our total size; that means we're talking about a pretty long tail here.

Top ten deposits by number of files:
  1. 4,182,060
  2. 208,577
  3. 178,265
  4. 155,198
  5. 154,187
  6. 89,431
  7. 49,201
  8. 17,883
  9. 17,786
  10. 17,541

For reference, only 8.1% of our deposits have more than 5,000 files, so another really long tail.

Ensuring that Archivematica Could Handle this Backlog

Based on a couple of relevant conversations on the Archivematica listserv, we knew that Archivematica, like ArchivesSpace and Pokemon Go and just about every other piece of software out there, can have some scalability and performance issues when you start throwing a lot at it.

There's a couple in particular that we returned to again and again:
  • Archivematica Data Flow: This one gave us Justin Simpson's heuristic for processing space, and taught us that we'd need to consider the largest transfer we'd want to process at once and allocate 3 to 4 times as much space in our processing location (i.e., /var/archivematica/sharedDirectory). Note, this is not the same as storage space; this is just the amount you'll need as Archivematica runs through its various micro-services.
  • Elasticsearch indexing AIPs with many files: These basically just taught us to keep an eye out on Elasticsearch. Indexing makes about "a zillion" queries (quoting our system administrator here), and with every query Archivematica has to look through all the files in a given transfer, once per file (still quoting). So it can take a while, especially for transfers with lots of files. In fact, this is our most persistent source of frustration during testing, this round and previous rounds.
  • Archivematica scalability on very large object: This one taught us that Archivematica can handle really large files (our biggest, if I had to guess, is probably a ~65 GB video file) and that number of files is really what we should be looking out for.
  • Also, we learned that transfers of 80,000 or more files just won't work, no matter what setup you've got. This we learned from an internal e-mail but apparently it's out there on the list somewhere.

In the end, there are basically two ways you can ensure that Archivematica doesn't run into scalability and performance issues. You can: 1) reduce the size or volume of your transfers; or 2) up the power of your Archivematica environment. After some back and forth with our system administrator, we ended up doing a little of both.

Pre-Transfer Manipulation

We've settled on somewhat arbitrary parameters for our transfers: no more than 50-60 GB or 50,000 files. For the handful of transfers we encounter that are over that amount (I'm looking at you, Michigan governors and congresspeople), I've manually broken them up and given them a sequential suffix (i.e., _001, _002, _003, etc.) that will become the "name" of the transfer in Archivematica. We'll keep them together with the same accession by ensuring that the accession number in Archivematica is the same. We've even worked this into our procedures going forward.

Archivematica Environment

Here's what we're working with so far. The VM has:
  • 16GB of RAM
  • 4 virtual CPUs
  • ~350GB of disk for processing

Our system administrator also set Elasticsearch's ES_HEAP_SIZE to 2G from the default of 640m. Finally, he made some tweaks to mysql, mostly guesswork based on mysqltuner:
  • key_buffer_size to 512M from the default of 16M
  • query_cache_size to 256M from default of 16M
  • query_cache_limit to 8M from default of 1M
  • innodb_buffer_pool_size to 512M from default of 128M
  • max_heap_table_size to 256M and tmp_table_size to 256M

Finally, he commented out the print statement in index_transfer_files in elasticSearchFunctions.py and added an index on currentLocation (this particular change will make it into Archivematica proper as you can see for yourself on this pull request). You can see the explanation on the Archivematica Tech List.
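If you want to try similar tuning, the settings above boil down to a couple of config snippets along these lines (a hedged sketch; the file locations depend on your distro and your Elasticsearch/MySQL versions, so treat the paths as assumptions):

# Elasticsearch heap size, e.g. in /etc/default/elasticsearch (path varies by distro)
ES_HEAP_SIZE=2g

# MySQL tuning, e.g. in a file under /etc/mysql/conf.d/ (path is an assumption)
[mysqld]
key_buffer_size         = 512M
query_cache_size        = 256M
query_cache_limit       = 8M
innodb_buffer_pool_size = 512M
max_heap_table_size     = 256M
tmp_table_size          = 256M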

Aaron is awesome! These tweaks took our "Index transfer contents" step from 5 days on a 50,000 file transfer (when we got tired of waiting and quit) to about 2 hours. What!!!!?!?!?!!? Did I mention that Aaron is awesome?

Enter automation-tools


automation-tools are a set of Python scripts that are designed to automate the processing of transfers in an Archivematica pipeline. They are what's preventing me from manually starting and checking on all 209 (now 249 since I broke some of them up) transfers.

Installation


You won't need to install automation-tools separately if you're using the Ansible scripts, but if not, you can install them using the instructions in the README. We ended up doing it both ways here on different versions of Archivematica.

Setup and Configuration/Customization

There are a couple of different ways to customize the automation-tools. One is by specifying a default processing configuration: use the Administration tab in Archivematica to set the default processing configuration, then copy the processing configuration file from /var/archivematica/sharedDirectory/sharedMicroServiceTasksConfigs/processingMCPConfigs/defaultProcessingMCP.xml to the transfers/ directory of your automation-tools installation location. Here's our default processing MCP if you're interested. That way, all transfers will get that processing configuration, even if you've changed the one in the dashboard to something different.
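That second step is just a copy, something like the following (the destination assumes automation-tools lives at /usr/lib/archivematica/automation-tools, as it does in the cron entries further down, so adjust for your own install):

# Copy the dashboard's saved processing configuration into automation-tools
# so every automated transfer picks it up.
cp /var/archivematica/sharedDirectory/sharedMicroServiceTasksConfigs/processingMCPConfigs/defaultProcessingMCP.xml \
   /usr/lib/archivematica/automation-tools/transfers/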

Another way to customize is by creating "hooks," which tweak the behavior of automation-tools. There are a couple of options here, but one we've found particularly useful is the get-accession-number hook. This automatically fills in the accession number for a given transfer. Our version of this script takes the folder name (the only variable you have to work with, which for us is currently based on a local "Digital Deposit ID" convention), looks it up in a simple dictionary of Digital Deposit IDs and Accession numbers exported from our local FileMaker Pro database, and returns the match. It even accounts for transfers that have been split up and those that (gasp!) don't have accession numbers for one reason or another. All this serves to ensure that when we do a search through the backlog, we get every transfer associated with a particular accession, even if there's more than one!

Note: It's easy to miss, but you must name this script "get-accession-number" (without an extension). Don't use "get-accession-id" (despite what the instructions say), "get-accession-number.py" or create a folder called "get-accession-number" with a script inside of it. Yes, we made all of those mistakes...
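To make that concrete, here's a stripped-down sketch of what such a hook can look like. The lookup file, naming convention and calling contract (transfer folder name in, accession number printed to standard output) are assumptions for illustration, not our production script, so check the automation-tools README for the real details.

#!/usr/bin/env python
# Hedged sketch of a get-accession-number hook.
# Assumed contract: called with the transfer folder name as its argument and
# prints the matching accession number to standard output.
import csv
import re
import sys

# Digital Deposit ID -> accession number, exported from our FileMaker Pro
# database. File name and layout here are illustrative.
LOOKUP_FILE = "deposit_accessions.csv"

def get_accession_number(folder_name):
    # Transfers we split up carry a _001/_002/... suffix; strip it so every
    # piece of a split deposit resolves to the same accession number.
    deposit_id = re.sub(r"_\d{3}$", "", folder_name)
    with open(LOOKUP_FILE) as f:
        lookup = {row["deposit_id"]: row["accession_number"]
                  for row in csv.DictReader(f)}
    # Some deposits (gasp!) have no accession number; return an empty string.
    return lookup.get(deposit_id, "")

if __name__ == "__main__":
    print(get_accession_number(sys.argv[1]))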

There's also an option for pre-transfer hooks. We haven't explored these yet, but I can see using this to, for example, ensure that an Accession record with a particular ID and/or extent is already in ArchivesSpace (or, if not, even creating one on the fly when a transfer to Archivematica is made).

Running

You can test the automation-tools by running them with the following command:
/usr/share/python/automation-tools/bin/python -m transfers.transfer --user <user> --api-key <apikey> --ss-user <user> --ss-api-key <apikey> --transfer-source <transfer_source_uuid> --config-file <config_file>

Here's what all this means:
  • /usr/share/python/automation-tools/bin/python
The first part of that command tells the automation-tools to use a particular version of Python (the one with particular libraries that you installed earlier). 

  •  transfers.transfer
You then tell it which script to run (for us, for now, the transfer script--you can also run ingest scripts). 

  •  --user <user> --api-key <apikey> --ss-user <user> --ss-api-key <apikey>
Next you give it your credentials for both Archivematica and the Storage Service, which you can find by navigating to your user profile in Archivematica and the Storage Service, respectively. 

  • --transfer-source <transfer_source_uuid>
Transfer source tells the automation-tools where to look for transfers (you can find the UUID in the Storage Service).

  • --config-file <config_file>
The configuration file tells automation-tools where to store the database, log files and a "-pid.lck" file which keeps the automation-tools from starting a new transfer when one is already going. By the way, every once in a while this seems to stick around when it shouldn't, so you have to go in and delete it.

We also add the following options:
  • --transfer-type 'unzipped bag'
This simply lets the automation-tools know what kind of transfer you're doing. We have to specify this since we're not using the default of 'standard'.

  • --hide
This hides the transfer from the dashboard when we're through. As we learned from this conversation on Managing the dashboard (transfer and ingest), you've got to keep this cleaned up or else you'll run into some browser timeout issues.

  • --verbose
This increases the debugging output because, well, why not?
You could also write a shell script (don't forget to chmod +x transfer-script.sh to make it executable first!) and, when you're ready to go, set up a cron job (or in our case, three) to automatically run different versions of the script (that point to different source locations or have different parameters) at given intervals. Here's what our crontab entry looks like:

0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/lib/archivematica/automation-tools/transfer-script_legacy.sh
1,6,11,16,21,26,31,36,41,46,51,56 * * * * /usr/lib/archivematica/automation-tools/transfer-script_bhl-digitalarchive.sh
2,7,12,17,22,27,32,37,42,47,52,57 * * * * /usr/lib/archivematica/automation-tools/transfer-script_bhl-dropbox.sh


This runs the three automation-tools scripts we have (one for legacy transfers, two for current transfers from two different source locations to which we'll move transfers from separate staging locations) at regular intervals (every five minutes), but ensures that they never run at the same time. Be prepared for about "a zillion" emails (for real) from some guy named "Cron Daemon" :).
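Each of those scripts is just a thin wrapper around the transfer command shown earlier; a minimal sketch, with the credentials, UUID and config path left as placeholders:

#!/bin/bash
# transfer-script_legacy.sh (sketch): kick off automation-tools for one
# transfer source location. Angle-bracketed values are placeholders.
/usr/share/python/automation-tools/bin/python -m transfers.transfer \
  --user <user> --api-key <apikey> \
  --ss-user <user> --ss-api-key <apikey> \
  --transfer-source <transfer_source_uuid> \
  --config-file <config_file> \
  --transfer-type 'unzipped bag' \
  --hide \
  --verbose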

Once you have this going, sit back, relax, party like a partyparrot and never look at the Archivematica transfer dashboard again!


Errors Encountered Thus Far

That is, until you hit an error. And we've hit a few. Here's what our inboxes looked like after the first weekend of letting automation-tools go:




This, I'll admit, looks like a lot of failures, except when you remember that many more transfers went through without issue. If we can count what's gone through so far as a sample, we have about an 80% success rate. Excluding the errors we got due to some permissions issues we eventually worked out, here's a couple of common errors we got and how we're handling them:
  • Approve transfer micro-service, Verify bag, and restructure for compliance job: This is the most common. Generally, this has been because some thumb.db file has snuck in or out, and just requires a simple bag update to get things going again.



  • Approve transfer micro-service, Assign checksums and file sizes to objects job: The only commonality I could see in these is that many of them look like they have a weird character in the file name or path (e.g., "black body 1025.mov" or "FY2014/archival files/thach_diane_2014/Fireside_Preview_Panel7.svg") or look like a hidden file that starts with a period (like "._ 2002 - Mary Sue Coleman"). Actually, we're still looking into this error--let us know if you've got any ideas!
  • Scan for viruses micro-service, Scan for viruses job: Yep, this actually happened. It's an example of a time where our local virus scanner (we use System Center Endpoint Protection) didn't catch something that Archivematica's (ClamAV) did. Because of the discrepancy we thought it might be a false positive. In any case, this happened to be something that was mass produced (actually from the development branch of the university... hmm...), so I simply got another copy and got rid of the old one.
  • Create SIP from Transfer micro-service, Index transfer contents job: This one never actually failed, it just went very slowly. See the tweaks we made above.

Conclusion

And so begins our Archivematica journey, one transfer at a time! Recently we got the full demo, pre-SIP to AIP to DIP from Artefactual... we're super pumped! In a few short weeks, we'll be live "for real" with Archivematica here. Stay tuned for more!

Friday, September 16, 2016

On Square Pegs, Round Holes, PREMIS Rights Statements and Apollo 13

As mentioned in a previous post on PREMIS and PREMIS Rights Statements, we've been exploring ways that we can create rights statements as we're processing SIPs in Archivematica and then use those rights statements to set access profiles for the AIPs in our DSpace repository.

At that point, our thinking was mostly theoretical. Since then, we've had some time to think about it, to confer with our MLibrary colleagues as well as those at the Rockefeller Archive Center and even reflect on Ed Pinsent's comments on the last post (thanks, everyone!). In this post, I'd like to give an update on how we plan (yes, still just a plan--things could change!) to actually do it. Before I dive in, though, I should remind our readers that what we're proposing here is a bit like trying, as the expression goes, to fit a "square peg in a round hole." Here's a quote from the PREMIS Data Dictionary for Preservation Metadata:
PREMIS primarily defines characteristics of Rights and permissions concerned with preservation activities, not those associated with access and/or distribution.

Yikes. "Not those associated with access and/or distribution." hrm...

The Access Profiles

Let's start at the end. In DeepBlue, our DSpace repository, we have some amount of control over both a digital object--or, to use DSpace-speak, bitstream(s)--and its associated metadata--or item. We can associate each of them (independently of one another) with one of four (or actually as many as we care to create) of what are called groups. In practice, we apply a handful of common combinations of item and bitstream(s) groups when we deposit AIPs in our DSpace repository:
  • Open: Both items and bitstreams are open to be viewed/downloaded by anyone in the whole world.
  • Bentley Reading Room users: While items can be viewed by anyone, downloading of bitstreams must be done from within the Bentley's IP range. This can be from a wired or wireless connection.
  • University of Michigan users: Items can be viewed by anyone, but only University of Michigan affiliates may download bitstreams.
  • Totally restricted/embargoed items: Nobody (except Bentley archivists, who may also fulfill reference requests) can view or download anything, item or bitstream(s). Typically, these types of things are embargoed until a particular date (based on local policies), at which point in time both item and bitstream(s) will become open.[1]
  • Audio/visual items with copyright or other types of concerns: Items can be viewed by anyone, but only Bentley archivists can download bitstreams. Heretofore this profile consists mostly of audio/visual material that is preserved in DSpace but made available for streaming (not downloading) in the Bentley Digital Media Library.

A Quick Refresher on Act[ion]s in PREMIS Rights Statements

As a reminder, PREMIS Rights Statements are made up of one basis (the raison d'ĂŞtre of the rights statement, something like copyright or policy) and one or more actions associated with that basis (these are very specific actions the repository is or isn't allowed to do). Since the basis won't have an impact on its associated action, I won't go into much detail about them here.

Actions come from a controlled vocabulary, made up of things like:
  • replicate: make an exact copy
  • migration: make a copy identical in content in a different file format
  • modify: make a version different in content
  • use: read without copying or modifying
  • disseminate: create a copy or version for use outside of the preservation repository
  • delete: remove from the repository

As you can see, these have a very "digital preservation" feel (in the most narrow sense of the word[2]). Hence the data dictionary's warning above.

Actions may be allowed in all cases or, of course, they may have restrictions. These express situations where, for instance, dissemination is permitted, but only to a specific type of person (say, one that's affiliated with your institution), or, taken to the extreme, that dissemination is not permitted, period. At least in Archivematica's case, you've got three choices to express such restrictions: allow, disallow or conditional. This may sound like it covers a lot, but as you'll see, we had to get a little creative, as we end up using "conditional" to describe a number of different conditions.

Other than that, there are some begin and end dates associated with that action, and a note containing a textual description of the right granted if additional description is needed.

Mapping PREMIS Rights Statements in Archivematica to DSpace Groups

SIP rights template--second page



Now on to mapping the PREMIS Rights Statements implementation in Archivematica to the groups in DSpace. There's a couple different ways we might have approached this.

One way might have been to try to use an Act to tell the repository exactly what it was allowed to do with both the item and the bitstream for a particular AIP. While this approach gave us the granularity we'd need for machine-actionable PREMIS Rights Statements, we worried that it would be overly cumbersome for our human processors that would be, for the most part, manually adding them to SIPs and keying in the data.

Another way might have been to use a local controlled vocabulary for the Act field, something like "disseminate-bentley", "disseminate-umich", etc. However, associating a particular target with an action seemed, in the words of one of our DSpace gurus, "somewhat contrary to the spirit of the allowed actions" (see this sample controlled vocabulary for the 'act' element, some of which I listed above). You'll also notice if you click that link that "disseminate-bentley", "disseminate-umich", etc. are not on that list, and for good reason! We even thought briefly about using the Restriction field to specify the target audience before realizing that it too has a controlled vocabulary (one that's actually enforced by Archivematica and then used later on in some logic).

In the end, we settled on using the Note field to specify audience. Now, we know this isn't the most elegant solution--in general, the intention of the notes is specifically to not be machine-actionable, but we felt that since this PREMIS Rights Statement would ultimately be preserved in the AIP (in the METS!), and since there's a chance someone might run across it outside of our repository environment, that this was the way to go.

So here's our plan, at least an overview:

DeepBlue Groups                                     Archivematica PREMIS Rights Statements

Item             Bitstream                          Act           Restriction   Restriction note
---------------  ---------------------------------  ------------  ------------  ------------------------------
Anonymous        Anonymous                          None          None          None
Anonymous        Reading Room only                  disseminate   Conditional   Reading Room
Anonymous        University of Michigan only        disseminate   Conditional   University of Michigan
Anonymous        Archivists only                    disseminate   Conditional   BDML
Archivists only  Archivists only                    disseminate   Disallow      Executive records (ER),
                                                                                Personnel records (PR),
                                                                                Student records (SR),
                                                                                Patient/client records (CR)

A couple of notes here:
  • We will not use PREMIS Rights Statements (at least those that apply to access/distribution) for AIPs that don't have restrictions.
  • When we do use PREMIS Rights Statements, they will be as minimal as we can make them with the intention that they will only be used by machines, not humans. Human readable rights statements will be recorded elsewhere, like ArchivesSpace Conditions Governing Access and Use notes. 
  • Most of the time, the End Date field will be OPEN, except when a Bentley policy is involved (ER, PR, SR and CR above). In those cases, an end date will let the repository know when a particular restriction expires.

Once an AIP with some sort of restriction is ready to go to DeepBlue, we'll park it somewhere temporarily[3], parse the METS file in the AIP, determine (based on the rights statements) the item and bitstream permissions, convert it to the DSpace Simple Archive Format and upload in batch to DeepBlue from there. It's sounding like the identifier for the Digital Object in ArchivesSpace will be in the AIP, so we're pretty confident we'll also be able to add the Handle back to ArchivesSpace fairly easily as well.
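To make that a little more concrete, here's a rough sketch of the kind of logic we have in mind for reading the restriction note out of the AIP's METS and turning it into an item/bitstream group pair per the table above. This is illustrative rather than working production code: the PREMIS namespace, the element layout of Archivematica's METS and the group names would all need to be verified against a real AIP.

# Hedged sketch: map PREMIS rightsGranted elements in an AIP METS file to the
# DeepBlue item/bitstream groups from the table above.
from lxml import etree

PREMIS = "info:lc/xmlns/premis-v2"  # assumed; confirm against an actual AIP METS
NS = {"premis": PREMIS}

# Restriction note -> (item group, bitstream group), per our mapping table.
NOTE_TO_GROUPS = {
    "Reading Room": ("Anonymous", "Bentley Reading Room users"),
    "University of Michigan": ("Anonymous", "University of Michigan users"),
    "BDML": ("Anonymous", "Archivists only"),
}

def groups_for_aip(mets_path):
    """Return the (item, bitstream) groups implied by the AIP's rights statements."""
    tree = etree.parse(mets_path)
    for grant in tree.iter("{%s}rightsGranted" % PREMIS):
        act = grant.findtext("premis:act", namespaces=NS)
        restriction = grant.findtext("premis:restriction", namespaces=NS)
        note = grant.findtext("premis:rightsGrantedNote", namespaces=NS)
        if act != "disseminate":
            continue
        if restriction == "Disallow":
            # Embargoed material: archivists only, for both item and bitstream.
            return ("Archivists only", "Archivists only")
        if restriction == "Conditional" and note in NOTE_TO_GROUPS:
            return NOTE_TO_GROUPS[note]
    # No applicable rights statement means the AIP is open.
    return ("Anonymous", "Anonymous")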

We also think (hope!) that this approach, as long as we're consistent, would allow us to change our minds relatively easily in the future, say, if we decided after all that a more granular approach was the way to go.

But Wait! "disseminate" is Hard to Spell!

It occurred to us that in order for this approach to work, our processors can never make typos. We've all been there... this is a pretty unrealistic expectation.

For the time being, we're planning to use Greasemonkey (in Firefox) and Tampermonkey (in Chrome) to help us out with this particular problem. These are browser extensions that customize the way a web page displays or behaves using small bits of JavaScript.

We've written a fairly basic script (you can see our draft here), that looks for URL patterns that match the Add Act pages in Archivematica (as you can see in that script, http://sandbox.archivematica.org/transfer/*/rights/grants/*/ and http://sandbox.archivematica.org/ingest/*/rights/grants/*/). When it finds one, it adds an additional dropdown, like so...

It even has a nice logo!


When an option is chosen (Reading Room was chosen above), it automatically fills out the rest of the form, just like we need it. When a Bentley policy is involved (that requires an end date), it asks a processor for a creation or accession date (still working on a nice datepicker option for this), does some math, and calculates the appropriate end date. It's not the most elegant solution but we think it works for now!

Conclusion


In the end, it's perhaps a little clearer as to why PREMIS wasn't really meant for this kind of thing. Still, maybe square pegs sometimes do fit into round holes...


Seriously, though, let us know what you think!


[1] Although these types of things are not viewable, downloadable or even searchable in DSpace, typically we still provide a link to them in the collections finding aid. 
[2] Philosophically, I'd argue that access and distribution is a fundamental part of digital preservation... maybe the most fundamental part.
[3] At the end of the grant, AIPs without restrictions will be automatically uploaded to DSpace and recorded in ArchivesSpace without any more human intervention!

Tuesday, September 6, 2016

This One Time, At ArchivematiCamp...

While it's been a bit over a week since the inaugural ArchivematiCamp (or, as my colleagues Max and Dallas prefer, "Archivematica Camp") was held here in Ann Arbor, we're still basking in the afterglow... 36 campers and 5 counselors braved the rain and mosquitoes to gather at the University of Michigan's School of Information for two and a half days of discussions on microservices, metadata, and the mechanics of our favorite digital preservation system.  The camp's full agenda will give you some idea of the variety of topics covered in the 'Curator' and 'Technologist' streams—or maybe you were following on Twitter:



While I would be hard-pressed to summarize all the events and discussions, I did want to talk a little bit about Dallas and Max's demonstration of the new Appraisal Tab functionality we've developed as part of our grant project (and which is slated for release in version 1.6 of Archivematica).  In the Q and A period following the demo, counselors Ben Fino-Radin and Kari Smith helped kickstart a conversation about how the functionality in the Appraisal Tab could be complemented and supplemented by additional external tools/platforms.

As one example, Ben noted that his work with audiovisual materials requires advanced technical metadata extraction and codec characterization that has not always been available in Archivematica.  (As I understand from my notes, the MediaTrace report produced through a collaboration between MoMA and MediaArea is now available in Archivematica.)

Kari brought up the possibility of integrating an email processing tool like ePADD into a workflow that also involves Archivematica.  Given the unique functionality (and awesome interface) of this platform, it doesn't really make sense to replicate it in Archivematica or to cram another full-featured external tool into the Appraisal Tab.

Instead, as we discussed in our previous post on the Archivematica Users' Group meeting at SAA, we should look at ways of establishing/facilitating 'handshakes' between platforms so that the data and any associated metadata (especially preservation or technical) can be passed along and incorporated into the Archivematica METS or maybe even acted upon by Archivematica.  For instance, if you ran bulk_extractor on a disk image in the BitCurator environment, it would be nice to reuse those scanner reports in Archivematica instead of having to run them again.

We're really excited that other members of the archives and digital preservation communities are thinking about how the work we've done with the Appraisal Tab can be adapted or extended to satisfy local needs and workflows! In the same spirit, Kari also asked if DIPs could be produced and likewise tied back to ArchivesSpace (yes, by extending the code!) and Ben (was this Fino-Radin or Goldman?  I'm leaning towards the latter...) asked about the possibility of creating ArchivesSpace event records based upon actions in Archivematica (totally feasible--just need some coding!).

We're hoping to blog a bit more about camp in some upcoming posts, so I'll wrap things up here by noting that my only regret from camp was the absence of the long-promised 'goodbye song':

Thank heavens the Internet can fix anything!



Tuesday, August 9, 2016

Archivematica Users Group @ SAA

Greetings, all! The Bentley's Mellon grant team had a busy and exciting time last week in Atlanta during the annual meeting of the Society of American Archivists (SAA).  One of the highlights was Dallas's and my demonstration of current functionality in our ArchivesSpace-Archivematica-DSpace Workflow Integration project during the Archivematica Users Group meeting (hosted by the ever-gracious Dan Gillean) on Wednesday, August 3.
We've given a lot of demos over the past year, at conferences as well as to individual institutions and groups (including the Digital Preservation Coalition), but this presentation really stood out for us.  While we always get a lot of great questions from folks, several individuals suggested new and exciting functionality that could be added to the Appraisal Tab in future releases.

Thinking Bigger (and Better!)

First, Seth Shaw, Assistant Professor of Archival Studies at Clayton State University (and developer of the Data Accessioner), pointed out that a tree map visualization would be a helpful addition to the 'Analysis Pane' in the Appraisal Tab.

As it now exists, the Analysis Pane includes a tabular report on file format distributions in a given transfer as well as pie charts that depict this range by (a) number of files per format and (b) total volume of the respective formats:

Analyze this!

A tree map would give archivists an alternative means of visualizing the relative size of directories and files and give insight to where given file types are located. This information could be very helpful in terms of comprehending directory structure, understanding the nature of content in a transfer, and identifying content that might require additional resources during ingest (such as large video files). 

It's also important to note that not all tree maps are created equal, as different instantiations have different affordances.  For instance, TreeSize Professional yields a visualization that includes labels of directories and file formats and uses color coding to show the relative depth of materials in the folder structure of a transfer, but doesn't represent individual files:

Whose size? TreeSize!

WinDirStat, on the other hand, color codes individual file formats, represents individual files in the tree map, and highlights directories or file format types based upon the user's selection from its directory tree or file format list:

The Colors, Children!

Next, Susan Malsbury, Digital Archivist at NYPL, asked about the potential of including Brunnhilde in the Analysis Pane.  For those of you who are not in the know (which very recently included me!), "Brunnhilde" is (to quote its developer, Digital Archivist Tim Walsh)
a Python-based reporting tool for born-digital files that builds on Richard Lehane's Siegfried. Brunnhilde runs Siegfried against a specified directory, loads the results into a sqlite3 database, and queries the database to generate CSV reports to aid in triage, arrangement, and description of digital archives. Reports include:
  • Sorted file format list with count
  • Sorted file format and version list with count
  • Sorted mimetype list with count
  • All files with Siegfried errors
  • All files with Siegfried warnings
  • All unidentified files
  • All duplicates (based on a Siegfried-generated md5 hash)
Walsh's tool could provide much more granular information about the contents of a transfer and when combined with visualizations it would offer additional and highly interesting ways to review and appraise digital archives.

Malsbury also introduced a question of how the Appraisal Tab's new functionality could accommodate disk images.  While Archivematica does have some support for transfers comprised of disk images, our use cases for the grant project did not specifically address this content type.  As Gillean noted, this question begs for additional cross-platform workflow integration.  Since BitCurator is designed to handle disk images, it makes sense for members of the open source digital archives community to explore how it can work in conjunction with Archivematica rather than replicate its functionality in the latter platform.  (A sidebar conversation Max Eckard and I had with Sam Meister from Educopia and the BitCurator Consortium confirmed that this is an important area of inquiry...)

Next Steps...

We're in the final stretch of our grant project and—sad as it is to say—have come to realize that all the awesome ideas we've had for the Appraisal Tab aren't going to make it into the final product.  We will, however, have achieved all the major goals and deliverables that we established at the outset:
  • Introduce functionality into Archivematica that will permit users to review, appraise, deaccession, and arrange content in a new "Appraisal and Arrangement" tab in the system dashboard.
  • Load (and create) ASpace archival object records in the Archivematica "Appraisal and Arrangement" tab and then drag and drop content onto the appropriate archival objects to define Submission Information Packages (SIPs) that will in turn be described as 'digital objects' in ASpace and deposited as discrete 'items' in DSpace.   
  • Create new archival object and digital object records in ASpace and associate the latter with DSpace handles to provide URIs/'href' values for <dao> elements in exported EADs.
All the same, we're thrilled by the realization that the Appraisal Tab as it will exist in the upcoming version 1.6 of Archivematica is really just a beginning.  By developing the Appraisal Tab and introducing basic appraisal functionality (file format characterization, sensitive data review, file preview, etc.), we've dramatically lowered the bar for other institutions that want to integrate new tools or introduce new features.  (And yes, I did borrow liberally from Dan Gillean for that last thought!)  

We're really excited to see where other institutions and developers take the Appraisal Tab. I, for one, would love to see textual analysis and named entity recognition tools like those in ePADD (or the other projects identified by Josh Schneider and Peter Chen in this great post from the SAA Electronic Records Section blog).

What features or functionality would you like to see in the Appraisal Tab?  What questions do you have about our current processes? Please reach out to us via the comments section or email.

Thanks for reading; live long and innovate!