Known for its active community and catchy slogans, Archivematica is a “web- and standards-based, open-source application which allows your institution to preserve long-term access to trustworthy, authentic and reliable digital content.” [1]
Archivematica Kick Off
Mike has already mentioned the Artefactual Systems site visit we hosted in January of this year. It was during this site visit that folks from the Library Information Technology division and Artefactual Systems, Inc. installed Archivematica and the Archivematica Storage Service. Unlike our installation of ArchivesSpace, our installation of Archivematica is currently hosted, maintained and supported (thanks, Aaron!) by the Library Information Technology division. When I started in late February, one of my initial work priorities was testing this local implementation in order to determine how Archivematica might "replace and extend" existing workflows and procedures.
Background
Note: Our proposal to the Andrew W. Mellon Foundation gave a great background to the problem that the ArchivesSpace-Archivematica-DSpace Workflow Integration project is trying to solve. That document isn't public, but much of this introduction is gleaned from that portion of the proposal which details our institutional context.
A previous post detailed the history of digital curation at the Bentley Historical Library. Relevant to our immediate conversation is Automated Processor (AutoPro), a homegrown tool that--if you couldn't guess from its name--automates digital processing from "Initial Survey" through "Deposit Content in Deep Blue" using 33 Windows CMD.EXE shell scripts that control more than 20 applications and various command line utilities.
I know from personal experience that AutoPro has been, and continues to be, an effective processing tool. In fact, it has received a number of accolades, being recognized by conference reviewers at iPRES 2012 as:
- "[A] successful implementation of various tools into a successful institutional workflow [...that] will be relevant to other implementers."
- "[A] useful breakdown of the workflow steps used to process unstructured documents for ingest into an archival repository."
- "[A] sound methodology including automated metadata generation following the EAD and PREMIS standards and creation of an audit trail."
A recent review of how to more efficiently process and deposit unstructured archival content (Microsoft Office documents, images, audio, video, &c.) in Deep Blue, however, determined that while AutoPro has been an effective processing tool, it is not an ideal solution, as:
- Component programs installed on individual workstations must undergo frequent updates.
- Windows CMD.EXE scripts have a limited capacity to handle errors and exceptions and few text-processing capabilities.
- Scalability becomes an issue, as very large files or large collections can require a large amount of workstation resources.
This review just happened to coincide with enhancements to the University of Michigan Library's repository infrastructure and an increased budget for digital archives storage at the Bentley Historical Library. As a result, the Library Information Technology division recommended that we investigate Archivematica as an alternative to AutoPro, citing the following advantages:
- A graphical user interface available via a web-based dashboard.
- A "client/server processing architecture [that] allows it to be deployed in multi-node, distributed processing configurations."
- Support for "large-scale, resource-intensive production environments" that would permit archivists to simultaneously ingest and process multiple large deposits of digital archives.
- "Highly scalable configurations" that would permit granular control of settings for individual Virtual Machines (VMs) according to the size or contents of given Submission Information Packages (SIPs).
- Ability to "control and trigger specific micro-services."
- Improved exception handling and various notifications, "includ[ing] error reports, monitoring of [system] tasks and manual approvals in the workflow."
- Simplified "alteration of preservation plans and user access levels." [2]
Initial Testing
Archivematica Dashboard
Using relevant procedures and workflows from Archivematica’s Testing page, I ran a number of representative transfers through Archivematica’s transfer and ingest micro-services using a variety of processing configurations.
In addition to the sample transfers provided by Artefactual Systems, Inc. (some of which were intentionally designed to trigger failures in particular micro-services, such as "Scan for viruses"), I tested a number of in-house transfers that had previously been run through AutoPro. These covered a wide range of digital objects:
- websites;
- text(ual) materials like PDFs and Word documents;
- spreadsheets;
- images;
- email; and
- audio/video files.
Some of these were hierarchical in nature, and some were flat. One transfer was exceptionally large (about 10.7 GB, although that's only a small percentage of the total SIP). I also experimented with a disk image to test the new Forensic disk image ingest feature of Archivematica (released in September 2014) and a collection of sample files with personally identifiable information intended to test Archivematica’s existing integration with bulk_extractor.
Findings
Our main interest in all this testing was to find out if and how Archivematica would "replace and extend" the Bentley Historical Library's existing procedures (i.e., AutoPro).
Replacing AutoPro
After some initial trial and error and communication with the Library Information Technology division, nearly all transfers were successfully ingested (I'll get to the one exception in a bit). We did have some permissions trouble with indexing and storing transfers and Archival Information Packages (AIPs), but I believe most of that stems from the way our server is set up here.
Most of the steps in the Bentley’s current digital processing workflow utilizing AutoPro can be replaced by one of Archivematica’s micro-services:
| AutoPro Workflow Step | Archivematica Micro-Service |
|---|---|
| Virus scan | Scan for viruses |
| Create temporary backup | Create transfer backups |
| Open archive files (.ZIP, .TAR, etc.) | Extract packages |
| File and folder name normalization | Clean up names |
| Identify missing file extensions | Characterize and extract metadata |
| Create preservation copies | Normalize (Normalize preservation) |
| PII (credit card and Social Security number) scan | Examine contents* |
| Appraisal and arrangement | [Appraisal and Arrangement tab] |
| Descriptive and administrative metadata creation | Metadata |
| Extract technical metadata | Characterize and extract metadata |
| Transfer content (with metadata) to long-term storage | Store AIP |
| Clean up | Store AIP (Remove processing directory) |
There are two notable exceptions: the "Appraisal and arrangement" step and the PII scan.
Notable Exception #1: Appraisal and Arrangement
The first notable exception is AutoPro’s “Appraisal and arrangement” step, for which there exists no comparable Archivematica micro-service. This functionality is very important to us. While it's true that additional steps are needed in the digital world to ensure the authenticity, integrity and security of content, digital processing is first and foremost traditional processing (this is also why we have one Curation division here at the Bentley Historical Library, not two). Traditional archival functions like appraisal, arrangement and description are just as important in the digital world as they are in the paper world.
This is why we are partnering with Artefactual Systems, Inc. to develop an Appraisal and Arrangement tab in Archivematica. We consider this functionality a high priority, and as such it is part of the first phase of development. The mockup below is what we're working on during the first sprint; it's the Transfer Backlog pane (the "appraisal" part). The final product will also include an ArchivesSpace pane (the "arrangement" part).
Be sure to keep an eye out on this page of the Archivematica wiki for the latest and greatest version of the Appraisal and Arrangement tab.
Notable Exception #2: PII
A second exception has to do with Personally Identifiable Information (PII). While the “Examine contents” micro-service of Archivematica does replicate AutoPro's functionality to identify documents that may contain PII, it does not replicate its ability to redact PII (via Identity Finder's “Scrub” functionality), and it does not currently replicate its ability to “Shred” or securely delete files containing PII.
As it turns out, the University of Michigan has decided to pull support for Identity Finder, so this is a bit of a moot point. However, part of our proposed feature development with Artefactual Systems, Inc. also includes introducing functionality in Archivematica to act on some of the bulk_extractor reports it is currently running on transfers. For example, we hope to be able to apply machine-actionable PREMIS rights statements to files and folders identified using the accounts scanner (or others) in bulk_extractor, which looks for credit card numbers, credit card track 2 information (the magnetic stripe data track read by ATMs and credit card checkers), phone numbers, and other formatted numbers. We would then use this metadata to automatically embargo or restrict access to content in Deep Blue.
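To make the idea of "acting on" a bulk_extractor report a little more concrete, here is a minimal Python sketch. It is not how Archivematica does (or will do) this. It assumes one feature report per file, with tab-separated lines (position, feature, context) and comment lines that start with `#`; the file identifiers, the threshold, and the shape of the restriction record are all hypothetical stand-ins, not the actual PREMIS or Archivematica data model.

```python
# Sample input: hypothetical file identifiers mapped to the text of a
# feature report for that file (assumed format: position<TAB>feature<TAB>context).
SAMPLE_REPORTS = {
    "letters/donor_list.doc": (
        "# sample comment line\n"
        "1024\t4111111111111111\tcard no 4111111111111111 exp\n"
    ),
    "photos/picnic.jpg": "# sample comment line\n",
}


def count_features(feature_text: str) -> int:
    """Count feature lines, skipping blank lines and '#' comment lines."""
    count = 0
    for line in feature_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # A feature line needs at least a position and a feature value.
        if len(line.split("\t")) >= 2:
            count += 1
    return count


def files_to_restrict(reports: dict[str, str], threshold: int = 1) -> list[str]:
    """Return the identifiers of files with at least `threshold` features."""
    return sorted(
        file_id
        for file_id, text in reports.items()
        if count_features(text) >= threshold
    )


def restriction_record(file_id: str) -> dict[str, str]:
    """Build an illustrative (hypothetical) rights record for one file."""
    return {"file": file_id, "act": "disseminate", "restriction": "Disallow"}


if __name__ == "__main__":
    for file_id in files_to_restrict(SAMPLE_REPORTS):
        print(restriction_record(file_id))
```

In practice an archivist would want to review the hits before anything is embargoed, since formatted-number scanners are prone to false positives.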
Extending AutoPro
A number of Archivematica micro-services would actually extend the functionality of AutoPro, giving the Bentley the ability to:
- automatically create UUIDs for transfers, SIPs and files, uniquely identifying and directly associating transfers and SIPs, as well as files and metadata, and, as part of the proposed development work, directly associating that with the DSpace Handle System;
- create workflow “pipelines,” pre-configuring processing decisions for transfers and SIPs for groups of like material (e.g., born-digital acquisitions, digitization projects, audio/video, disk images vs. logical copies of directories, web archives, etc.);
- automatically generate a robust METS.xml document, which is automatically added to any SIP generated from a transfer;
- verify transfer checksums to compare data inside of Archivematica with data as it existed outside of Archivematica;
- quarantine a transfer for a set period of time, until virus definitions update;
- remove cache files;
- automatically normalize files to create Dissemination Information Packages and thumbnails, if desired;
- set permissions using PREMIS rights metadata, which, as part of the proposed development work, would also be recorded in ArchivesSpace and would carry over to the ability to embargo collections in DSpace; and
- interact with AIPs and their METS files via an API.
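For readers who haven't seen them in action, two of the items above (checksum verification and UUID assignment) are easy to illustrate. The following is a minimal Python sketch of the general technique, using only the standard library. It is not Archivematica's code; the function names, the choice of SHA-256, and the manifest structure (relative path mapped to expected digest) are assumptions made for the example.

```python
import hashlib
import tempfile
import uuid
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or whose checksum differs
    from the manifest. An empty list means the transfer verified."""
    mismatches = []
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            mismatches.append(relative_path)
    return mismatches


def assign_uuids(manifest: dict[str, str]) -> dict[str, str]:
    """Give every file listed in the manifest a random (version 4) UUID."""
    return {relative_path: str(uuid.uuid4()) for relative_path in manifest}


def demo() -> tuple[list[str], list[str]]:
    """Verify a tiny transfer before and after one file is altered."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "report.txt").write_bytes(b"annual report")
        (root / "photo.bin").write_bytes(b"\x00\x01\x02")
        # Checksums recorded "outside" the system, before transfer.
        manifest = {
            name: sha256_of(root / name) for name in ("report.txt", "photo.bin")
        }
        before = verify_transfer(root, manifest)
        # Simulate corruption of one file after the manifest was made.
        (root / "photo.bin").write_bytes(b"\x00\x01\x03")
        after = verify_transfer(root, manifest)
        return before, after


if __name__ == "__main__":
    print(demo())
```

The point of the comparison is the same as in the list above: the digests recorded before transfer act as evidence of what the data looked like outside the system, so any later mismatch is detectable.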
Improving AutoPro
The original Mellon proposal noted that AutoPro is not an ideal solution because component programs installed on individual machines must undergo frequent updates, because Windows CMD.EXE scripts have a limited capacity to handle errors and exceptions and little text-processing functionality, and because scalability becomes an issue.
Web-Based
Because Archivematica is web-based, there is no need to install clients on individual machines, and system updates only need to happen once.
Better Error-Handling
Archivematica was designed to anticipate a wide variety of processing errors. As a result, it also improves upon AutoPro’s ability to handle them. While some errors result in a process being halted and the transfer or SIP being moved to the failed directory, for others, processing can continue. Both types of errors were encountered and corrected during testing, as you can see in this typical "Archivematica Fail Report":
| Type | Status | Started |
|---|---|---|
| Index AIP | Failed | 2015-03-16 17:26:35 |
| Store the AIP | Completed successfully | 2015-03-16 16:53:27 |
| Verify AIP | Completed successfully | 2015-03-16 16:52:17 |
| Move to processing directory | Completed successfully | 2015-03-16 16:52:17 |
| … | | |
| Move to processing directory | Completed successfully | 2015-03-16 15:00:38 |
| Normalize | Completed successfully | 2015-03-16 15:00:38 |
| Resume after normalization file identification tool selected. | Completed successfully | 2015-03-16 15:00:38 |
| Identify file format | Failed | 2015-03-16 14:42:38 |
| Select pre-normalize file format identification command | Completed successfully | 2015-03-16 14:42:38 |
| Move to select file ID tool | Completed successfully | 2015-03-16 14:42:37 |
| Set resume link after tool selected. | Completed successfully | 2015-03-16 14:42:37 |
| … | | |
| Set file permissions | Completed successfully | 2015-03-16 14:37:09 |
| Create removal from backlog PREMIS events | Completed successfully | 2015-03-16 14:37:09 |
| Approve SIP Creation | Completed successfully | 2015-03-16 14:18:16 |
As you can see, that's a lot of green (actually much more than is displayed here, hence the ellipses); the majority of these micro-services worked just fine. "Identify file format" is an example of an Archivematica error for which processing can and did continue. "Index AIP" is an example of an error for which processing is halted.
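The difference between the two kinds of failure can be read straight off the timestamps. Here is a small Python sketch that does so for a handful of rows copied from the report. The heuristic (a failure was non-halting if some other job started after it) is my own simplification for this one report, not how Archivematica itself decides whether to halt, and the function names are made up for the example.

```python
from datetime import datetime

# A subset of rows from the fail report: (type, status, started).
JOBS = [
    ("Index AIP", "Failed", "2015-03-16 17:26:35"),
    ("Store the AIP", "Completed successfully", "2015-03-16 16:53:27"),
    ("Verify AIP", "Completed successfully", "2015-03-16 16:52:17"),
    ("Normalize", "Completed successfully", "2015-03-16 15:00:38"),
    ("Identify file format", "Failed", "2015-03-16 14:42:38"),
    ("Approve SIP Creation", "Completed successfully", "2015-03-16 14:18:16"),
]


def parse(started: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM:SS' timestamp from the report."""
    return datetime.strptime(started, "%Y-%m-%d %H:%M:%S")


def failed_jobs(jobs):
    """Return the names of failed jobs, oldest first."""
    failures = [
        (parse(started), name)
        for name, status, started in jobs
        if status == "Failed"
    ]
    return [name for _, name in sorted(failures)]


def continued_after(jobs, failed_name):
    """Return True if any other job started after the named failed job."""
    failed_at = next(
        parse(started)
        for name, status, started in jobs
        if name == failed_name and status == "Failed"
    )
    return any(
        parse(started) > failed_at
        for name, _, started in jobs
        if name != failed_name
    )


if __name__ == "__main__":
    for name in failed_jobs(JOBS):
        outcome = "continued" if continued_after(JOBS, name) else "halted"
        print(f"{name}: processing {outcome}")
```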
Scalability (To Be Determined)
Unfortunately, I'm not able to report out yet on how Archivematica does with scalability. We've heard tell that Archivematica can work on packages as large as one TB. However, I've attempted the 10.7 GB transfer twice, with no luck yet. Artefactual Systems, Inc. is currently working with the Library Information Technology division to get this resolved. Stay tuned for an update to this post.
Conclusion
While we did encounter some issues during Archivematica testing, for the most part it seems that Archivematica (or the proposed feature development) does indeed replace and extend the functionality of AutoPro. We're excited to start using it in production!
[1] https://www.archivematica.org/en/
[2] Quotations in this section are from https://www.archivematica.org/wiki/Overview.
[3] Curse you, thumbs.db!