Thursday, January 28, 2016

Archivematica to DSpace (and back again to ArchivesSpace)

So, we haven't talked much yet on this blog about DSpace...
DSpace is a turnkey institutional repository.
Well, I guess that's not entirely true. Mike has written about both digital objects/DSpace items and also our current use of DSpace to provide access to digital archives (to see what condition his condition was in). But it's only been in the most recent phase of development with Artefactual that we've turned to DSpace/Deep Blue (our slightly customized DSpace instance) integration.

If you're interested, we've also been working on some optimization issues, packaging AIPs and prioritizing our development requests for improvements to the Appraisal and Arrangement tab (including some exciting UI enhancements--thanks, Dan!)--but all of that's a topic for another day!

Today's post is on the Archivematica-DSpace part of the ArchivesSpace-Archivematica-DSpace Workflow Integration project. I suspect this will be a two-parter. Today, I'll outline the workflow we envision as well as some of the options we're weighing for making it happen. Later, I'll report back on what course of action we decided to take and why.

Workflow

The essential workflow goes like this:

  1. An archivist accessions (ArchivesSpace), transfers (Archivematica), appraises and arranges (forthcoming in Archivematica 1.6) a SIP.
  2. An archivist "Finalize[s] Arrangement" for a particular digital object and its components.
  3. Archivematica runs said digital object through the rest of its Ingest process (we'll be normalizing for preservation but you can do whatever you like!).
  4. Archivematica creates a single Digital Object in ArchivesSpace, with one or more associated Digital Object Components.
  5. Archivematica spits out a bagged AIP (actually two bags, one with the data itself and one with more administrative-type information) and deposits it into a user-selected collection in DSpace as data (an item composed of one or more bitstreams) plus metadata. The metadata will likely come from the ArchivesSpace metadata we've been using/creating already within Archivematica (i.e., not pulled from ArchivesSpace, or at least not pulled immediately from ArchivesSpace). [1]
  6. Archivematica updates ArchivesSpace with the relevant information: handle, URL, etc.

Considerations

It doesn't all come down to workflow. We have some goals for the way we'd like for this to work, and Archivematica and DSpace have some additional responsibilities as components of an OAIS.

Our Goals

One of our goals for this Mellon-funded project is to ensure that new features and functionality are modular so that other institutions may adopt some or all project deliverables. Indeed, this is a requirement of Mellon's, an organization that "aims to maximize the use and sustainability of technology," and would like for "funded work to be made publicly available for the long-term benefit of... cultural institutions." [3]

This is something we take very seriously. It has informed everything about this process, from the way we've done development (so that, for example, institutions who don't use ArchivesSpace can still make use of the Appraisal and Arrangement tab), to our attempts to ensure that workflows are flexible (for MPLP as a baseline folks and for item-level folks) to our plans for sustainability (all code, in addition to just being out there, will be incorporated into Archivematica's core code and maintained by Artefactual going forward). It's even informed the way we've tried to reach out to others on thorny issues and the way we're trying to be as open as possible and share as much as we can on this blog.

All of this holds true for this process of deciding how we'll get data and metadata from Archivematica to DSpace. Ideally, we'd like for this to work for all institutions, regardless of the repository they're using (or even if they're not using a repository, but that part's easy). As we consider a move to Hydra in the next few years, this would actually work out well for us too. If that won't work, we'd at least like for this to work for everyone who uses DSpace, and not be tied specifically to Deep Blue. If even that won't work, we'll reluctantly settle for something that will only work for Deep Blue, that depends on our local Dublin Core conventions, or that MLibrary LIT developers will have to develop and maintain. After all, we do have to make sure that data and metadata actually get from Archivematica to DSpace by the time the grant concludes.

Archivematica's (and DSpace's) Additional Responsibilities

In addition, Archivematica and DSpace, by virtue of the fact that they are components of a digital preservation system, have some additional responsibilities above and beyond just exchanging data and metadata. Archivematica, for instance, needs to be able to ensure that AIPs have successfully transferred to what we're using for Archival Storage (i.e., DSpace), say by having some mechanism to verify checksums on both ends of a transfer. For all you OAIS junkies out there, that would be the Error Checking function of the Archival Storage Functional Entity:
"The Error Checking function provides statistically acceptable assurance that no components of the AIP are corrupted in Archival Storage or during any internal Archival Storage data transfer... The Preservation Description Information (PDI) Fixity Information provides some assurance that the Content Information has not been altered as the AIP is moved and accessed."
As a result, whatever protocol or method we use to transfer data and metadata needs to be able to check this kind of thing and throw up an error if something goes wrong.
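
To make "verify checksums on both ends" a bit more concrete, here is a minimal Python sketch. It is not Archivematica's or DSpace's actual mechanism. The names file_checksum, verify_transfer and FixityError are made up for illustration, and I'm assuming each side can produce a simple filename-to-checksum map.

```python
import hashlib


class FixityError(Exception):
    """Raised when the two ends of a transfer disagree (hypothetical name)."""


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in chunks so large AIPs are never read into memory at once."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(sent, received):
    """Compare {filename: checksum} maps from the sending and receiving ends.

    Raises FixityError listing files that never arrived or arrived altered.
    """
    missing = sorted(set(sent) - set(received))
    mismatched = sorted(
        name for name in sent if name in received and sent[name] != received[name]
    )
    if missing or mismatched:
        raise FixityError(f"missing={missing} mismatched={mismatched}")
    return True
```

The point of raising an exception rather than returning False is that a failed fixity check should stop the workflow loudly instead of being quietly ignored.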

Archivematica also has a responsibility to be able to reassemble the AIP upon request. That would be the Provide Data function of the Archival Storage Functional Entity:
"This function receives an AIP request that identifies the requested AIP(s) and provides them on the requested media type or transfers them to a temporary storage area."
As a result, Archivematica will need to have more granular information about individual bitstreams that make up an AIP than we originally anticipated needing, for example, for the minimal metadata that we'll record for Digital Objects in ArchivesSpace. [2] The handles are important for this, but so are the bitstream URLs, even for the administrative bit that will be hidden to users.
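
As a rough illustration of the kind of per-bitstream record this implies, here is a small Python sketch. Everything in it is an assumption made for the example: the class names, the "data" and "admin" role labels for the two bags, and the field layout are hypothetical, not what Archivematica actually stores.

```python
from dataclasses import dataclass, field


@dataclass
class BitstreamRecord:
    name: str  # filename as deposited
    url: str   # bitstream URL reported by the repository
    role: str  # "data" or "admin" (hypothetical labels for the two bags)


@dataclass
class AipRecord:
    aip_uuid: str
    handle: str
    bitstreams: list = field(default_factory=list)

    def missing_roles(self, required=("data", "admin")):
        """Return the roles with no recorded bitstream, in the order given."""
        present = {bitstream.role for bitstream in self.bitstreams}
        return [role for role in required if role not in present]

    def retrieval_urls(self):
        """Return every URL needed to reassemble the AIP, or fail if incomplete."""
        missing = self.missing_roles()
        if missing:
            raise ValueError(f"cannot reassemble AIP, no bitstream for: {missing}")
        return [bitstream.url for bitstream in self.bitstreams]
```

The handle alone points at the item. It's the per-bitstream URLs, including the one for the hidden administrative bag, that make reassembly possible.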

Those are just two examples but I hope they serve to illustrate the fact that this data/metadata exchange won't be quite as simple as copying files from one place to another.

Options

Just last week we had a brainstorming session with representatives from Artefactual, the Bentley, MLibrary LIT and the University of Edinburgh (including someone who works on SWORD) on the topic of Archivematica-DSpace data and metadata exchange. Justin at Artefactual began by outlining what he sees as the three options for getting data and metadata from Archivematica to DSpace, and we spent the hour discussing the advantages and disadvantages of each. 

REST API

The DSpace REST API provides a programmatic interface to DSpace Communities, Collections, Items and Bitstreams. In the latest version, the REST API allows authentication to access restricted content as well as allowing Create, Edit and Delete on DSpace Objects. REST Endpoints allow you to do things like login, logout and, important for our purposes, post metadata and bitstreams to items, post policies to bitstreams, and get handles.

If you want to know more about the REST API, check out the latest documentation on their wiki.

Some advantages of the REST API:
  • Can deposit bitstreams.
  • Can edit metadata.

Some disadvantages:
  • This would not work with other repositories. 
  • We'd need to develop a callback for something like verifying checksums.
  • While you can get access to restricted content, we're not sure if it can handle groups that we use for permissions (for example, Bentley IP addresses for Reading Room Only material).
  • Harder to get handle.

Simple Archive Format

DSpace also has a set of command line tools for importing and exporting items in batches using the DSpace Simple Archive Format. The basic idea is to produce a particular directory structure with sub-directories for Items. Each sub-directory contains components for the item's descriptive metadata and the files that make up the item. There are also conventions for the XML files and for the contents of the contents file. Important for our purposes, the Simple Archive Format allows you to import items to particular collections, can alert (e-mail) folks that items have been imported, can resume a failed import and can add items from a ZIP file. There's also a UI for import, but I don't suspect we'll be using that.

For more on the Simple Archive Format, check out the latest documentation on their wiki.
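
For a sense of what one item sub-directory looks like, here is a short Python sketch that writes one. It follows the layout as I understand it from the DSpace documentation: a dublin_core.xml file made of dcvalue elements, a contents file listing one filename per line, and the files themselves. The function name write_saf_item and its arguments are made up for this example, and it's a sketch rather than the scripts we actually use.

```python
import shutil
from pathlib import Path
from xml.etree import ElementTree as ET


def write_saf_item(archive_dir, item_name, metadata, files):
    """Write one Simple Archive Format item directory.

    metadata: list of (element, qualifier, value); qualifier may be None.
    files: paths of the files that make up the item.
    """
    item_dir = Path(archive_dir) / item_name
    item_dir.mkdir(parents=True, exist_ok=True)

    # Descriptive metadata: one <dcvalue> per Dublin Core field.
    root = ET.Element("dublin_core")
    for element, qualifier, value in metadata:
        dcvalue = ET.SubElement(
            root, "dcvalue", element=element, qualifier=qualifier or "none"
        )
        dcvalue.text = value
    ET.ElementTree(root).write(
        str(item_dir / "dublin_core.xml"), encoding="utf-8", xml_declaration=True
    )

    # Copy the files in and list them, one per line, in the "contents" file.
    names = []
    for source in files:
        source = Path(source)
        shutil.copy2(source, item_dir / source.name)
        names.append(source.name)
    (item_dir / "contents").write_text("\n".join(names) + "\n", encoding="utf-8")
    return item_dir
```

Calling write_saf_item("archive", "item_000", [("title", None, "Annual report")], ["report.pdf"]) would produce archive/item_000/ containing dublin_core.xml, contents and report.pdf, ready for the batch import tools.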

Some advantages of the Simple Archive Format:
  • It's simple! But seriously, we have a lot of experience with it--we use it all the time.
  • Less work to implement.
  • We could do everything we want with it with some development.
  • Already works with our locally-grown embargo functionality.
  • Returns a file that maps deposited filenames to what they became.

Some disadvantages:
  • This would not work with other repositories.
  • There are some questions about how difficult it would be to make this work for DSpace and specific instances of DSpace like Deep Blue, given variations in Dublin Core and that kind of thing.
  • We'd need to develop a callback for something like verifying checksums.
  • It's an offline communication format, so it's slower and involves more code to maintain.
  • Would have to be developed by individual institutions.

SWORD

They was looking for the SWORDs, they was looking for the SWORDs! Sorry, couldn't help myself.
This is the SWORD we're looking for.
SWORD (Simple Web-service Offering Repository Deposit) is an interoperability standard that allows digital repositories to accept the deposit of content from multiple sources in different formats via a standardized protocol. SWORD allows clients to talk to repository servers. Important for our purposes, it allows deposit to a SWORD-compliant repository (DSpace is one of them, and so is Fedora) by a third party system (like Archivematica). It allows you to deposit files and, in its latest version, not only copes with the "fire and forget" deposit scenario but also facilitates the functions needed to support the whole deposit lifecycle--such as notifying a depositor that a deposit was successful and even verifying a Content-MD5 header. Cool stuff.
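
Here's a small Python sketch of the Content-MD5 idea: the client hashes the package and sends the digest along, and the server recomputes it on arrival. The header names and the SimpleZip packaging identifier are as I understand them from the SWORD 2.0 profile, and I'm assuming the digest is sent as a hex string, so check the specification before relying on any of it. The function names are made up, and nothing here talks to a real server.

```python
import hashlib


def sword_deposit_headers(package_bytes, filename, in_progress=False):
    """Build headers for a binary SWORD deposit (sketch, no network access)."""
    return {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        # Assumption: hex-encoded MD5 digest of the request body.
        "Content-MD5": hashlib.md5(package_bytes).hexdigest(),
        "Packaging": "http://purl.org/net/sword/package/SimpleZip",
        "In-Progress": "true" if in_progress else "false",
    }


def server_side_check(received_bytes, headers):
    """What a receiving repository would do: recompute the digest and compare."""
    return hashlib.md5(received_bytes).hexdigest() == headers["Content-MD5"]
```

If even one byte changes in transit, server_side_check returns False, which is the kind of built-in error checking the other two options would need a custom callback for.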

If you want to know more about SWORD, check out their website.

Some advantages of using SWORD:
  • This would work for other repositories besides DSpace, which means a lot for our goals.
  • It's good for depositing files.
  • Hydra/Fedora support is already there.
  • Does things live and dynamically.
  • May allow you to create a handle and add metadata later.
  • Can e-mail folks letting them know that ingest was successful.

And some disadvantages:
  • It's not really for adding or editing metadata.
  • Doesn't handle permissions, restrictions or embargoed items.

Conclusion

Decisions, decisions! It's important to note that these options aren't mutually exclusive. Simple Archive Format scripts and SWORD could be used in conjunction, for example, and this is one of the options we're currently exploring. We could also make changes to the DSpace code itself.

Well, that's about it for Archivematica-DSpace/Deep Blue Integration. Check back soon for an update on the course we've decided to take!

[1] I'm actually not sure what order this will happen in, that is, whether the Digital Object will get created in ArchivesSpace first or if content will get deposited into Deep Blue first.
[2] Check out Mike's post on digital objects and ArchivesSpace for more information on how we envision using Digital Objects in ArchivesSpace. Full disclosure, our thinking on this is evolving a bit, but for now it's still true that we'd like to use Digital Objects in ArchivesSpace mostly for their ability to point or link out to, in our case, a handle. That's a fairly minimal implementation compared to all the rich technical and administrative metadata you could add to Digital Objects. It's a system of record thing!
[3] Grantmaking Policies, Andrew J. Mellon Foundation.
