Monday, November 16, 2015

Digital Objects and ArchivesSpace

One (somewhat unexpected) challenge in our ArchivesSpace-Archivematica-DSpace Workflow Integration project has involved mapping terms and terminologies across the different platforms.  In conversations with our development partners at Artefactual Systems, members of the ASpace community, and other peer institutions, we've found that it's really important to take a moment and make sure we're all on the same page when we're talking about something like a 'digital object.'

Having some common ground/shared understanding is very important, as our workflow establishes the following equivalences:

1 Archivematica SIP = 1 Archivematica AIP = 1 ASpace Digital Object Record = 1 DSpace Item

I'd like to take this opportunity to review the reasons behind this structure, but first I think it would be useful to take a look at how others in the ASpace user community are approaching digital object records.

Perspectives on the ASpace Digital Object Record

As evidenced by the introductory materials for an ASpace workshop that was held here in Ann Arbor this past January, the digital object record was designed to be flexible:
The Digital Object record is optimized for recording metadata for digitized facsimiles or born-digital resources. The Digital Object record can either be single- or multilevel, that is, it can have sub-components just like a Resource record. Moreover, the record can represent the structural relationship between the metadata and associated digital files--whether as simple relationships (e.g., a metadata record associated with a scanned image, and its derivatives) or complex relationships (e.g., a metadata record for a multi-paged item; and additionally, a metadata record for each scanned page, and its derivatives). One or more file versions can be referenced from the Digital Object metadata record.  The Digital Object record can be created from within a Resource record, or created independently and then either linked or not to a Resource record.
While this flexibility is great, it also provokes a lot of questions about just how to implement the digital object records, some of which have been featured in conversations on the ArchivesSpace Google Group as well as the ArchivesSpace Users Group.

On one end of the spectrum, we have complex digital objects--multilevel intellectual entities composed of multiple bitstreams that can be represented in a structured hierarchy.  Brad Westbrook provides some examples of this use case in this thread from the ASpace Google Group. Those of us in attendance at the "Using Open-Source Tools to Fulfill Digital Preservation Requirements" workshop a couple of weeks ago at iPRES got to see a real-world example of how a complex digital object could be represented in ASpace via content from the UC San Diego Research Data Curation Program in the Online Archive of California.

Far more common (based upon conversations with peers and posts to the lists) is a simpler approach in which the digital object record is used primarily to record URL information that will provide links to content from the public ASpace interface or from <dao> elements in exported EAD.  This thread provides some valuable thoughts from Ben Goldman, Jarrett Drake, Chris Prom, Maureen Callahan, and our own Max Eckard.

Several of the important ideas raised in that conversation include the need for institutions to:
  • Define systems of record for data/metadata and determine how ASpace fits into this ecosystem.
  • Identify how information in the digital object records can be used now and in the future (e.g., the records can bring together digital content stored in various systems/locations, serialize information to EAD files, respond to queries via the API, etc.).
I won't attempt to delineate the different positions in the thread but encourage you to give it a thorough read!

Moving from this (very) brief review of the landscape, I wanted to identify some of our key assumptions here at the Bentley:
  • The general position outlined by Max is still accurate ("We're thinking of the DO module more as a place to record location than as a place to "manage" digital objects or the events that happen to them"): we are primarily interested in using the ASpace digital object module to create <dao> tags and links to content in EAD finding aids.
    •  We would therefore not be looking to include technical/preservation metadata about AIPs in the digital object record or do extensive arrangement with the digital object components.
    • With the above in mind, the ‘digital object’ records become somewhat analogous to physical ‘instances’--these are manifestations of the archival description expressed in the associated archival object record.
    •  In addition, within AS a digital object may be ‘simple’ or ‘complex’ (in the latter case, comprised of one or more digital object components).  We're now contemplating slightly more 'complex' digital object records...
    • We've also been working with Artefactual Systems and some other peer institutions to think more about how and where to record machine-understandable/actionable PREMIS rights information associated with digital objects.
  • Within the new Appraisal and Arrangement tab, a dedicated ASpace pane will display the ‘archival objects’ (i.e., the subordinate components) of a given resource record in a hierarchical structure. Within the ASpace pane, users will be able to create new archival objects and add basic metadata.

  • Within the appraisal tab, archivists will drag/drop content (individual files and/or entire directories) to a given ‘archival object’ in the ASpace pane.

    • All content associated with an archival object will be a single SIP/AIP in Archivematica.
    • Furthermore, each SIP/AIP will comprise a single ASpace ‘digital object’
    • 1 ASpace digital object = 1 Archivematica SIP = 1 Archivematica AIP = 1 DSpace item
  • We are not spinning off separate DIPs; we may configure Archivematica's Format Policy Registry (FPR) to spin off lightweight copies for some file formats, but otherwise the Archival Information Packages (AIPs) will serve for both preservation and access.
  • The Bentley's past/current use of DSpace is another factor here, as a single 'item' may contain one or more 'bitstreams' (i.e., files).  We therefore would like to be able to do some minimal arrangement of bitstreams within an ASpace digital object to control how materials will be deposited to DSpace.
    • Whenever possible, we strive to describe materials at an aggregate level, which means that a fairly large number of files (in number or space on disk) may be associated with a given 'item.' We also package content in .zip files to reduce the number of files we have to manage and that our users have to download.
    • To avoid presenting our users with extremely large .zip files that could be difficult to download and access, we often will chunk content across multiple .zips--i.e., instead of one 10 GB .zip, we will provide users with five 2 GB zips, as evidenced in this example from our Governor Jennifer Granholm collection:

    • In other cases, we might want to differentiate between access and preservation copies of materials in a collection. As an example, the following DSpace item includes an .mp4 access copy of a video recording while the .zip file contains an .iso image file of the original DVD:


    • We see the DSpace item as being the equivalent of the ASpace digital object record, with the individual bitstreams corresponding to the digital object components.
    • We won't be using DSpace forever (Michigan recently became a Hydra partner) and so we don't want to predicate our ASpace-Archivematica workflows on legacy systems.
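
The chunking practice described above (several smaller .zip files in place of one very large one) can be sketched in a few lines of Python. This is only an illustration and not the Bentley's actual tooling: the FileEntry type, the chunk_files function, the file names, and the 2 GB limit are all hypothetical, and the sketch simply fills each chunk in order until the next file would push it over the limit.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for a file awaiting packaging."""
    path: str
    size_bytes: int


def chunk_files(files: List[FileEntry], max_chunk_bytes: int) -> List[List[FileEntry]]:
    """Group files, in order, into chunks whose total size stays at or under the limit.

    A single file larger than the limit still gets a chunk of its own.
    """
    chunks: List[List[FileEntry]] = []
    current: List[FileEntry] = []
    current_size = 0
    for entry in files:
        # Start a new chunk when adding this file would exceed the limit.
        if current and current_size + entry.size_bytes > max_chunk_bytes:
            chunks.append(current)
            current = []
            current_size = 0
        current.append(entry)
        current_size += entry.size_bytes
    if current:
        chunks.append(current)
    return chunks


GB = 1024 ** 3
# Ten made-up 1 GB files, i.e. 10 GB of content in total.
files = [FileEntry(f"collection/file_{i:02d}.dat", GB) for i in range(10)]
chunks = chunk_files(files, 2 * GB)
# One hypothetical .zip name per chunk.
zip_names = [f"collection_part_{n}.zip" for n, _ in enumerate(chunks, start=1)]
print(zip_names)
```

With these made-up inputs, the 10 GB of content ends up in five chunks of 2 GB each, matching the example in the post.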

Potential New Features

So...where does this leave us?  I wanted to talk through a possible arrangement workflow (based upon the new Appraisal tab) and how this might be translated into ASpace digital object records.  Let's see how this goes...

We've suggested the addition of an “Add digital object component” button in the ASpace pane (see above screenshot), which could function as follows:
  • A user would select a particular archival object in the ASpace pane and click the “Add digital object component” button.
  • Clicking the button will trigger the creation of a ‘digital object component’ that will appear as a child of the archival object.
    • Adding at least one digital object component essentially creates the main digital object record (which may include multiple components).
    • All the ‘digital object components’ nested under an archival object will comprise a single AS ‘digital object.’
    • In arranging the digital object components, users would only be able to work with 1 level of hierarchy--this will be very simple and minimal ‘arrangement.’
  • A digital object component will essentially be a bucket or a virtual container where one or more files and/or folders may be dragged/dropped.
  • To visually distinguish the ‘digital object component’ from archival objects, it should have a different icon (perhaps use the following from the digital object record in ASpace) and/or the text might have a different colored background.
  • The digital object component would display a default title, comprised of the associated archival object’s title and/or date and a consecutive integer. (In other words, for the archival object ‘Archivematica Series’, the first digital object component would be ‘Archivematica Series 1’, the next would be ‘Archivematica Series 2’ and so forth.)
  • The user would drag one or more files/folders on top of a digital object component. The file(s) and/or folder(s) would be nested under the digital object component. The following example has two digital object components:
  • The user can select a digital object component and click the ‘Edit Metadata’ button. This would permit the user to edit the only pieces of metadata required for digital object components, ‘title’ and/or ‘label’, as seen below in AS:

We've also thought about some simple rules for digital object components (and information packages), as well.  Once an archivist clicks the 'Finalize Arrangement' button, Archivematica will create a SIP for the materials associated with a given archival object and commence its Ingest procedures, which may result in the creation of preservation copies (or OCR text).  Based upon this:
  • If there is only one file, it will be deposited to DSpace as an individual bitstream.
  • If there is more than one file and/or a folder (including derivatives produced by Archivematica), everything in the digital object component will be included in a single .zip file (perhaps using the digital object component title) that will be deposited to DSpace.
  • Additional components of the AIP produced by Archivematica (the logs folder, metadata folder, and METS file) will be packaged in a .zip file and deposited as an additional digital object component (perhaps with some default file name). The Bentley would want this content to be inaccessible to the general public (and ‘not published’ within the ASpace digital object record).
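
To make the three rules above concrete, here is a minimal Python sketch of how they might be applied. It is not Archivematica code: the Component and Deposit types, the plan_deposits function, and the default .zip name for the logs and metadata are all hypothetical, and it reads "more than one file and/or a folder" as "zip unless the component holds exactly one loose file." The default_component_title helper follows the naming convention proposed earlier in the post (archival object title plus a consecutive integer).

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List


@dataclass
class Component:
    """Hypothetical digital object component: a bucket of dragged-in content."""
    title: str
    files: List[str]               # relative file paths, derivatives included
    contains_folder: bool = False  # True if a whole directory was dropped in


@dataclass
class Deposit:
    """Hypothetical description of one bitstream to deposit to DSpace."""
    name: str
    members: List[str]
    zipped: bool
    public: bool


def default_component_title(archival_object_title: str, index: int) -> str:
    """Default title: archival object title plus a consecutive integer."""
    return f"{archival_object_title} {index}"


def plan_deposits(components: List[Component], aip_extras: List[str],
                  extras_name: str = "aip_metadata_and_logs.zip") -> List[Deposit]:
    """Apply the three packaging rules and return the planned bitstreams."""
    deposits: List[Deposit] = []
    for component in components:
        if not component.files:
            continue  # nothing was dropped into this component
        if len(component.files) == 1 and not component.contains_folder:
            # Rule 1: a single file is deposited as an individual bitstream.
            name = PurePosixPath(component.files[0]).name
            deposits.append(Deposit(name, list(component.files), zipped=False, public=True))
        else:
            # Rule 2: several files and/or a folder go into one .zip per component.
            deposits.append(Deposit(f"{component.title}.zip", list(component.files),
                                    zipped=True, public=True))
    if aip_extras:
        # Rule 3: logs, metadata, and METS are zipped separately and kept non-public.
        deposits.append(Deposit(extras_name, list(aip_extras), zipped=True, public=False))
    return deposits


series = "Archivematica Series"
components = [
    Component(default_component_title(series, 1), ["report.pdf"]),
    Component(default_component_title(series, 2),
              ["minutes/jan.docx", "minutes/jan.pdf"], contains_folder=True),
]
deposits = plan_deposits(components, ["logs/", "metadata/", "METS.xml"])
for deposit in deposits:
    print(deposit.name, "public" if deposit.public else "restricted")
```

Keeping the plan separate from the actual zipping and deposit steps makes it easy to review what would be sent to DSpace before anything is written.
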
After Ingest processes are complete and the content has been deposited to DSpace, information will be written back to the ASpace digital object record.  The main (i.e., 'top level') digital object would by default inherit the title and/or date of the associated archival object, employ the DSpace handle for File URI (as well as identifier? TBD…), and have an extent (in bytes) that represents all associated content.  PREMIS rights information could also be written to the digital object record, though we'd love to hear from folks with thoughts about this (for instance, would the associated archival object be a more suitable location?).
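
As a rough sketch of what that write-back might look like, the following Python builds the kind of JSON-style record that could be sent to ASpace for the top-level digital object. The field names are modeled on the ArchivesSpace digital object record format as I understand it and should be checked against the schema for your version; the build_digital_object function, the handle URL, the file names and sizes, the date values, and especially the "bytes" extent type (which may need to be added to the controlled value list) are assumptions made for illustration. No API call is made here.

```python
from typing import Dict, Optional


def build_digital_object(title: str,
                         handle_url: str,
                         component_sizes: Dict[str, int],
                         date_expression: Optional[str] = None) -> dict:
    """Assemble a top-level digital object record as a plain dict.

    title/date_expression are inherited from the archival object,
    handle_url is the DSpace handle, and component_sizes maps each
    deposited bitstream name to its size in bytes.
    """
    total_bytes = sum(component_sizes.values())
    record = {
        "jsonmodel_type": "digital_object",
        "title": title,
        # The post leaves the identifier choice open ("TBD"); reusing the
        # handle here is just one possibility.
        "digital_object_id": handle_url,
        "file_versions": [{"file_uri": handle_url}],
        "extents": [{
            "portion": "whole",
            "number": str(total_bytes),
            "extent_type": "bytes",  # assumed value, not a confirmed default
        }],
    }
    if date_expression:
        # Assumed date fields; adjust to local practice.
        record["dates"] = [{
            "expression": date_expression,
            "date_type": "inclusive",
            "label": "creation",
        }]
    return record


record = build_digital_object(
    title="Archivematica Series",
    handle_url="https://example.org/handle/123/456",  # placeholder handle
    component_sizes={"Archivematica Series 1.zip": 1000, "report.pdf": 500},
    date_expression="2015",
)
print(record["extents"][0]["number"])
```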

The digital object components (i.e., each specific grouping of content as well as the Archivematica logs and metadata) would then be added as children of the main digital object record: 

 
The digital object component records might also include extent information, more specific rights information, or...???

It's been exciting to think about the possibilities of ASpace's digital object record, but the fairly wide-open nature of the endeavor is also daunting, as there are no established best practices to fall back on.  What do you think?  How are you proceeding (or how would you proceed)?  We'd love to get your feedback and/or reactions!

3 comments:

  1. Thanks for this great post sharing your thoughts and the contextual information about previous and on-going discussions related to digital objects.
    I have a couple of comments.

    "Once an archivist clicks the 'Finalize Arrangement' button, Archivematica will create a SIP for the materials associated with a given archival object and commence its Ingest procedures, which may result in the creation of preservation copies (or OCR text). Based upon this: ..."

    For this part, it may be useful to note that what is produced by Archivematica after Ingest is an AIP. The way that you are describing your process is interesting, esp. in separating the AIP md, logs, etc. from the AIP data files. When you say that the DIP will be the same as the AIP I think it's important to note this separation of the preservation information from the data files because it demonstrates that your DIPs will not actually be the same as your AIPs. The file formats may be the same for this first pass from accession to Ingest to archival storage, but over time the file formats may change and you may have more in your AIP than you want to deliver to users in a DIP.

    All of what you are describing works really well for the first time through. I'm interested in how you are conceiving adding to your AIPs in the future (adding preservation metadata, additional files in new formats, etc.) as you migrate files, move to a new platform from DSpace, etc.

    As I've been thinking about it, I'd like to see in ArchivesSpace the ability to have two digital instances - one being the DIP and one the AIP. Is that possible in the way you have conceived the Add DO component button? For a good amount of our digital records which are under restriction by policy, we will not be making DIPs right away but will in the future. For other collections, we have both DIP and AIP digital sets and I want to note them both in the ArchivesSpace instances. It's possible to do this now in ArchivesSpace but I don't really see in the ArchivesSpace - Archivematica integration work this ability. We've talked with Artefactual about the ability to upload DIP md without uploading actual file info (dao or URL) and uploading the AIP location as well.

    You have done really great work on this and I'm very glad to see your progress as it helps the rest of us. - Kari Smith

    1. Thanks so much for your feedback, Kari--we really value your perspective! (My reply got a little long, so I'll break it into two parts--here's part I...)

      You raise a great point about my rather offhand remark about our AIPs serving as our DIPs. The Archival Information Packages produced by Archivematica do indeed consist of 'objects' (i.e., the data we want to preserve) as well as log files from various microservices, metadata (including any Submission Documentation), and the Archivematica METS file.

      Our current project sprint with Artefactual Systems is focused on repackaging the Archivematica AIP so that the metadata and logs are separated from the actual data/objects of preservation (which may be put into one or more ZIP files via the 'digital object component' work outlined in our post). We will then be providing access to the data in DSpace--the original files side by side with any preservation copies generated by the normalization microservice. This content (minus the AIP's logs and metadata) will thus serve as our Dissemination Information Package (and reflects our current practice; see http://deepblue.lib.umich.edu/handle/2027.42/95888). We will therefore not be spinning off a separate DIP as provided for in Archivematica's default ingest workflow (which produces the following: https://wiki.archivematica.org/DIP_structure).

      In thinking about this AIP repackaging, Justin Simpson's comments about the Archivematica storage service and pointer files are relevant (see https://groups.google.com/forum/m/#!topic/archivematica/VPQGd4s7hI8). Any repackaging (including the creation of one or more .zip files) would be recorded in the pointer file as PREMIS events "and the fileSec and structMap of the pointer file get updated by the storage service to include entries for each [component]" so that the original AIP can be reconstructed as needed.

      (End Part I)

    2. (OK—here’s part II)

      Your question about moving forward is also quite interesting, as it highlights how different institutions might define their Information Packages and preservation practices appropriate for their local environment.

      We are defining an AIP (and associated ASpace digital object) as a discrete portion of content consisting of one or more files/folders received from a donor/creator as part of a given accession. If we receive additional related content (for instance, Office files or digital photographs) that has some overlap with existing content, we would not seek to integrate these materials but would instead create a separate AIP/ASpace digital object/DSpace item (which could be associated with the same ASpace archival object record in the resource).

      If we are in a position where content requires additional preservation actions, we would look to take advantage of AIP re-ingest functionality currently under development (https://wiki.archivematica.org/AIP_re-ingest). While we haven't thoroughly explored this path, we would probably replace the earlier version with the newly ingested AIP and reflect all preservation events (such as spinning off new preservation copies) in the Archivematica METS.

      As Michigan recently became a Hydra partner, we have indeed been thinking about how content will be migrated to a new platform. I don't have a lot of details right now, but IT staff at our main library have been mapping DSpace objects over to a Hydra head and don't anticipate significant problems. In this new repository environment, we may no longer package content in .zip files--we don't anticipate any issues with our workflow here. It may be that there is some synergy with Ben Armintor/Columbia's work to ingest Archivematica AIPs into Hydra, but I don't know enough about that project to do more than wonder out loud.

      Finally, I think your goal of documenting both AIPs and DIPs in ASpace could be accommodated by our work, but this hasn't been a focus for us. I don't know that we'll have the time/money to specifically address that use case, but the DIP could likely be added as a second unique ASpace digital object (which could be the simplest approach) or it might be represented as an additional File Version/digital object component. Likewise, uploading just DIP metadata/AIP locations seems feasible--if Archivematica has the information, it's just a matter of determining how (and where) that information can be passed to ASpace via the API.

      Thanks again--and hope this generates some more conversations!

      Cheers,

      Mike
