Monday, July 13, 2015

Separation Anxiety!

To Save or Not to Save...

My old mentor, Tom Powers, used to say that the business of archives is not just about saving things, but also throwing them away--whether to adhere to collecting policies, conserve resources, or to help researchers identify truly essential records.  Identifying 'separations' is therefore an essential part of the appraisal process here at the Bentley Historical Library.  While your institution might refer to this process as 'weeding' or 'deaccessioning', I'm willing to bet that our goals are the same: removing out-of-scope or superfluous content from accessions before they become part of our permanent collections.

Of course, with this great power comes great responsibility, which leads to what my good friend David Wallace refers to as the 'dark night of the (archival) soul': can we fully anticipate future uses? Are we getting rid of things that some researcher, somewhere, at some time, might find truly useful?

Clearly, it's possible.

At the same time, we're in this for the long haul (a marathon, not a sprint), and the overall sustainability of our collections (and institutions) demands the strategic use of our limited resources.  Saving everything "just in case" isn't good archival practice: it's hoarding.  We're also keenly aware that having staff review and weed content at the item level is neither efficient nor sustainable (and certainly not the best use of staff time and salaries).  Balancing available resources and staff time with the widest possible (and practical) future uses of digital archives thus becomes the crux of the matter (and something we'll no doubt continue to wrestle with for many moons to come)...

Documenting our Policies

No matter what course of action we take, it remains important to make our thoughts and reasoning available, if for no other reason than to manage stakeholder expectations.  Since revamping our processing workflows for physical and digital materials last year, we've put renewed emphasis on 'MPLP' ("More Product, Less Process") strategies, especially the idea that any kind of weeding or separations should occur at the same level (i.e., folder or series) as the arrangement and description.  Item-level weeding is strictly avoided unless there are particularly compelling reasons to remove individual items (such as extremely large file sizes or the presence of sensitive personal information).

To ensure consistency (and transparency for donors and researchers), we apply the same criteria for separations to digital and physical items, as outlined in our processing manual.  The following categories of materials are thus typically not retained:
  • Out-of-scope material: items that were not created by or about the creator, or that fall outside of the Bentley's collecting priorities.
  • Non-unique or published material that is readily available in other libraries, another collection at the Bentley, or in a publication.  
  • Non-summary financial records such as itemized account statements, purchase orders, vouchers, old bills and receipts, cancelled checks, and other miscellaneous financial records.
  • "Housekeeping" records such as routine memos about meeting times, reminders regarding minor rules and regulations, or information about social activities.
  • Duplicate material.
During our review and appraisal of digital archives, we thus keep an eye out for entire directories that contain the above categories of materials.  Given the impracticality of looking at every file, we rely upon reviewing directory and filenames and then viewing/rendering a representative sample of content as needed (using tools mentioned in my previous post on appraisal).  The goal here is not to search for individual files that meet the above criteria, but to catch larger aggregations of content that are simply not appropriate for our permanent collections.

Automation Alley

As with other aspects of our ingest and processing workflows, we've tried to automate (or at least semi-automate) aspects of our separations/deaccessioning efforts.  Two examples of this were discussed in that aforementioned entry on appraisal: scanning for sensitive personal information with bulk_extractor (which still requires the manual review and verification of results) and the identification of duplicate content.

As I discussed our approach to separating duplicate content in that piece, I won't rehash it here (short version: we don't weed individual files, but will deaccession an entire 'backup' folder if it mirrors content in another directory.  Wait--was that a rehash? Sorry...).  I will, however, note that there have been some informative discussions on the topic of duplicates on the 'digital curation' Google group, including this thread on photo deduplication which has some great links...
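
To make the 'mirrored backup folder' idea a little more concrete, here is a minimal sketch in Python. It is not the tool described in the earlier appraisal post; it simply assumes that "mirrors" can be approximated by comparing SHA-256 checksums of file contents, and the function names (`file_digest`, `directory_digests`, `mirrored_fraction`) are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def directory_digests(root):
    """Return the set of content digests for every file under root."""
    return {file_digest(p) for p in Path(root).rglob("*") if p.is_file()}


def mirrored_fraction(candidate, reference):
    """Fraction of distinct file contents in candidate also found in reference.

    Assumption: a value at or near 1.0 suggests that the candidate folder
    (e.g. a 'backup' directory) duplicates the reference directory.
    """
    candidate_digests = directory_digests(candidate)
    if not candidate_digests:
        return 0.0
    reference_digests = directory_digests(reference)
    return len(candidate_digests & reference_digests) / len(candidate_digests)
```

A result of 1.0 would only suggest that a folder is a candidate for separation; the decision itself stays with the archivist, and the comparison works on whole folders rather than weeding individual files.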

Another strategy that we've employed in our current workflows and are developing with Artefactual Systems for inclusion in Archivematica's new appraisal and arrangement tab is functionality to separate all files with a particular extension.  Our primary use case has been to remove system files (such as thumbs.db, temporary or backup files, .DS_Store and the contents of __MACOSX directories, including resource forks) that we've considered to be artifacts of the operating system rather than the outcome of functions and activities of our collections' creators.
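
As an illustration only (this is not the Archivematica functionality under development), a name- and extension-based pass over a list of relative paths might look something like the Python sketch below. The sets of names and suffixes are assumptions chosen for the example, and the function names are hypothetical; AppleDouble `._` files are reported under their own label so that they can be reviewed rather than removed outright.

```python
from pathlib import PurePosixPath

# Illustrative lists only; a real workflow would maintain its own.
SYSTEM_FILE_NAMES = {"thumbs.db", ".ds_store"}
SYSTEM_DIR_NAMES = {"__MACOSX"}
TEMP_SUFFIXES = {".tmp", ".bak"}


def artifact_reason(relative_path):
    """Return a short label if the path looks like an OS artifact, else None.

    Paths are assumed to use forward slashes.
    """
    path = PurePosixPath(relative_path)
    if any(part in SYSTEM_DIR_NAMES for part in path.parts[:-1]):
        return "inside __MACOSX"
    if path.name.lower() in SYSTEM_FILE_NAMES:
        return "system file"
    if path.name.startswith("._"):
        return "AppleDouble sidecar"
    if path.suffix.lower() in TEMP_SUFFIXES:
        return "temporary or backup file"
    return None


def plan_separations(relative_paths):
    """Pair each flagged path with its label; nothing is deleted here."""
    flagged = []
    for relative_path in relative_paths:
        reason = artifact_reason(relative_path)
        if reason is not None:
            flagged.append((relative_path, reason))
    return flagged
```

The sketch only produces a list for review, which keeps the actual separation (and the record of why it happened) a deliberate step.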

On this score, some recent discussions from the Digital Curation and Archivematica Google groups have been relevant.  Jay Gattuso, Digital Preservation Analyst at the National Library of New Zealand, notes that their "current approach is to not ingest the .DS_Store files, as they are not regarded by as an artefact of the collection, more as an artefact of the system the collection came from."  This represents our general line of thinking, which has also influenced our approach to migrating content off removable media and disk imaging: we are devoting our resources to the preservation of content rather than the preservation of file systems and storage environments (except in cases where the preservation of that file system or environment is essential to maintaining the functionality and/or accessibility of important content).

Chris Adams adds some important nuances to the same thread, reporting that .DS_Store files are "only used to store custom desktop settings and the most which would happen without them is that you'd lose a custom background or sort order" and noting that "Resource forks (i.e. ._ files on non-HFS filesystems) are far more of a concern because classic Mac applications often stored important user data in them – the classic example being text documents where the regular file fork had only plain text but the resource fork contained styling, images, etc. which are critical for displaying the document as actually authored."

Knowing that there could be important information in legacy resource forks reinforces the need to discuss record creation and management practices with donors as part of the accession process.  In many cases, however, these conversations aren't convenient or even possible (as when we deal with the estate of a deceased creator).  What do we do then?  A quick Google search reveals several tools to view/extract the contents of resource forks, and it strikes me that it might be possible to put together a script that could cycle through resource forks and flag any that contain additional information and should thus be preserved.  Not having really worked with resource forks or (to my knowledge) encountered one that stored the kind of additional information Adams mentions, I don't know how feasible this would be.
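
A very rough Python sketch of such a flagging pass follows; it is untested against real collections. It assumes that the resource forks arrive as AppleDouble `._` sidecar files (the form they take on non-HFS filesystems) and reads only the AppleDouble header to see how many bytes the resource fork entry claims to hold. The function names and the `min_bytes` threshold are hypothetical, and the threshold would need calibrating against real accessions, since an effectively empty fork may still occupy some bytes.

```python
import struct

APPLEDOUBLE_MAGIC = 0x00051607  # magic number that opens an AppleDouble file
RESOURCE_FORK_ENTRY_ID = 2      # entry ID assigned to the resource fork
HEADER_SIZE = 26                # magic, version, 16-byte filler, entry count
DESCRIPTOR_SIZE = 12            # entry ID, offset, length (4 bytes each)


def resource_fork_length(data):
    """Return the resource fork length declared in an AppleDouble header.

    Returns None if the bytes are not AppleDouble, and 0 if the header
    lists no resource fork entry.
    """
    if len(data) < HEADER_SIZE:
        return None
    magic, _version = struct.unpack(">II", data[:8])
    if magic != APPLEDOUBLE_MAGIC:
        return None
    (entry_count,) = struct.unpack(">H", data[24:26])
    for index in range(entry_count):
        start = HEADER_SIZE + DESCRIPTOR_SIZE * index
        descriptor = data[start:start + DESCRIPTOR_SIZE]
        if len(descriptor) < DESCRIPTOR_SIZE:
            break  # truncated header; stop reading
        entry_id, _offset, length = struct.unpack(">III", descriptor)
        if entry_id == RESOURCE_FORK_ENTRY_ID:
            return length
    return 0


def needs_review(data, min_bytes=0):
    """Flag a sidecar whose resource fork is larger than min_bytes.

    min_bytes is a hypothetical threshold, not an established value.
    """
    length = resource_fork_length(data)
    return length is not None and length > min_bytes
```

A pass like this could only say that a fork is non-trivial in size, not whether its contents matter; flagged files would still need a look with one of the viewing/extraction tools.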


So...where does this leave us?  

As part of our grant, we want to be able to separate/deaccession material from within Archivematica by applying tags to specific folders/files or by doing a bulk operation based upon file format extension.  Once the deaccession is finalized, Archivematica would query the user for a description and rationale for the action and create a deaccession record in ArchivesSpace.

Beyond this development work, we're trying to inform our decision-making process for separations/deaccessions by better understanding researcher needs and expectations.  Our archivists have participated in HASTAC 2015 up at Michigan State University in addition to various digital humanities events here at the University of Michigan.  Knowing what kinds of data, tools, and procedures are gaining popularity will hopefully help us save more of the materials (and metadata) that researchers want and need.  Of course, there are still the Rumsfeldian unknown unknowns to contend with....

In any case, your input and feedback (and/or scalable solutions for what to do with those Mac resource forks) would be most gratefully appreciated: let us know what you think!


  1. Is it possible to note whether the deaccession happened during initial appraisal or during later processing? I think that will be important to note.
    Also, are you thinking of retaining a trace of the original material (like a DROID report without hashes) that shows the file/folder structure and filenames before appraisal / deletion? - Kari Smith, MIT Institute Archives.

    1. Thanks for reading and your great questions, Kari!

      In response to your first question, in our emerging ASpace-Archivematica workflow, an 'initial appraisal' would probably occur in the field (prior to accession) or as we are creating transfers in Archivematica to send content into the backlog ("yes, include this; no, don't include that"). In the latter case, we would manually create a deaccession record in ASpace. We're aiming to be as flexible as possible in our development work and so it probably ultimately comes down to how an institution structures its workflows and establishes conventions for recording such information.

      For question two: yes, we do want to make sure that the original context/directory structure of materials is documented/maintained in some way. Our most recent conversations have surfaced the idea of including output from the 'tree' command, though a DROID report could also be useful in that regard.
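
      Where the 'tree' command itself isn't available, something similar could be approximated in a few lines of Python. The sketch below is only an illustration of the idea (the function name `directory_listing` is hypothetical, and the output format is not that of 'tree'): it records each folder and file as an indented line that could be saved alongside the content.

```python
import os


def directory_listing(root):
    """Return an indented listing of root, one line per folder or file."""
    lines = []
    for current, dirs, files in os.walk(root):
        dirs.sort()  # visit subfolders in a stable, alphabetical order
        relative = os.path.relpath(current, root)
        depth = 0 if relative == "." else relative.count(os.sep) + 1
        folder = os.path.basename(os.path.abspath(current))
        lines.append("  " * depth + folder + "/")
        for name in sorted(files):
            lines.append("  " * (depth + 1) + name)
    return lines
```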

      In our current non-Archivematica workflow, we are packaging content in .zip files for deposit in our DSpace repository. If a subfolder is being packaged, we include any parent directories so that once researchers download and extract content from the zip, they can see the whole directory structure and better understand the context in which materials were created or maintained. ...or at least that's what we hope for. I'd love to hear from you (or anyone else) as to the best way to document and convey to researchers the provenance and original order of digital archives when they are placed in a digital repository environment!

      Thanks again--