Friday, September 25, 2015

Overcoming Digital Preservation Challenges to Better Serve Users

This afternoon, I'll be participating on a panel at Network Detroit, a conference aimed at sharing and promoting cutting-edge digital work in the humanities. In it, I'll be giving a slightly different spin on the ArchivesSpace-Archivematica-DSpace Workflow Integration project, one that "fits it into a larger socio-historical and theoretical professional context" and that, hopefully, is halfway interesting to digital humanities folks. I even wrote it out humanities conference-style. Citations for quotes and images (as well as acknowledgements for some stuff I stole wholesale from Mike's last blog post on access) are currently in my notes and slides; I'll get those in here soon.

Also, thank you Dallas for organizing a reading group that helped introduce me to archival theory.

Good afternoon. My name is Max Eckard, and I’m the Assistant Archivist for Digital Curation at the Bentley Historical Library (University of Michigan). For some context, and I’m quoting here from our website, “the Bentley Historical Library collects the materials for and promotes the study of... two great... institutions, the State of Michigan [including Detroit] and the University of Michigan.” If you can’t tell from my job title, I work with digital stuff.

I came here today to talk about our Mellon foundation-funded ArchivesSpace-Archivematica-DSpace Workflow Integration project. 

It’s pretty exciting. We’re taking three open source pieces of software, each of which currently occupies its own space in the digital curation ecosystem, and integrating them into an end-to-end digital archiving workflow. It will facilitate the formal ingest, description and overall curation of digital archives, all the way to deposit into a preservation and access repository. We’re even doing it in a way that will facilitate this workflow for the larger community as well.

And I still plan to talk about this project, but just a little. I started thinking about the audience today--that is, digital humanities folks--and I realized that you might not actually be that interested in the details of what we’re doing.

Instead, I’d like to use this time to think a little bigger about this project and what it is--really--that we’re trying to accomplish. I also thought I'd try my hand at making a real, formal argument--I even wrote this out humanities-conference style.

I’ll start with the concept of archives, what they are, what they do and competing visions for how they function in society and culture. You may even get your first introduction to the way that archivists think about archives. Then I’ll talk about our project, where it fits into this larger socio-historical and theoretical professional context, and how, in many ways, it is a practical critique of our collective professional reaction to the Digital Revolution. Finally, I’ll conclude with my perspective on how and where our profession would like to grow, and here’s a spoiler: it has to do with that second “end” in end-to-end, access. All in 15-20 minutes. So here we go…

Definitions are important. You can’t have a conversation, let alone make an argument, without them.

As you might imagine, there’s not one way that the word “archives” has been or is defined, but I’m going to suggest that we use this definition from the Society of American Archivists (SAA), the professional association for members of our profession, at least in the United States:

Now, I know this isn’t the archives you may have heard about from humanist theorists like Jacques Derrida or Michel Foucault. I also know this isn’t the way this word gets used in the vernacular. However, I do think there is something we can all learn from a key difference I see between those humanist definitions and the vernacular use of the word “archives” on the one hand, and the SAA’s definition on the other: namely, the latter’s emphasis on the collection itself--and the people and organizations that created it--as well as the practical consideration of how it should be collected, that is, on product and process.

Before continuing I’d like to draw your attention to the phrase in that definition that starts with “especially” because it enumerates some of the oldest modern archival principles:


...and original order... as well as collective control.

Right from the very beginning of modern archival thinking, which really came into its own after the French Revolution in Europe, provenance and original order have been core archival principles.

In this context, the term “provenance” connotes the individual, family, or organization that created or received the items in a collection. The principle of provenance or the respect des fonds dictates that records of different origins (provenance) be kept separate to preserve their context.

In this context, the term “original order” connotes the organization and sequence of records established by the creator of the records.

Both were codified in The Manual for the Arrangement and Description of Archives (Manual) of 1898, which detailed these and many other rules concerning both the nature and treatment of archives.

It’s interesting to note that the very first rule in the Manual, the “foundation upon which everything must rest,” according to its authors, gave its own definition of the word archives: “the whole of the written documents, drawings and printed matter, officially received or produced by an administrative body or one of its officials.” Already that should give you some sense of the context out of which these particular principles came, but I’ll get more into that in a second.

If you were to further investigate that definition from the SAA, you’d see that there is an extensive Notes section, which details two more prominent thinkers (DWEM) in the history of archival theory: Hilary Jenkinson and Theodore Schellenberg.

Jenkinson, writing just twenty-four years after the Manual in 1922, defended archives as “impartial evidence” and envisioned archivists as “guardians” of that evidence. He argued that the only real archives were those records that were “part of an official transaction and were preserved for official reference.” For Jenkinson, who, importantly, was coming out of the same context from which the Manual came (and again, more on that in a second), the records creator is responsible for determining which records should be transferred to the archives for preservation. So there’s another archival principle:

...evidential value.

I’ll note here, because I am also supposed to be thinking about cultural criticism and how it relates to my topic, that it is in this early context--the context that produced the Manual and Jenkinson, with its emphasis on “administrative bodies” and “officials,” who alone (because archivists were impartial!) were able to determine which records would be preserved for posterity--that the postmodern critique of archives is perhaps most obviously justified: the critique of archives as political agents of collective memory, whose institutional origins legitimized institutional, statist power and helped to marginalize those without such power. (How far we’ve come since then is definitely up for debate--and trust me, I’m probably on your side.)

Fast forward to the middle of the twentieth century, when Schellenberg was writing. Times were changing. Archives were still largely institutional, but the nature of the records they collected was very different. Facing a paper avalanche in the mounting crisis of contemporary records, archivists could no longer responsibly retain the “whole,” as the Manual put it (and Foucault, I might add), of anything. Archival theory responded by shifting from focusing on preservation of records to selection of records for preservation. Schellenberg called this selection process:


To quote the SAA:

He also advocated for working with researchers to determine what records had secondary value. I think this is a pretty exciting development in this story, although in his time “researchers” really just meant “historians,” so sorry, digital humanities folks.

Many archivists, especially in the United States and including us at the Bentley, have been influenced by Schellenberg.

And then came the Digital Revolution. From Wikipedia:

Analogous to the Agricultural Revolution and Industrial Revolution, the Digital Revolution marked the beginning of the Information Age.

Needless to say, the Digital Revolution has had a profound effect on the nature and treatment of archives in contemporary society, as well as their use. And this is only to be expected, because the very context that produced the Manual and the works of Jenkinson has fundamentally changed.

With the advent of the Internet and social media and the democratizing effect these have had on society (And I’m thinking here of the Arab Spring, and other informal, non-hierarchical movements like #OccupyWallStreet and #BlackLivesMatter...), no one can honestly say that the only records that make a difference anymore, even politically, are those that are produced by “administrative bodies” or “officials.”

Likewise, if Schellenberg thought there was too much paper back in his day, what would he have thought of today’s version of that crisis, which is on a totally different scale? Did you know that there are:
  • 2.9 million emails sent, every second;
  • 375 megabytes of data consumed by households, each day;
  • 24 petabytes of data processed by Google, per day (did you even know petabytes was a word?);
  • etc., etc., etc.

What would Schellenberg have thought of Big Data?

So, the context has fundamentally changed… but wait, there’s more. The records themselves have also fundamentally changed. Digital records (the actual stuff that gets archived) are much more fragile than their physical counterparts, and we have less experience with them.

Digital preservation is challenging!

Specifically, there are issues with digital storage media:
"Digital materials are especially vulnerable to loss and destruction because they are stored on fragile magnetic and optical media that deteriorate rapidly and that can fail suddenly." (Hedstrom and Montgomery 1998)

There are issues with changes in technology:
"Unlike the situation that applies to books, digital archiving requires relatively frequent investments to overcome rapid obsolescence introduced by galloping technological change." (Feeney 1999)

And, there are issues with authenticity and integrity:
“While it is technically feasible to alter records in a paper environment, the relative ease with which this can be achieved in the digital environment, either deliberately or inadvertently, has given this issue more pressing urgency.”

There are other issues, like money (of course!), the fact that access always has to be mediated, and the myth that digital material is somehow immaterial, but I don’t really want to get into all that here.

So we, as a profession, started scrambling. We even invented a whole new specialization within library and information science to deal with these radical changes in context and content: curation. Since its inception over 10 years ago, digital curators have been hard at work developing strategies that help to mitigate some of the risks I just enumerated, in part to help ensure continued access to digital materials for as long as they are needed.

We adopted and created models, for example, like the Open Archival Information System (OAIS) Reference Model and the DCC Curation Lifecycle Model to inform the systems and the work that we do. We created metadata schemes to record new types of information about digital material, like the Preservation Metadata Implementation Standard (PREMIS), which records information on…

...provenance (sound familiar?), but also preservation activity done to preserve digital material, the technical environment needed to render or interact with it, and rights management.

We started using techniques like checksums and file format migrations in order to verify the authenticity and integrity of digital material (because digital material doesn’t have....

...evidential value if it isn’t what it purports to be).
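For the curious, a fixity check with checksums really is just a few lines of code. Here is an illustrative sketch in Python (this is not our actual toolchain, and the stored hash would come from wherever you record fixity information):

```python
import hashlib

def checksum(path, algorithm="sha256", block_size=65536):
    """Compute a file's checksum by reading it in chunks,
    so even very large files don't have to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def verify_fixity(path, stored_hash):
    """Confirm the material still is what it purports to be."""
    return checksum(path) == stored_hash
```

Run periodically against hashes recorded at ingest, this is how a repository notices silent corruption or tampering.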

We even borrowed techniques from law enforcement called write-blocking and disk imaging, so that we could make exact, sector-by-sector copies of source mediums, perfectly replicating the structure and contents of a storage device, which I think sounds a whole lot like a techy version of…

...original order.
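In practice, imaging is done with hardware write blockers and dedicated tools, not ad hoc scripts, but the core idea of a sector-by-sector copy with verification can be sketched in Python (the sector size and paths here are illustrative assumptions):

```python
import hashlib

SECTOR_SIZE = 512  # a common logical sector size

def image_device(source_path, image_path):
    """Copy a source device/file sector by sector, hashing as we go
    so the finished image can be verified against the original."""
    source_hash = hashlib.sha256()
    with open(source_path, "rb") as src, open(image_path, "wb") as img:
        while True:
            sector = src.read(SECTOR_SIZE)
            if not sector:
                break
            source_hash.update(sector)
            img.write(sector)
    return source_hash.hexdigest()
```

Because every sector is copied in order, the image preserves the exact structure of the source medium, which is what makes the "original order" analogy work.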

Along the way, a lot of education and advocacy has occurred (and is occurring) around these issues, for archivists and content creators alike. In fact, much of this is what the Archivematica part of the ArchivesSpace-Archivematica-DSpace Workflow Integration project is good at.

So, where have these developments left us? Technology-wise, it’s maybe mid-2000s. Digital curators often complain that the technology we’ve created to deal with digital archives seems to lag about 10 years behind the archives themselves. Archival theory-wise, though, it’s probably more like 1924, with the Manual and Jenkinson, with provenance, original order and impartial evidence.

Others have observed that even in the face of the enormous scale of the digital deluge, archivists somewhat ironically abandoned another core component of archival theory (also mentioned in the SAA definition, even though I haven’t talked much about it here):

...collective control.

This they did in favor of item-level description, and, with it, “informational content over provenance and context,” treating digital objects as “discrete and isolated items” rather than as part of the “comprehensive information universe of the record creator,” but I think you might have to be an archivist to appreciate that one.

OK, I’d like to start to wrap this up. Two things and then I’m done.

The first is that the ArchivesSpace-Archivematica-DSpace Workflow Integration project is definitely overcoming digital curation and preservation challenges, and it’s doing so in a novel way that brings contemporary archival practice back in line with contemporary archival theory. To be honest, the Curation (previously Digital Curation) division at the Bentley already had a strong, nationally-recognized reputation for this, but for dramatic effect I’ll just pretend that this is all thanks to our project!

The project has a number of goals, but the one that has taken the most time and resources is the development of new appraisal and arrangement functionality in Archivematica, so archivists may review content and, among other things, deaccession or separate some or even all of it from a collection. That is, so that archivists can begin to do with digital records what they have been doing for a long time now with paper records:

...some good ole fashioned Schellenbergian appraisal.

The arrangement part of this new functionality is also really exciting. It helps to address the collective control issue I outlined earlier by allowing archivists to create intellectual arrangements and associate them with archival objects from ArchivesSpace in a pretty sophisticated way, in aggregate, with APIs and everything. Really, this is cool stuff!

But now what? When the grant is over, where will we be?

I really wanted to end this talk by asserting that our work leaves us (us at the Bentley and us as a profession) in a better place to serve users like you. And it does. We were already helping to mitigate all of those risks for everything that comes in our door, making sure it will be available and usable for future generations. At the end of this project we will be doing it even better than we are now.

But, if I’m honest, I’m actually not entirely sure that this project leaves us in a place to better serve users, or, as we call it, to provide “access,” at least not directly.

Actually, and I’m trying to think critically here, what it does is make our lives easier. It improves our process so we can make more product, and, I hate to say it, but our profession is notorious for thinking about product and process, sometimes at the expense of the end users--read, people--that we’re doing all this for. When we do think about access, it’s often an afterthought, and it’s usually about how to lock it down. In fact, there’s not even a reference to access (or users, or even researchers) in the definition of archives I provided earlier, that is, the definition provided by the professional organization for archivists, even if you check the Notes!

But there’s some good news. We are getting better about this. The Bentley is putting a lot of time and resources into some exciting initiatives to provide better access to our materials, both physical and digital, getting some audio and video online, and to engage users that we haven’t traditionally engaged. As a profession, we’re also getting better. It is now common for library and information science programs to teach courses on user experience, and at the SAA’s annual meeting this past August, we talked a lot about access, especially for digital material. It’s an exciting time to be in this field.

At the end of this project, all I can really say is that we (at the Bentley and we in the profession) are more poised than we ever have been to, like you’re doing in your profession, shift the paradigm, to redefine archives even! We’re living in the Information Age, after all. Providing access can mean so much more than it ever has, and, with the Internet, it can break down barriers of time and space like never before. And yes, users like you are pushing the boundaries of what it means to do research with digital archives, and we have to keep up with the new ways you want access. And you should continue to demand this non-traditional type of access. And we should continue to try to keep up...

This last part is to all the archivists and librarians out there (digital humanities folks, you can take it or leave it):

Here’s what we’re currently thinking about with regard to access, at least practice-wise at the Bentley, although I suspect that an abstracted version of this is also what’s on deck for archival theory:
  • exploring and better understanding the challenges and opportunities surrounding the OAIS functional entity of ‘access’;
  • managing rights and enforcing restrictions/permissions (which is something, by the way, that we’ve also historically taken a progressive stance on because, as one of my favorite Tweets from this year's SAA Annual Meeting went:

  • establishing use metrics and collecting quantitative data regarding the impact of our collections and outcomes of curation activities;
  • permitting users a more seamless experience in searching for and using materials that are in disparate/siloed locations; and
  • leveraging linked data to facilitate research across collections and institutions.

At this stage in the game, we aren't even thinking about specific implementation strategies, but we do know that an access portal should emphasize interoperability, employ open source software and, importantly, focus on end users.

It’s taken us an amazingly long time (since 1898) to get to that last one, a focus on end users.

To conclude, archives have always been about product and process. I’d like to end by suggesting a third “P” to help us move forward in our thinking about access in archival theory and practice:


What did you think? Was it a fair assessment? Let us know!

Friday, September 18, 2015

What We Talk About When We Talk About Access

With a tip o' the hat to Raymond Carver, I want to use this post to try to illuminate (for myself, if nothing else) some of the angles and issues surrounding 'access' to digital archives.

On the surface, the topic appears simple: I have some stuff that I want people to see, so I put it online or provide a dedicated terminal in my reading room and—voila!—access!

Made available by Flickr user Steve Rhode under a CC Attribution-NonCommercial-NoDerivs 2.0 Generic License
But even in this rosy scenario, there are a lot of questions: what platform would you use to host things online?  Will access copies (i.e., DIPs) differ from preservation copies (AIPs)?  If using a dedicated terminal, how will content be organized and how will researchers find desired materials? If copying files to a terminal or removable media, will staff be able to respond to researcher requests in a timely fashion? And what about rights?

Now, I certainly don't want to be like a certain you-know-who...

...but there are a lot of considerations here. Complex ones, too.  At the same time, simply waiting around for the stars to align and the *perfect* solution to emerge won't cut it, either.  Therefore, inspired by the various presentations on access that I saw last month at SAA, I'd like to give a brief overview of our current approach to access and then lay out some of the questions and challenges we're starting to explore here at the Bentley.

Just Dropped In (To See What Condition My Condition Is In)

The Bentley Historical Library has taken a fairly aggressive (progressive?) approach to providing access to our 'open' or unrestricted digital archives.  All such content is freely available for download and use via our archival community in Deep Blue, the University of Michigan's DSpace repository:

Deep Blue is managed by staff in the University of Michigan Library Information Technology division and we considered ourselves to be very fortunate when we started using it as both a preservation repository and access portal in 2008.  Prior to that (and not having any in-house IT), digital materials were either placed on optical disk and brought out to patrons in our reading room or hung off of our website and linked to from finding aids.

Moving to Deep Blue/DSpace was clearly a step forward, but the change brought about some additional challenges due to the basic structure (dare I say data model?) of the repository:

  • A 'community' contains 'collections' (which may be grouped together in sub-communities; we've formed one of these for university faculty papers)
  • A collection in turn contains 'items' (which may be associated with one or more files or 'bitstreams').
  • The default metadata schema is Dublin Core (which makes crosswalking from EAD ...interesting...)
While this relatively flat structure works great for traditional institutional repository fare (white papers, articles, and discrete digital objects), it really isn't suited for the complex intellectual hierarchies of archival collections.  So we've had to make do...
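To make the mismatch concrete, here is a minimal, entirely hypothetical Python sketch of that data model (DSpace itself is Java and far richer than this); the key point is that the structure bottoms out at the item, with no way to nest description any deeper:

```python
from dataclasses import dataclass, field

@dataclass
class Bitstream:
    filename: str  # an actual file attached to an item

@dataclass
class Item:
    title: str  # the only place left to encode intellectual hierarchy
    bitstreams: list = field(default_factory=list)
    # note: no child items -- the hierarchy stops here

@dataclass
class Collection:
    name: str
    items: list = field(default_factory=list)

@dataclass
class Community:
    name: str
    collections: list = field(default_factory=list)
```

A finding aid, by contrast, can nest series within subgroups within record groups indefinitely, which is exactly what this flat model cannot express.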

As with our physical and/or analog record groups and manuscript collections, our materials in Deep Blue are organized by the principle of provenance (and are often extensions of existing collections):

Within a collection we have our items—and here's where we've had to get creative:

Given the flat structure of DSpace, we are using the title metadata to help group related content together and preserve the hierarchical intellectual arrangement of materials.  As a result, the following description from our Jennifer Granholm finding aid...
...becomes the following item title in Deep Blue:
We also package materials in .zip files so that we only have to manage one file and our users don't have to download hundreds or even thousands of files.  Because content must actually be downloaded to a local machine to be used or rendered (unless a particular file format renders with a browser plugin), we have taken to chunking content across multiple .zip files when it gets to be above 2 GB:

The above digital object represents speeches, addresses, and other audio recordings from former Michigan Governor Jennifer Granholm for the year 2010.  All together, there's about 20 GB of content; by dividing this body of content into smaller chunks representing each month of the year, we've made it a bit easier for folks to download content.  And while I certainly don't think this solution is ideal, it's still a lot better than bringing a stack of CDs out to folks in our reading room.
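Our actual chunking follows intellectual divisions (months of the year, in the Granholm case) rather than raw byte counts, but the underlying idea, keeping each .zip under a size ceiling, can be sketched like this (function and variable names are made up for illustration):

```python
def chunk_files(files, max_chunk_bytes=2 * 1024**3):
    """Greedily group (name, size) pairs into chunks, starting a new
    chunk whenever adding a file would push the current one over
    max_chunk_bytes. Each chunk would then become one .zip file."""
    chunks, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > max_chunk_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

A single file larger than the ceiling still gets its own chunk here; in practice you'd probably flag those for special handling rather than ship a 20 GB download.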

Earlier in this post, I alluded to open or unrestricted content; we actually have three access profiles based upon rights and restrictions:
  • Open materials may be accessed by anyone anywhere at any time.
  • Restricted materials are only available to system administrators and digital curation staff; the items are not visible to other users nor is the metadata searchable.  Content is restricted for a number of reasons, including specifications in a gift agreement; the presence of sensitive personal data (related to HIPAA or FERPA as well as credit card numbers and Social Security numbers); and internal policy (for example, all executive records of the University of Michigan, while FOIA-able, are restricted for 20 years from the date of accession).
  • Reading-room only materials may only be accessed by computers within the IP address range of the library itself (and are not accessible by patrons using university wifi). This class is primarily composed of content where we do not hold copyright or where donors have requested more restricted access. Our reading room rules, which all researchers must agree to follow, stipulate that these items "may not be copied, emailed or transferred in any way."  (While placing the burden on the researcher is by no means foolproof, it's much easier to implement and maintain than the locked-down computer terminals with which we earlier experimented.)
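For illustration only (the actual enforcement happens inside the repository platform, and the address range below is made up), the three access profiles amount to a decision like this, using Python's ipaddress module for the reading-room check:

```python
import ipaddress

# hypothetical reading-room address range
READING_ROOM = ipaddress.ip_network("192.0.2.0/28")

def access_profile(item_status, client_ip):
    """Map an item's restriction status and a client IP to an access
    decision, mirroring the open / restricted / reading-room profiles."""
    if item_status == "open":
        return "allow"
    if item_status == "reading-room":
        if ipaddress.ip_address(client_ip) in READING_ROOM:
            return "allow"
        return "deny"
    # "restricted": visible only to admins and curation staff,
    # which is handled by repository permissions, not IP checks
    return "deny"
```

The point of the sketch is just that "reading-room only" is a network-location rule layered on top of item-level restriction status.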
In addition to having the metadata and text-based file contents (when not packaged in .zip files) indexed by Google and other search engines, all materials are linked from online finding aids and/or catalog records.   People certainly seem to be finding our content, too: from 2008 through last month, we've registered 620,375 downloads (a figure that excludes downloads from robots or web crawlers).

Access: the Final Frontier

As we enter the final six months of our Mellon grant (and prepare to kick off a Hydra development project with colleagues at the University of Michigan Library), we have returned, time and again, to the challenge of providing access to digital archives.

There are a lot of great access portals to collections out there!  Some of the ones we've been particularly impressed with include those of:

Those are just a few of the many examples out there (and we meant to include your digital collections, but ran out of time...), but we've noticed that while these (and other innovative solutions) are vast improvements over an off-the-shelf option like CONTENTdm, they seem pretty unique to their local institutional context and IT environment.

The work we've been doing with Archivematica and ArchivesSpace has made us firm advocates of community-based approaches where folks at different institutions can share and contribute to common solutions, without having to reinvent the wheel (and continue to support and maintain that reinvented wheel. Indefinitely. All by themselves.).

This community interest recently led us to contribute user stories to the ArcLight project, "an effort to build a Hydra/Blacklight-based environment to support discovery (and digital delivery) of information in archives, initiated by Stanford University Libraries."  Likewise, we were excited to hear about the DPLA Archival Description Working Group and its implications for describing—and searching for and retrieving—digital archives.

Beyond the above, we've also been trying to articulate the different aspects or considerations related to access that could be common to cultural heritage institutions of all sizes and shapes.  These are some very, very rough ideas, but we're interested in how we can:

  • Explore and better understand the challenges and opportunities surrounding the OAIS functional entity of ‘access’
    • Present (and make understandable) the context/content of archival materials (including the relationships between digital, physical, and analog materials)
    • Enable search and retrieval of information while balancing item-level, aggregate, and collection-level description
    • Provide tools and functionality to view/render various formats (born-digital and digitized), including images, text, audio and moving image, web archives, and disk images 
    • Facilitate the analysis and reuse of data (including visual representations of metadata/data and tools or functionality that would facilitate distant reading of materials and other digital scholarship techniques)
    • Increase engagement with users (crowdsourcing or feedback)
  • Manage rights and enforce restrictions/permissions
  • Establish use metrics and collect quantitative data regarding impact of our collections and outcomes of curation activities
  • Permit users a more seamless experience in searching for and using materials that are in disparate/siloed locations: online catalogs, HathiTrust, digital repositories, web archives, etc.
  • Leverage linked data: facilitate research across collections and institutions
At this stage in the game, we aren't even thinking about specific implementation strategies, as it seems there could/should/might/shall/will be a core set of features or functional requirements that could exist independent of any particular repository platform.  Having said that, it seems to us that an access portal should:
  • Emphasize interoperability:
    • Create connections between tools and services
    • Permit us/other institutions to broaden current work and ‘plug in’ to larger framework
    • Avoid siloed/local solutions 
  •  Employ open source software:
    • There is a “need for engagement beyond simply making source code available, including supporting the development of user communities, creating adequate documentation, and cultivating relationships between developers working in libraries around the country.” (IMLS National Digital Platform)
  • Focus on end users
    • Meet needs within LAM communities for common solutions and interoperability as well as those of end users related to the access and use of digital archives.
    • End users are creating, accessing, and organizing content in ways that were never before possible and, in many cases, without the support of a knowledge professional.  The user should figure prominently in our strategy. How do we bring in their views, and identify the missing voices? (IMLS National Digital Platform)
So that's what we've been thinking about and pondering as of late... What are we missing?  What seems unnecessary?  What do you talk about when you talk about access?