To Save or Not to Save...

My old mentor, Tom Powers, used to say that the business of archives is not just about saving things but also about throwing them away--whether to adhere to collecting policies, conserve resources, or help researchers identify truly essential records. Identifying 'separations' is therefore an essential part of the appraisal process here at the Bentley Historical Library. While your institution might refer to this process as 'weeding' or 'deaccessioning', I'm willing to bet that our goals are the same: removing out-of-scope or superfluous content from accessions before they become part of our permanent collections.
At the same time, we're in this for the long haul, not a sprint; the overall sustainability of our collections (and institutions) demands the strategic use of our limited resources. Saving everything "just in case" isn't good archival practice: it's hoarding. We're also keenly aware that having staff review and weed content at the item level is neither efficient nor sustainable (and certainly not the best use of staff time and salaries). Balancing available resources and staff time with the widest possible (and practical) future uses of digital archives thus becomes the crux of the matter (and something we'll no doubt continue to wrestle with for many moons to come)...
Documenting our Policies

No matter what course of action we take, it remains important to make our thoughts and reasoning available, if for no other reason than to manage stakeholder expectations. Since revamping our processing workflows for physical and digital materials last year, we've placed renewed emphasis on 'MPLP' strategies, especially the idea that any weeding or separations should occur at the same level (i.e., folder or series) as the arrangement and description. Item-level weeding is strictly avoided unless there are particularly compelling reasons to remove individual items (such as extremely large file sizes or the presence of sensitive personal information). In general, we target the following categories of material for separation:
- Out-of-scope material: items that were not created by or about the creator, or that fall outside of the Bentley's collecting priorities.
- Non-unique or published material that is readily available in other libraries, another collection at the Bentley, or in a publication.
- Non-summary financial records such as itemized account statements, purchase orders, vouchers, old bills and receipts, cancelled checks, and other miscellaneous financial records.
- "Housekeeping" records such as routine memos about meeting times, reminders regarding minor rules and regulations, or information about social activities.
- Duplicate material.
Automation Alley

As with other aspects of our ingest and processing workflows, we've tried to automate (or at least semi-automate) aspects of our separations/deaccessioning efforts. Two examples were discussed in the aforementioned entry on appraisal: scanning for sensitive personal information with bulk_extractor (which still requires manual review and verification of results) and the identification of duplicate content.
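On the bulk_extractor side, the manual review step can be made a little less painful by summarizing the tool's plain-text output before anyone starts eyeballing individual hits. The sketch below assumes the standard feature-file layout (comment lines starting with '#'; data lines tab-separated as offset, feature, context); the `summarize_features` helper and the sample lines are illustrative, not part of our actual workflow.

```python
# Sketch: tally features from a bulk_extractor feature file (e.g. pii.txt)
# so a reviewer can prioritize manual verification. Assumes the standard
# layout: '#' comment lines; data lines of offset <TAB> feature <TAB> context.
from collections import Counter

def summarize_features(lines):
    """Count how often each extracted feature appears in a feature file."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip banner/comment lines
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            counts[parts[1]] += 1
    return counts

# Made-up sample lines for illustration:
sample = [
    "# BANNER",
    "1024\t555-01-2345\t...context...",
    "2048\t555-01-2345\t...context...",
    "4096\t555-99-0000\t...context...",
]
print(summarize_features(sample).most_common(1))  # → [('555-01-2345', 2)]
```

A feature that recurs dozens of times is often a false positive (a serial number, a form template) that a reviewer can dismiss in one pass rather than hit by hit.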
Since I discussed our approach to separating duplicate content in that piece, I won't rehash it here (short version: we don't weed individual files, but we will deaccession an entire 'backup' folder if it mirrors content in another directory. Wait--was that a rehash? Sorry...). I will, however, note that there have been some informative discussions on the topic of duplicates on the 'digital curation' Google group, including this thread on photo deduplication, which has some great links...
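To make that directory-level approach concrete, here's a minimal sketch (the `fingerprint` and `mirrors` helpers are hypothetical names, not tools we use in production) that flags a candidate 'backup' folder only when every file in it is byte-identical to a file at the same relative path in another directory:

```python
# Sketch: flag a candidate 'backup' directory whose contents fully mirror
# another directory, so the whole folder (not individual files) can be
# reviewed for separation. Helper names are illustrative.
import hashlib
import os

def fingerprint(root):
    """Map each file's path (relative to root) to an MD5 of its contents."""
    fp = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            with open(full, "rb") as f:
                fp[rel] = hashlib.md5(f.read()).hexdigest()
    return fp

def mirrors(candidate, original):
    """True if every file in candidate also exists, byte-identical and at the
    same relative path, in original."""
    cand, orig = fingerprint(candidate), fingerprint(original)
    return bool(cand) and all(orig.get(rel) == md5 for rel, md5 in cand.items())
```

Note the asymmetry: `mirrors(backup, master)` can be true even if the master directory has additional files, which matches the intuition that a backup is a (possibly partial) copy.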
Another strategy that we've employed in our current workflows and are developing with Artefactual Systems for inclusion in Archivematica's new appraisal and arrangement tab is functionality to separate all files with a particular extension. Our primary use case has been to remove system files (such as thumbs.db, temporary or backup files, .DS_Store and the contents of __MACOSX directories, including resource forks) that we've considered to be artifacts of the operating system rather than the outcome of functions and activities of our collections' creators.
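A bare-bones sketch of that kind of filename/extension-based separation might look like the following. The pattern list is just an illustrative starting point (not our full policy), and `find_system_files` is a hypothetical helper, not Archivematica's implementation:

```python
# Sketch: list files matching common system-artifact patterns (Thumbs.db,
# .DS_Store, temp/backup files) or living under a __MACOSX directory, as
# candidates for separation. Pattern list is an illustrative assumption.
import fnmatch
import os

SYSTEM_PATTERNS = ["Thumbs.db", ".DS_Store", "*.tmp", "*.bak", "desktop.ini"]

def find_system_files(root):
    """Return paths of likely OS artifacts beneath root, for human review."""
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        in_macosx = "__MACOSX" in dirpath.split(os.sep)
        for name in filenames:
            if in_macosx or any(fnmatch.fnmatch(name, p) for p in SYSTEM_PATTERNS):
                candidates.append(os.path.join(dirpath, name))
    return candidates
```

Crucially, a script like this only nominates candidates; the decision to separate them stays with an archivist, in keeping with the folder-level review described above.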
On this score, some recent discussions from the Digital Curation and Archivematica Google groups have been relevant. Jay Gattuso, Digital Preservation Analyst at the National Library of New Zealand, notes that their "current approach is to not ingest the .DS_Store files, as they are not regarded by [us] as an artefact of the collection, more as an artefact of the system the collection came from." This represents our general line of thinking, which has also influenced our approach to migrating content off of removable media and disk imaging: we are devoting our resources to the preservation of content rather than the preservation of file systems and storage environments (except in cases where the preservation of that file system or environment is essential to maintaining the functionality and/or accessibility of important content).
Knowing that there could be important information in legacy resource forks reinforces the need to discuss record creation and management practices with donors as part of the accession process. In many cases, however, these conversations aren't convenient or even possible (as when we deal with the estate of a deceased creator). What do we do then? A quick Google search reveals several tools to view/extract the contents of resource forks...it strikes me that it might be possible to put together a script that could cycle through resource forks and flag any that contain additional information and should thus be preserved. Not having really worked with resource forks or (to my knowledge) encountered one that stored the kind of additional information Adams mentions, I don't know how feasible this would be.
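As a rough proof of concept (untested against real legacy forks), such a script might lean on the macOS convention of exposing a file's resource fork at 'path/..namedfork/rsrc'; the helper name and size threshold below are assumptions, and on non-Mac file systems the fork path simply won't exist:

```python
# Sketch: flag files whose resource forks are non-empty so a human can review
# them before separation. Relies on the macOS '..namedfork/rsrc' path
# convention for reading resource forks; min_bytes threshold is an assumption.
import os

def flag_resource_forks(root, min_bytes=1):
    """Return paths under root whose resource fork is at least min_bytes long."""
    flagged = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rsrc = os.path.join(dirpath, name, "..namedfork", "rsrc")
            try:
                if os.path.getsize(rsrc) >= min_bytes:
                    flagged.append(os.path.join(dirpath, name))
            except OSError:
                pass  # no resource fork (or a non-HFS+/APFS volume)
    return flagged
```

A real version would need to distinguish forks holding only routine Finder metadata from those holding genuinely unique content, which is exactly the judgment call the blog post is wondering about.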