Workflow Resources

Maybe we should pause for a wee bit of context (we are archivists, after all!). As I mentioned in a previous post, our 2010-2011 MeMail project gave us a great opportunity to explore and standardize our procedures for preparing born-digital archives for long-term preservation and access. It was also a very fruitful period of research into emerging best practices and procedures.
At the time, there wasn't a ton of publicly available documentation on workflows established by peer institutions, but the following projects proved to be tremendously helpful in our workflow planning and development:
- Personal Archives Accessible in Digital Media (paradigm) (2005-2007): an excellent resource for policy questions and considerations related to the acquisition, appraisal, and description of digital personal papers. By not promoting specific tools or techniques (which would have inevitably fallen out of date in the intervening years), the project workbook has remained a great primer for collecting and processing digital archives.
- AIMS Born-Digital Collections: An Inter-Institutional Model for Stewardship (2009-2011): another Mellon-funded project that involved the University of Virginia, Stanford University, University of Hull (U.K.) and Yale University. The project blog provided a wealth of information on tools, resources, and strategies and their white paper is essential reading for any archivist or institution wrestling with the thorny issues of born-digital archives, from donor surveys through disk imaging.
- Practical E-Records: Chris Prom's blog from his tenure as a Fulbright scholar at the University of Dundee's Center for Archive and Information Studies yielded a lot of great resources for policy and workflow development as well as reviews of handy and useful tools.
- Archivematica: we first became aware of Archivematica at the 2010 SAA annual meeting, when Tim Pyatt and Seth Shaw featured it in a preconference workshop. While the tool was still undergoing extensive development at this point (version 0.6), the linear nature of its workflow and clearly defined nature of its microservices were hugely influential in helping us sketch out a preliminary workflow for our born-digital accessions.
- Preliminary procedures to document the initial state of content and check for potential issues
- Step 1: Initial survey and appraisal of content
- Step 2: Scan for Personally Identifiable Information (primarily Social Security numbers and credit card numbers)
- Step 3: Identify file extensions
- Step 4: File format conversions
- Step 5: Arrangement and Description
- Step 6: Transfer and Clean Up
Microservices

One of the greatest lessons I took from the AIMS project and Archivematica was the use of microservices; that is, instead of building a massive, heavily interdependent system, I defined functional requirements and then identified a tool that would complete the necessary tasks. These tools could then be swapped out or shifted around in the workflow to permit greater flexibility and easier implementation.
Rather than dwell too extensively on the individual procedures in our workflow (that's what the manual is for!), I would like to provide some examples of how we accomplish steps using various command prompt/CMD.EXE utilities as microservices. Having said that, I feel compelled to call attention to the following:
- I am an archivist, not a programmer; at one point I thought I could use Python for AutoPro, but I quickly realized I had a much better chance of stringing something together with Windows CMD.EXE shell scripts, as they were easy to learn and use. Even then, I probably could have done better... give a holler if you see any egregious errors!
- For a great tutorial on command line basics, see AVPreserve's "Introduction to Using the Command Line Interface for Working with Files and Directories."
- As I hinted above, we're a Windows shop and the following code snippets reflect CMD.EXE commands. Many of these applications can be run on Mac/Linux machines via native versions or WINE.
- The CMD.EXE shell needs to know where non-native applications/utilities are located; users should CD into the appropriate directory or include the full systems path to the application in the command.
- If any paths (to applications or files) contain spaces, you will need to enclose the path in quotation marks.
- The output of all these operations is collected in log files (usually by redirecting STDOUT) so that we have a full audit trail of operations and a record of any actions performed on content.
- In the code samples, 'REM' is used to comment out notes and descriptions.
To document the initial state of content and check for potential issues, we first update virus definitions and then scan the processing directory with the command-line utility for Microsoft Security Essentials:

```
REM _procDir=Path to processing folder
"C:\Program Files\Microsoft Security Client\MpCmdRun.exe" -SignatureUpdate -MMPC
"C:\Program Files\Microsoft Security Client\MpCmdRun.exe" -scan -scantype 3 -file %_procDir%
```
```
REM _procDir=Path to processing folder
diruse.exe /B /S %_procDir%
CD /D %_procDir%
md5deep.exe -rclzt *
```
Diruse.exe will traverse the entire directory hierarchy (thanks to the /S option) and report the number of files and size (in bytes, due to the /B option) for each subdirectory, in addition to providing the total number of files and size for the main directory.
For md5deep, changing to the processing directory will facilitate returning relative paths for content. Our command includes the following parameters:
- -r: recursive mode; will traverse the entire directory structure
- -c: produces comma separated value output
- -l: outputs relative paths (as dictated by location on command prompt)
- -z: returns file sizes (in bytes)
- -t: includes timestamp of file creation time
- *: the asterisk indicates that everything in the present working directory will be included in output.
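For readers who'd rather not squint at switches, the gist of that md5deep manifest can be sketched in Python (an illustration only, not our production tooling; timestamp semantics differ slightly from md5deep's):

```python
import hashlib
import os

def build_manifest(proc_dir):
    """Walk proc_dir and collect (md5, size, mtime, relative path) rows,
    roughly what `md5deep -rclzt *` writes to its log."""
    rows = []
    for root, _dirs, files in os.walk(proc_dir):
        for name in sorted(files):
            full = os.path.join(root, name)
            with open(full, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            info = os.stat(full)
            rows.append((digest, info.st_size, int(info.st_mtime),
                         os.path.relpath(full, proc_dir)))
    return rows
```

As with the batch version, keeping paths relative to the processing directory means the manifest stays valid after the accession is moved into storage.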
Extract Content from Archive Files

To make sure that content stored in archive files is extracted and run through important preservation actions, we search for any such files and use 7-Zip to unpack them.
First, we search the processing directory for any archive files, and save the full path to a text file:
```
CD /D %_procDir%
DIR /S /B *.zip *.7z *.xz *.gz *.gzip *.tgz *.bz2 *.bzip2 *.tbz2 *.tbz *.tar *.lzma *.rar *.cab *.lza *.lzh | FINDSTR /I /E ".zip .7z .xz .gz .gzip .tgz .bz2 .bzip2 .tbz2 .tbz .tar .lzma .rar .cab .lza .lzh" > ..\archiveFiles.txt
```
The dir utility (similar to "ls" on a Mac or Linux terminal) employs the /S option to recursively list content and the /B option to return full paths. The list of file extensions (by no means the best way to go about this, but...) will only return paths that match this pattern. For greater accuracy, we then pipe ("|") this output to the findstr ("find string") command, which uses the /I option for a case-insensitive search and /E to match content at the end of a line.
We then iterate through this list with a FOR loop and send each file to our ":_7zJob" extraction function with the filename (%%a) passed along as a parameter:
```
FOR /F "delims=" %%a in (..\archiveFiles.txt) DO (CALL :_7zJob "%%a")
REM when loop is done, GOTO next step

:_7zJob
REM Create folder in _procDir with the same name as the archive; if the folder already exists, get user input
SET _7zDestination="%~dpn1"
MKDIR %_7zDestination%

REM Run 7-Zip to extract files
7z.exe x %1 -o%_7zDestination%

REM Record results (both success and failure)
IF %ERRORLEVEL% NEQ 0 (
    ECHO FAILED EXTRACTION
    GOTO :EOF
) ELSE (
    ECHO SUCCESSFUL EXTRACTION!
    GOTO :EOF
)
```
As the path to each archive file is sent to :_7zJob, we use CMD.EXE's built-in parameter extension functionality to isolate a folder path with the same root name as the archive file (%~dpn1; Z:\unprocessed\9834_0001\newsletters.zip thus becomes Z:\unprocessed\9834_0001\newsletters). This path will be the destination for files extracted from a given archive file; we save it as a variable (%_7zDestination%) and create a folder with the MKDIR command.
We then run 7-Zip, using the 'x' option to extract content from the archive file (represented by %1) and use the -o option to send the output to our destination folder. Finally we check the return code (%ERRORLEVEL%) for 7-Zip; if it is not equal to 0 then extraction has failed. Our production script includes an option to retry the operation.
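The same extract-into-a-sibling-folder logic can be sketched in Python for plain .zip files (7-Zip handles far more formats; the standard library's zipfile keeps this illustration self-contained):

```python
import os
import zipfile

def extract_archive(archive_path):
    """Extract a .zip into a folder named after the archive (path minus
    extension), mirroring the :_7zJob batch function."""
    destination = os.path.splitext(archive_path)[0]
    os.makedirs(destination, exist_ok=True)
    try:
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(destination)
        return True   # SUCCESSFUL EXTRACTION
    except zipfile.BadZipFile:
        return False  # FAILED EXTRACTION; the production script offers a retry
```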
Length of Paths

Because Windows cannot handle file paths longer than 255 characters, we run tlpd.exe ("Too Long Paths Detector") to identify any files or directories that might cause us trouble.
```
REM _procDir=Path to processing folder
START "TOO LONG PATHS" /WAIT tlpd.exe %_procDir% 255
```
As we're calling this application from a batch (".bat") file, I use the START command to launch it in a new shell window and add the /WAIT option so that the script will not proceed to the next operation until this scan is complete. The "255" argument specifies the path length, as tlpd.exe lets you adjust the search target.
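The check itself is simple enough to approximate in a few lines of Python (a sketch of what tlpd.exe flags for us, not a substitute for it):

```python
import os

def find_long_paths(proc_dir, limit=255):
    """Report every path whose length meets or exceeds the limit."""
    hits = []
    for root, dirs, files in os.walk(proc_dir):
        for name in dirs + files:
            full = os.path.join(root, name)
            if len(full) >= limit:
                hits.append(full)
    return hits
```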
Step 1: Initial Survey
Step 2: PII Scan

This step nicely illustrates the flexibility of a microservice approach to workflow design, as we are currently using our third different application for this process. Early iterations of the workflow employed the Cornell Spider, but the high number of false positives (i.e., nine digit integers interpreted as SSNs) made reviewing scan results highly labor-intensive. (Cornell no longer hosts a copy, but you can check it out in the Internet Archive.)
We next employed Identity Finder, having learned of it from Seth Shaw (then at Duke University). This tool was much more accurate and included functionality to redact information from plain text and Microsoft Office Open XML files. At the same time, Identity Finder was rather expensive, and a change in its enterprise pricing at the University of Michigan (along with the open source nature of our Mellon grant development) has led us to a third solution: bulk_extractor.
Already featured in Archivematica and a prominent component of the BitCurator project, bulk_extractor provides a rich array of scanners and comes with a viewer to inspect scan results. I am in the process of rewriting our PII scan script to include bulk_extractor (ah... the glory of microservices!) and will probably end up using some variation on the following command:
```
bulk_extractor -o "Z:\path\to\output\folder" -x aes -x base64 -x elf -x email -x exif -x gps -x gzip -x hiberfile -x httplogs -x json -x kml -x msxml -x net -x rar -x sqlite -x vcard -x windirs -x winlnk -x winpe -x winprefetch -R "Z:\path\to\input"
```
We are only using a subset of the available scanners; the "-x" options are instructing bulk_extractor to exclude certain scanners that we aren't necessarily interested in.
We're particularly interested in exploring how the BEViewer can be integrated into our current workflow (and possibly into Archivematica's new Appraisal and Arrangement tab? We'll have to see...). In any case, here's an example of how results are displayed and viewed in their original context:
Step 3: Identifying File Extensions
```
REM Generate a DROID report
REM _procDir = processing directory
java -jar droid-command-line-6.1.5.jar -R -a "%_procDir%" -p droidExtensionProfile.droid

REM Export report to a CSV file
java -jar droid-command-line-6.1.5.jar -p droidExtensionProfile.droid -e extensionMismatchReport.csv
```
In the first command, DROID recursively scans our processing directory and outputs to our profile file (droidExtensionProfile.droid). In the second, we export this profile to a .csv file, one column of which indicates file extension mismatch with a value of true (the file extension does not match the format profile detected by DROID) or false (extension is not in conflict with profile).
Basic CMD.EXE is pretty lousy at parsing .csv files, so I do one extra step and make this .csv file a tab delimited file, using a Visual Basic script I found somewhere on the Internets. (This is getting ugly--thanks for sticking with us!)
We then loop through this tab delimited file and pull out all paths that have extension mismatches:
```
FOR /F "usebackq tokens=4,13,14,15 delims= " %%A in (`FINDSTR /IC:" true " "extensionMismatchReport.tsv"`) DO CALL :_fileExtensionIdentification "%%A" "%%B" "%%C" "%%D"
```
Once again we use a FOR loop, with the tab character set as the delimiter. We loop through each line of our extension mismatch report, looking for rows where DROID returned "true" in the extension mismatch column; we then pull information from four columns and pass these as arguments to our ":_fileExtensionIdentification" function: 4 (full path to content), 13 (file extension; employed to identify files with no extension), 14 (PUID, or PRONOM Unique IDentifier), and 15 (mime type).
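Since CMD.EXE makes this parsing painful, here's the same filter sketched in Python's csv module (the four column positions come from the post; the location of DROID's true/false flag is my assumption here):

```python
import csv

# 0-based positions of the 1-based columns named above:
# 3 = full path, 12 = file extension, 13 = PUID, 14 = mime type.
# MISMATCH_COL is an assumption about where the true/false flag sits.
MISMATCH_COL = 11

def extension_mismatches(report_path, mismatch_col=MISMATCH_COL):
    """Collect (path, extension, PUID, mime type) for rows flagged 'true'
    in a tab-delimited DROID export."""
    hits = []
    with open(report_path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) > 14 and row[mismatch_col].strip().lower() == "true":
                hits.append((row[3], row[12], row[13], row[14]))
    return hits
```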
Once this information is passed to the function, we first run the TrID file identifier utility:
Based upon the file's binary signature, TrID will present the likelihood of the file being a format (and extension) as a percentage:
Because the output from this tool may be indeterminate, we also use curl to grab the PRONOM format profile (using the PUID as a variable in the command), save this information to a file, and then look for any signature tags that will enclose extension information:
```
curl.exe http://apps.nationalarchives.gov.uk/pronom/%_puid%.xml > pronom.txt
TYPE pronom.txt | FINDSTR /C:"<Signature>"
```
The TYPE command will print a file to STDOUT and we then pipe this to FINDSTR to identify only those lines that include extensions.
Based upon the information from these tools, the archivist may elect to assign a new extension to a file (a choice that is recorded in a log file) or simply move on to the next one if neither utility presents compelling evidence.
Step 4: Format Conversion

Following the lead of Archivematica, we've chosen to create preservation copies of content in 'at-risk' file formats as a primary preservation strategy. In developing our conversion pathways, we conducted an extensive review of community best practices and were strongly influenced by the Library of Congress's "Sustainability of Digital Formats", the Florida Digital Archive's "File Preservation Strategies", and Archivematica's format policies.
This step involves searching for "at-risk" formats by extension (another reason we've incorporated functionality for file extension identification) and then looping through each list and sending content to different applications. We also calculate an eight character CRC32 hash for each original file and append it to the new preservation copy to (a) avoid file name collisions and (b) establish a link between the preservation and original copies. Below are some of our most common conversion operations:
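The CRC32 naming convention can be sketched in Python (the underscore separator and lowercase hex are my assumptions for this illustration; only the eight-character CRC32 itself comes from the post):

```python
import os
import zlib

def preservation_name(original_path, new_ext):
    """Name a preservation copy by appending an 8-character CRC32 of the
    original file, avoiding name collisions and linking the two copies."""
    with open(original_path, "rb") as f:
        crc = zlib.crc32(f.read()) & 0xFFFFFFFF
    stem = os.path.splitext(os.path.basename(original_path))[0]
    return "%s_%08x%s" % (stem, crc, new_ext)
```

Because the hash is derived from the original's bits, the link between original and preservation copy survives even if the two end up in different directories.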
Raster Images: .bmp .psd .pcd .pct .tga --> .tif (convert.exe utility from ImageMagick)
```
convert.exe "%_original%" "%_preservation%.tif"
```
Vector Images: .ai .wmf .emf --> .svg (Inkscape)
```
inkscape.exe -f "%_original%" -l "%_preservation%.svg"
```
.PDF --> .PDF/A (Ghostscript)
```
gswin64.exe -sFONTPATH="C:\Windows\Fonts;C:\Program Files\gs\gs9.15\lib" -dPDFA -dBATCH -dNOPAUSE -dEmbedAllFonts=true -dUseCIEColor -sProcessColorModel=DeviceCMYK -dPDFACompatibilityPolicy=1 -sDEVICE=pdfwrite -sOutputFile="%_preservation%" "%_original%"
```
In the above example, I'm using a 64 bit version of Ghostscript. I won't even try to unpack all the options associated with this command, but check out the GS documentation for more info. Note that if you update your PDFA_def.ps file with the location of an ICC color profile, you will need to use double backslashes in the path information.
Audio Recordings: .wma .ra .au .snd --> .wav (FFmpeg)
```
REM Use FFprobe to get more information about the recording
ffprobe.exe -loglevel panic "%_original%" -show_streams > ffprobe.txt

REM Parse FFprobe output to determine the number of audio channels
FOR /F "usebackq tokens=2 delims==" %%c in (`FINDSTR /C:"channels" ffprobe.txt`) DO (SET _audchan=%%c)

REM Run FFmpeg, using the %_audchan% variable
ffmpeg.exe -i "%_original%" -ac %_audchan% "%_preservation%.wav"
```
Video Files: .flv .wmv .rv .rm .rmvb .mts --> .mp4 with h.264 encoding (FFmpeg)
```
REM Use FFprobe to get more information about the recording
ffprobe.exe -loglevel panic "%_original%" -show_streams > ffprobe.txt

REM Parse FFprobe output to determine the number of audio channels
FOR /F "usebackq tokens=2 delims==" %%c in (`FINDSTR /C:"channels" ffprobe.txt`) DO (SET _audchan=%%c)

REM Run FFmpeg, using the %_audchan% variable
ffmpeg.exe -i "%_original%" -ac %_audchan% -vcodec libx264 "%_preservation%.mp4"
```
Legacy Word Processing Files: .wp .wpd .cwk .sxw .uot .hwp .lwp .mcw .wn --> .odt (LibreOffice)
```
REM Run LibreOffice as a service listening on port 2002
START "Libre Office" /MIN "C:\Program Files (x86)\LibreOffice 4\program\soffice.exe" "-accept=socket,port=2002;urp;" --headless

REM Run the DocumentConverter python script using the version of python included in LibreOffice
"C:\Program Files (x86)\LibreOffice 4\program\python.exe" DocumentConverter.py "%_original%" "%_preservation%.odt"
```
This conversion requires the PyODConverter python script.
Microsoft Office Files: .doc .xls .ppt --> Office Open XML (OMPM)
Step 5: Arrangement, Packaging, and Description
Step 6: Transfer and Clean Up
```
REM Generate a DROID report
REM _procDir = processing directory
java -jar droid-command-line-6.1.5.jar -R -a "%_procDir%" -p droidProfile.droid

REM Export report to a CSV file
java -jar droid-command-line-6.1.5.jar -p droidProfile.droid -e DROID.csv
```
We then use the Library of Congress's BagIt tool to 'bag' the fully processed material and then (to speed things up) copy it across the network to a secure dark archive using TeraCopy.
```
REM _procDir = Processing directory
bagit-4.4\bin\bag baginplace %_procDir% --log-verbose

REM We then use TeraCopy to move the content to our dark archive location
teracopy.exe COPY %_procDir% %_destination% /CLOSE
```
An additional copy of material will then be uploaded to Deep Blue, our DSpace repository.
Throughout these procedures, we record preservation metadata for each action using PREMIS event semantic units:

- eventType: Name or title of the event (e.g., "virus scan").
- eventIdentifierType: We're using UUIDs to identify events.
- eventIdentifierValue: A UUID to uniquely identify the event.
- eventDateTime: Timestamp for when the event concluded.
- eventDetail: Note providing additional information for the event.
- eventOutcome: Was the process completed? (Completion indicates success.)
- linkingAgentIdentifierType: We use MARC21 codes to identify agents.
- linkingAgentIdentifierValue: MiU-H (Our MARC21 code.)
- linkingAgentRole: Executor (i.e., the library executed this action).
- linkingAgentIdentifierType: "Tool" (we use this second agent record to identify software used in events).
- linkingAgentIdentifierValue: Name and version of software.
- linkingAgentRole: "Software"
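To make the event record a little more concrete, here is the field list above bundled into a Python dict (a sketch; the helper, its key names, and the nesting of the two linking agents are illustrative, not our production serialization):

```python
import datetime
import uuid

def premis_event(event_type, detail, outcome, software):
    """Assemble one PREMIS-style event record from the fields listed above."""
    return {
        "eventType": event_type,
        "eventIdentifierType": "UUID",
        "eventIdentifierValue": str(uuid.uuid4()),
        "eventDateTime": datetime.datetime.now().isoformat(),
        "eventDetail": detail,
        "eventOutcome": outcome,
        # First agent: the library; second agent: the software used.
        "linkingAgents": [
            {"identifierType": "MARC21 Code", "identifierValue": "MiU-H",
             "role": "Executor"},
            {"identifierType": "Tool", "identifierValue": software,
             "role": "Software"},
        ],
    }
```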