Bentley Historical Library Curation Team Blog: Maximizing Microservices in Workflows

Today I wanted to talk a little (maybe a lot?) about our development of ingest and processing workflows for digital archives at the Bentley Historical Library, with a focus on the role of microservices.

Workflow Resources

Maybe we should pause for a wee bit of context (we are archivists, after all!). As I mentioned in a previous post, our 2010-2011 MeMail project gave us a great opportunity to explore and standardize our procedures for preparing born-digital archives for long-term preservation and access. It was also a very fruitful period of research into emerging best practices and procedures.

At the time, there wasn't a ton of publicly available documentation on workflows established by peer institutions, but the following projects proved to be tremendously helpful in our workflow planning and development:

Personal Archives Accessible in Digital Media (paradigm) (2005-2007): an excellent resource for policy questions and considerations related to the acquisition, appraisal, and description of digital personal papers. By not promoting specific tools or techniques (which would have inevitably fallen out of date in the intervening years), the project workbook has remained a great primer for collecting and processing digital archives.
AIMS Born-Digital Collections: An Inter-Institutional Model for Stewardship (2009-2011): another Mellon-funded project that involved the University of Virginia, Stanford University, University of Hull (U.K.) and Yale University. The project blog provided a wealth of information on tools, resources, and strategies and their white paper is essential reading for any archivist or institution wrestling with the thorny issues of born-digital archives, from donor surveys through disk imaging.
Practical E-Records: Chris Prom's blog from his tenure as a Fulbright scholar at the University of Dundee's Center for Archive and Information Studies yielded a lot of great resources for policy and workflow development as well as reviews of handy and useful tools.
Archivematica: we first became aware of Archivematica at the 2010 SAA annual meeting, when Tim Pyatt and Seth Shaw featured it in a preconference workshop. While the tool was still undergoing extensive development at this point (version 0.6), the linear nature of its workflow and clearly defined nature of its microservices were hugely influential in helping us sketch out a preliminary workflow for our born-digital accessions.

Using the above as guidelines (and inspiration), we cobbled together a manual workflow that was useful in terms of defining functional requirements but ultimately not viable as a guide for processing digital archives due to the many potential opportunities for user error or inconsistencies with metadata collection, file naming, copy operations, etc.

These shortcomings led me to automate some workflow steps and ultimately produced the AutoPro tool and our current ingest and digital processing workflow:

Preliminary procedures to document the initial state of content and check for potential issues
Step 1: Initial survey and appraisal of content
Step 2: Scan for Personally Identifiable Information (primarily Social Security numbers and credit card numbers)
Step 3: Identify file extensions
Step 4: File format conversions
Step 5: Arrangement and Description
Step 6: Transfer and Clean Up

Microservices

One of the greatest lessons I took from the AIMS project and Archivematica was the use of microservices; that is, instead of building a massive, heavily interdependent system, I defined functional requirements and then identified a tool that would complete the necessary tasks. These tools could then be swapped out or shifted around in the workflow to permit greater flexibility and easier implementation.

Rather than dwell too extensively on the individual procedures in our workflow (that's what the manual is for!), I would like to provide some examples of how we accomplish steps using various command prompt/CMD.EXE utilities as microservices. Having said that, I feel compelled to call attention to the following:

I am an archivist, not a programmer; at one point, I thought I could use Python for AutoPro, but quickly realized I had a much better chance of stringing something together with Windows CMD.EXE shell scripts, as they were easy to learn and use. Even then, I probably could have done better....give a holler if you see any egregious errors!
For a great tutorial on command-line basics, see A/V PReserve's "Introduction to Using the Command Line Interface for Working with Files and Directories."
As I hinted above, we're a Windows shop and the following code snippets reflect CMD.EXE commands. Many of these applications can be run on Mac/Linux machines via native versions or WINE.
The CMD.EXE shell needs to know where non-native applications/utilities are located; users should CD into the appropriate directory or include the full system path to the application in the command.
If any paths (to applications or files) contain spaces, you will need to enclose the path in quotation marks.
The output of all these operations is collected in log files (usually by redirecting STDOUT) so that we have a full audit trail of operations and a record of any actions performed on content.
In the code samples, 'REM' is used to comment out notes and descriptions.
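For readers who are not on Windows, the audit-trail idea in the notes above can be sketched in Python. This is not the AutoPro code; the function name `run_and_log` and the layout of the log entries are assumptions made purely for illustration.

```python
import subprocess
from datetime import datetime


def run_and_log(command, log_path):
    """Run one microservice and append its output and return code to a log.

    `command` is a list such as ["tool.exe", "--flag", "path"]; passing a list
    means paths with spaces need no extra quoting.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    with open(log_path, "a", encoding="utf-8") as log:
        # Hypothetical log layout: timestamped command, output, return code.
        log.write(f"[{datetime.now().isoformat()}] {' '.join(command)}\n")
        log.write(result.stdout)
        log.write(result.stderr)
        log.write(f"return code: {result.returncode}\n")
    return result.returncode
```

Because every tool goes through the same wrapper, swapping one utility for another does not change how the audit trail is produced.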

Preliminary Procedures

Upon initiating a processing session, we run a number of preliminary processes to document the original state of the digital accession and identify any potential problems.

Virus Scan

The University of Michigan has Microsoft System Center Endpoint Protection installed on all its workstations. Making the best of this situation, we use the MpCmdRun.exe utility to scan content for viruses and malware, first checking to make sure the antivirus definitions are up to date:

 REM _procDir=Path to processing folder  
 "C:\Program Files\Microsoft Security Client\MpCmdRun.exe" -SignatureUpdate -MMPC  
 "C:\Program Files\Microsoft Security Client\MpCmdRun.exe" -scan -scantype 3 -file %_procDir%

Initial Manifest

Content is stored in our interim repository using the Library of Congress BagIt specification. When ingest and processing procedures commence, we create a new document to record the structure and size of the accession using diruse.exe and md5deep:

 REM _procDir=Path to processing folder  
 diruse.exe /B /S %_procDir%  
 CD /D %_procDir%  
 md5deep.exe -rclzt *

Diruse.exe will output the entire directory hierarchy (thanks to the /S option) and provide the number of files and relative size (in bytes, due to the /B option) in addition to providing total number of files and size for the main directory.

For md5deep, changing to the processing directory will facilitate returning relative paths for content. Our command includes the following parameters:

-r: recursive mode; will traverse the entire directory structure
-c: produces comma separated value output
-l: outputs relative paths (as dictated by location on command prompt)
-z: returns file sizes (in bytes)
-t: includes timestamp of file creation time
*: the asterisk indicates that everything in the present working directory will be included in output.
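To make it clearer what this manifest records, here is a rough Python equivalent of the md5deep call. It is a sketch under stated assumptions, not what AutoPro runs: the function name `build_manifest` and the column order are invented for illustration, and the output file should sit outside the folder being scanned.

```python
import csv
import hashlib
import os
from datetime import datetime


def build_manifest(root, out_csv):
    """Write size, MD5, timestamp, and relative path for every file under root."""
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["size_bytes", "md5", "timestamp", "relative_path"])
        for folder, _dirs, files in os.walk(root):
            for name in sorted(files):
                full = os.path.join(folder, name)
                digest = hashlib.md5()
                with open(full, "rb") as stream:
                    # Read in 1 MB chunks so large files do not fill memory.
                    for chunk in iter(lambda: stream.read(1 << 20), b""):
                        digest.update(chunk)
                info = os.stat(full)
                # st_ctime is creation time on Windows, but metadata-change
                # time on Linux and macOS.
                stamp = datetime.fromtimestamp(info.st_ctime).isoformat()
                writer.writerow([info.st_size, digest.hexdigest(), stamp,
                                 os.path.relpath(full, root)])
```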

Extract Content from Archive Files

In order to make sure that content stored in archive files is extracted and run through important preservation actions, we search for any such content and use 7-Zip to extract content.

First, we search the processing directory for any archive files, and save the full path to a text file:

 CD /D %_procDir%  
 DIR /S /B *.zip *.7z *.xz *.gz *.gzip *.tgz *.bz2 *.bzip2 *.tbz2 *.tbz *.tar *.lzma *.rar *.cab *.lza *.lzh | FINDSTR /I /E ".zip .7z .xz .gz .gzip .tgz .bz2 .bzip2 .tbz2 .tbz .tar .lzma .rar .cab .lza .lzh" > ..\archiveFiles.txt

The DIR command (similar to "ls" on a Mac or Linux terminal) employs the /S option to recursively list content and the /B option for bare output, which in combination with /S returns the full path to each file. The list of wildcard patterns (by no means the best way to go about this, but...) limits the results to file names that match one of these extensions. For greater accuracy, we then pipe ("|") this output to the findstr ("find string") command, which uses the /I option for a case-insensitive search and /E to match content at the end of a line.
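The same search can be expressed without wildcards or pipes. The following Python sketch is only an illustration of the idea (a case-insensitive match on the end of each file name); the function name `find_archives` is hypothetical and the extension list is simply copied from the command above.

```python
import os

# Extensions copied from the DIR / FINDSTR command in this post.
ARCHIVE_EXTENSIONS = (
    ".zip", ".7z", ".xz", ".gz", ".gzip", ".tgz", ".bz2", ".bzip2",
    ".tbz2", ".tbz", ".tar", ".lzma", ".rar", ".cab", ".lza", ".lzh",
)


def find_archives(root):
    """Return the full path of every file under root with an archive extension."""
    found = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            # Lower-casing the name gives the same effect as FINDSTR /I,
            # and endswith() mirrors the end-of-line match of /E.
            if name.lower().endswith(ARCHIVE_EXTENSIONS):
                found.append(os.path.join(folder, name))
    return sorted(found)
```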

We then iterate through this list with a FOR loop and send each file to our ":_7zJob" extraction function with the filename (%%a) passed along as a parameter:

 FOR /F "delims=" %%a in (..\archiveFiles.txt ) DO (CALL :_7zJob "%%a")  
   
 REM when loop is done, GOTO next step  
   
 :_7zJob  
 REM Create folder in _procDir with the same name as archive; if folder already exists; get user input  
 SET _7zDestination="%~dpn1"  
 MKDIR %_7zDestination%
   
 REM Run 7zip to extract files  
 7z.exe x %1 -o%_7zDestination%   
   
 REM Record results (both success and failure)  
 IF %ERRORLEVEL% NEQ 0 (  
      ECHO  FAILED EXTRACTION  
      GOTO :EOF  
 )     ELSE (  
      ECHO  SUCCESSFUL EXTRACTION!  
      GOTO :EOF  
 )

As the path to each archive file is sent to :_7zJob, we use CMD.EXE's built-in parameter extension functionality to build a folder path with the same root name as the archive file (%~dpn1; Z:\unprocessed\9834_0001\newsletters.zip thus would be Z:\unprocessed\9834_0001\newsletters). This path will be the destination for files extracted from a given archive file; we save it as a variable (%_7zDestination%) and create a folder with the MKDIR command.

We then run 7-Zip, using the 'x' option to extract content from the archive file (represented by %1) and use the -o option to send the output to our destination folder. Finally we check the return code (%ERRORLEVEL%) for 7-Zip; if it is not equal to 0 then extraction has failed. Our production script includes an option to retry the operation.
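If the %~dpn1 trick looks cryptic, this small Python sketch shows the same path manipulation and the resulting 7-Zip call. The function names are hypothetical and the sketch only builds the command; it does not run 7-Zip or handle retries.

```python
from pathlib import PureWindowsPath


def extraction_destination(archive_path):
    """Drop the final extension, mirroring what %~dpn1 yields in CMD.EXE."""
    return str(PureWindowsPath(archive_path).with_suffix(""))


def seven_zip_command(archive_path):
    """Build the argument list for '7z.exe x <archive> -o<destination>'."""
    return ["7z.exe", "x", archive_path,
            "-o" + extraction_destination(archive_path)]
```

Handing that list to something like `subprocess.run` and checking for a non-zero return code would correspond to the %ERRORLEVEL% test in the batch function.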

Length of Paths

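Because Windows has traditionally limited full paths to 260 characters (MAX_PATH), and many applications fail on anything longer, we run tlpd.exe ("Too Long Paths Detector") with a slightly conservative threshold of 255 characters to identify any files or directories that might cause us trouble.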

 REM _procDir=Path to processing folder  
 START "TOO LONG PATHS" /WAIT tlpd.exe %_procDir% 255

As we're calling this application from a batch (".bat") file, I use the START command to launch it in a new shell window and add the /WAIT option so that the script will not proceed to the next operation until this is complete. The "255" argument specifies the path length, as tlpd.exe lets you adjust the search target.
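The check itself is simple enough to sketch in a few lines of Python. This is an illustration rather than a description of how tlpd.exe works internally: the function name is hypothetical, and flagging paths that are at least `limit` characters long (rather than strictly longer) is an assumption.

```python
import os


def too_long_paths(root, limit=255):
    """Return every file or folder path under root with at least `limit` characters."""
    flagged = []
    for folder, dirs, files in os.walk(root):
        for name in dirs + files:
            full = os.path.abspath(os.path.join(folder, name))
            # Assumption: the threshold is inclusive.
            if len(full) >= limit:
                flagged.append(full)
    return flagged
```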

Step 1: Initial Survey

In the initial review and survey phase of the workflow, AutoPro incorporates a number of applications to view or render content (Quick View Plus, Irfanview, VLC Media Player, Inkscape) and also employs TreeSize Professional and several Windows utilities to analyze and characterize content. We'll take a closer look at these latter tools in a forthcoming post on appraising digital content.

Step 2: PII Scan

This step nicely illustrates the flexibility of a microservice approach to workflow design, as we are currently using our third different application for this process. Early iterations of the workflow employed the Cornell Spider, but the high number of false positives (e.g., nine-digit integers interpreted as SSNs) made reviewing scan results highly labor-intensive. (Cornell no longer hosts a copy, but you can check it out in the Internet Archive.)

We next employed Identity Finder, having learned of it from Seth Shaw (then at Duke University). This tool was much more accurate and included functionality to redact information from plain text and Microsoft Office Open XML files. At the same time, Identity Finder was rather expensive, and a change in its enterprise pricing at the University of Michigan (along with the open source nature of our Mellon grant development) has led us to a third solution: bulk_extractor.

Already featured in Archivematica and a prominent component of the BitCurator project, bulk_extractor provides a rich array of scanners and comes with a viewer to inspect scan results. I am in the process of rewriting our PII scan script to include bulk_extractor (ah...the glory of microservices!) and will probably end up using some variation on the following command:

 bulk_extractor -o "Z:\path\to\output\folder" -x aes -x base64 -x elf -x email -x exif -x gps -x gzip -x hiberfile -x httplogs -x json -x kml -x msxml -x net -x rar -x sqlite -x vcard -x windirs -x winlnk -x winpe -x winprefetch -R "Z:\path\to\input"

We are only using a subset of the available scanners; the "-x" options are instructing bulk_extractor to exclude certain scanners that we aren't necessarily interested in.
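A command with twenty "-x" options is easier to maintain when the excluded scanners live in a list. The Python sketch below only assembles the same arguments shown above; the function name is hypothetical and nothing here launches bulk_extractor.

```python
# Scanner names copied from the command in this post.
EXCLUDED_SCANNERS = [
    "aes", "base64", "elf", "email", "exif", "gps", "gzip", "hiberfile",
    "httplogs", "json", "kml", "msxml", "net", "rar", "sqlite", "vcard",
    "windirs", "winlnk", "winpe", "winprefetch",
]


def bulk_extractor_command(output_dir, input_dir, excluded=EXCLUDED_SCANNERS):
    """Build the argument list: -o <output>, one -x per scanner, -R <input>."""
    command = ["bulk_extractor", "-o", output_dir]
    for scanner in excluded:
        command += ["-x", scanner]
    command += ["-R", input_dir]
    return command
```

Re-enabling a scanner then means deleting one entry from the list rather than editing a long command line.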

We're particularly interested in exploring how the BEViewer can be integrated into our current workflow (and possibly into Archivematica's new Appraisal and Arrangement tab? We'll have to see...). In any case, here's an example of how results are displayed and viewed in their original context:

Step 3: Identifying File Extensions

The identification of mismatched file extensions is not a required step in our workflow and is intended solely to help end-users access and render content.

As a first step, we run the UK National Archives' DROID utility and export a report to a .csv file. Before running this command, we open up the tool preferences and uncheck the "create md5 checksum" option so that the process runs faster.

 REM Generate a DROID report  
 REM _procDir = processing directory  
 java -jar droid-command-line-6.1.5.jar -R -a "%_procDir%" -p droidExtensionProfile.droid   
   
 REM Export report to a CSV file  
 java -jar droid-command-line-6.1.5.jar -p droidExtensionProfile.droid -e extensionMismatchReport.csv

In the first command, DROID recursively scans our processing directory and outputs to our profile file (droidExtensionProfile.droid). In the second, we export this profile to a .csv file, one column of which indicates file extension mismatch with a value of true (the file extension does not match the format profile detected by DROID) or false (extension is not in conflict with profile).

Basic CMD.EXE is pretty lousy at parsing .csv files, so I do one extra step and make this .csv file a tab delimited file, using a Visual Basic script I found somewhere on the Internets. (This is getting ugly--thanks for sticking with us!)

We then loop through this tab delimited file and pull out all paths that have extension mismatches:

 FOR /F "usebackq tokens=4,13,14,15 delims=     " %%A in (`FINDSTR /IC:"     true     " "extensionMismatchReport.tsv"`) DO CALL :_fileExtensionIdentification "%%A" "%%B" "%%C" "%%D"

Once again we use a FOR loop, this time with the tab character set as the delimiter. FINDSTR returns each line of our extension mismatch report where DROID reported "true" in the extension mismatch column; from each of these lines we pull information from four columns and pass it as arguments to our ":_fileExtensionIdentification" function: 4 (full path to content), 13 (file extension; employed to identify files with no extension), 14 (PUID, or PRONOM Unique IDentifier), and 15 (MIME type).
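With a proper CSV parser, the detour through a tab-delimited file is not needed. The following Python sketch filters the report directly; the column headers it relies on (FILE_PATH, EXT, EXTENSION_MISMATCH, PUID, MIME_TYPE) are assumptions about the DROID export and should be checked against your own report, and the function name is hypothetical.

```python
import csv
import io


def extension_mismatches(csv_text):
    """Yield (path, extension, PUID, MIME type) for rows flagged as mismatched."""
    # Assumed header names; adjust them to match the exported report.
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if row["EXTENSION_MISMATCH"].strip().lower() == "true":
            yield (row["FILE_PATH"], row["EXT"], row["PUID"], row["MIME_TYPE"])
```

Looking fields up by name also avoids the column-counting that the tokens option requires.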

Once this information is passed to the function, we first run the TrID file identifier utility:

 trid.exe %_file%

Based upon the file's binary signature, TrID will present the likelihood of the file being a format (and extension) as a percentage:

Because the output from this tool may be indeterminate, we also use curl to grab the PRONOM format profile (using the PUID as a variable in the command), save this information to a file, and then look for any signature tags that will enclose extension information:

 curl.exe http://apps.nationalarchives.gov.uk/pronom/%_puid%.xml > pronom.txt  
 TYPE pronom.txt | FINDSTR /C:"<Signature>"

The TYPE command will print a file to STDOUT and we then pipe this to FINDSTR to identify only those lines that include extensions.

Based upon the information from these tools, the archivist may elect to assign a new extension to a file (a choice that is recorded in a log file) or simply move on to the next one if neither utility presents compelling evidence.

Step 4: Format Conversion

Following the lead of Archivematica, we've chosen to create preservation copies of content in 'at-risk' file formats as a primary preservation strategy. In developing our conversion pathways, we conducted an extensive review of community best practices and were strongly influenced by the Library of Congress's "Sustainability of Digital Formats", the Florida Digital Archive's "File Preservation Strategies", and Archivematica's format policies.

This step involves searching for "at-risk" formats by extension (another reason we've incorporated functionality for file extension identification) and then looping through each list and sending content to different applications. We also calculate an eight character CRC32 hash for each original file and append it to the new preservation copy to (a) avoid file name collisions and (b) establish a link between the preservation and original copies. Below are some of our most common conversion operations:

Raster Images: .bmp .psd .pcd .pct .tga --> .tif (convert.exe utility from ImageMagick)

 convert.exe "%_original%" "%_preservation%.tif"

Vector Images: .ai .wmf .emf --> .svg (Inkscape)

 inkscape.exe -f "%_original%" -l "%_preservation%.svg"

.PDF --> .PDF/A (Ghostscript)

 gswin64.exe -sFONTPATH="C:\Windows\Fonts;C:\Program Files\gs\gs9.15\lib" -dPDFA -dBATCH -dNOPAUSE -dEmbedAllFonts=true -dUseCIEColor -sProcessColorModel=DeviceCMYK -dPDFACompatibilityPolicy=1 -sDEVICE=pdfwrite -sOutputFile="%_preservation%" "%_original%"

In the above example, I'm using a 64 bit version of Ghostscript. I won't even try to unpack all the options associated with this command, but check out the GS documentation for more info. Note that if you update your PDFA_def.ps file with the location of an ICC color profile, you will need to use double backslashes in the path information.

Audio Recordings: .wma .ra .au .snd --> .wav (FFmpeg)

 REM Use FFprobe to get more information about the recording  
 ffprobe.exe -loglevel panic "%_original%" -show_streams > ffprobe.txt  
   
 REM Parse FFprobe output to determine the number of audio channels  
 FOR /F "usebackq tokens=2 delims==" %%c in (`FINDSTR /C:"channels" ffprobe.txt`) DO (SET _audchan=%%c)  
   
 REM Run FFmpeg, using the %_audchan% variable  
 ffmpeg.exe -i "%_original%" -ac %_audchan% "%_preservation%.wav"

Video Files: .flv .wmv .rv .rm .rmvb .mts --> .mp4 with h.264 encoding (FFmpeg)

 REM Use FFprobe to get more information about the recording  
 ffprobe.exe -loglevel panic "%_original%" -show_streams > ffprobe.txt  
   
 REM Parse FFprobe output to determine the number of audio channels  
 FOR /F "usebackq tokens=2 delims==" %%c in (`FINDSTR /C:"channels" ffprobe.txt`) DO (SET _audchan=%%c)  
   
 REM Run FFmpeg, using the %_audchan% variable  
 ffmpeg.exe -i "%_original%" -ac %_audchan% -vcodec libx264 "%_preservation%.mp4"
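The FFprobe parsing used in the audio and video conversions can be sketched in Python as follows. The function name is hypothetical, and one detail differs from the batch loop: this sketch returns the first "channels=" value it finds, whereas the FOR loop keeps the last one.

```python
def audio_channels(ffprobe_output):
    """Return the channel count from the first 'channels=N' line, or None."""
    for line in ffprobe_output.splitlines():
        key, _sep, value = line.strip().partition("=")
        # Match the key exactly so related keys are not picked up.
        if key == "channels" and value.isdigit():
            return int(value)
    return None
```

Returning None when no audio stream is reported gives the calling script a chance to skip the -ac option instead of passing an empty value to FFmpeg.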

Legacy Word Processing Files: .wp .wpd .cwk .sxw .uot .hwp .lwp .mcw .wn --> .odt (LibreOffice)

 REM Run LibreOffice as a service and listening on port 2002  
 START "Libre Office" /MIN "C:\Program Files (x86)\LibreOffice 4\program\soffice.exe" "-accept=socket,port=2002;urp;" --headless  
   
 REM Run DocumentConverter python script using the version of python included in LibreOffice.  
 "C:\Program Files (x86)\LibreOffice 4\program\python.exe" DocumentConverter.py "%_original%" "%_preservation%.odt"

This conversion requires the PyODConverter python script.

Microsoft Office Files: .doc .xls .ppt --> Office Open XML (OMPM)

This operation requires the installation of Microsoft's Office Compatibility Pack and Office Migration Planning Manager Update 1 (OMPM). Before running, the C:\OMPM\Tools\ofc.ini file must be modified to reflect the "SourcePathTemplate" and the "DestinationPathTemplate" (examples are in the file). Once modified, the OFC.EXE utility will run through and convert all legacy Office file formats to the 2010 version of Office Open XML with the following command:

 OFC.EXE
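The CRC32 suffix mentioned at the start of this step applies to all of the conversions above. Here is a minimal Python sketch of the idea; the function names and the "stem_CRC32.ext" naming pattern are assumptions made for illustration, not necessarily the exact pattern AutoPro uses.

```python
import os
import zlib


def crc32_hex(path):
    """Return the CRC32 of a file as eight uppercase hexadecimal characters."""
    checksum = 0
    with open(path, "rb") as stream:
        # Feed the file through zlib.crc32 in 1 MB chunks.
        for chunk in iter(lambda: stream.read(1 << 20), b""):
            checksum = zlib.crc32(chunk, checksum)
    return f"{checksum & 0xFFFFFFFF:08X}"


def preservation_name(original_path, new_extension):
    """Build '<stem>_<CRC32><new_extension>' next to the original file.

    The underscore-separated pattern is a hypothetical naming convention.
    """
    stem, _old_extension = os.path.splitext(original_path)
    return f"{stem}_{crc32_hex(original_path)}{new_extension}"
```

Because the hash is computed from the original file, two different originals that share a file name end up with different preservation copies, and the suffix points back to the source file.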

Step 5: Arrangement, Packaging, and Description

This step involves a number of applications for conducting further reviews of content and also employs 7-Zip to package materials in .zip files and a custom Excel user form for recording descriptive and administrative metadata. We'll explore this functionality in more depth in a later post.

Step 6: Transfer and Clean Up

To document the final state of the accession (especially if preservation copies have been created or materials have been packaged in .zip files), we run DROID a final time. After manually enabling the creation of md5 checksums, we employ the same commands as used before:

 REM Generate a DROID report  
 REM _procDir = processing directory  
 java -jar droid-command-line-6.1.5.jar -R -a "%_procDir%" -p droidProfile.droid   
   
 REM Export report to a CSV file  
 java -jar droid-command-line-6.1.5.jar -p droidProfile.droid -e DROID.csv

We then use the Library of Congress's BagIt tool to 'bag' the fully processed material and then (to speed things up) copy it across the network to a secure dark archive using TeraCopy.

 REM _procDir = Processing directory  
 bagit-4.4\bin\bag baginplace %_procDir% --log-verbose   
   
 REM We then use TeraCopy to move the content to our dark archive location  
 teracopy.exe COPY %_procDir% %_destination% /CLOSE

An additional copy of material will then be uploaded to Deep Blue, our DSpace repository.

PREMIS

I should also mention we record PREMIS event information for all preservation actions at the accession level. Because I had no idea how to work with XML when we started this, we write the following elements to .csv files:

eventType: Name or title of the event (e.g., "virus scan").
eventIdentifierType: We're using UUIDs to identify events.
eventIdentifierValue: A UUID to uniquely identify the event.
eventDateTime: Timestamp for when the event concluded.
eventDetail: Note providing additional information for the event.
eventOutcome: Was the process completed? (Completion indicates success.)
linkingAgentIdentifierType: We use MARC21 codes to identify agents.
linkingAgentIdentifierValue: MiU-H (Our MARC21 code.)
linkingAgentRole: Executor (i.e., the library executed this action).
linkingAgentIdentifierType: "Tool" (we use this second agent record to identify software used in events).
linkingAgentIdentifierValue: Name and version of software.
linkingAgentRole: "Software"
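To show how one event becomes one row in such a spreadsheet, here is a small Python sketch. It is an illustration only: the function names are hypothetical, and the literal values "UUID" and "MARC21" used for the two identifier-type columns are assumptions about how those cells are filled in.

```python
import csv
import uuid
from datetime import datetime

# Column order follows the element list above; the agent columns repeat
# because each event links to two agents (the library and the software).
PREMIS_COLUMNS = [
    "eventType", "eventIdentifierType", "eventIdentifierValue",
    "eventDateTime", "eventDetail", "eventOutcome",
    "linkingAgentIdentifierType", "linkingAgentIdentifierValue", "linkingAgentRole",
    "linkingAgentIdentifierType", "linkingAgentIdentifierValue", "linkingAgentRole",
]


def premis_event_row(event_type, detail, outcome, software):
    """Return one event as a list of values in PREMIS_COLUMNS order."""
    return [
        event_type, "UUID", str(uuid.uuid4()),
        datetime.now().isoformat(timespec="seconds"), detail, outcome,
        "MARC21", "MiU-H", "Executor",
        "Tool", software, "Software",
    ]


def append_event(csv_path, row):
    """Append a single row to the accession's PREMIS .csv file."""
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow(row)
```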

Click here for an example of one of these PREMIS spreadsheets.

Conclusion

So that's it--you can go get on with your day and I'll get back to this grant project. Please give us a holler if you have any questions or suggestions as to how some of this could work more efficiently. We're excited that before long all these CMD.EXE scripts will be a thing of the past, but they've treated us pretty well so far and maybe this post will help you out with some things.

Stay tuned for future posts on appraisal strategies, packaging techniques, ArchivesSpace, Archivematica, DSpace, and more!

Tuesday, May 12, 2015
