Wednesday, October 5, 2016

Transferring Legacy Content into Archivematica's Backlog: `automation-tools` to the Rescue!

First off, these are exciting times at the Bentley. In just about every sense of the word, we are live with ArchivesSpace and, soon, real soon, we'll be live with Archivematica. In fact, I believe at least one of our processors may be in denial about the fact that he'll probably never use AutoPro again:


Remember, Devon, lament is the antidote to denial.

Alright, back to our regularly scheduled program. For us, part of "going live" with Archivematica is ensuring that our current backlog of unprocessed, digital accessions makes its way safely and efficiently into Archivematica's backlog... so that those accessions can be appraised and arranged sometime soon with some new functionality that will be out in version 1.6! Today's post details some of the analysis we did on our current backlog, some of the "pre-transfer" manipulation we've done to the digital accessions in it, the Archivematica environment we're using (at least for now) and the automation-tools setup we're using to automate the process as well as some of the errors we've encountered so far.

Our Current Backlog

As tempted as we were to just dive in, we knew we needed to do a bit of analysis on our current backlog, just to know what we were up against. Here's some basic stats we gathered that informed our subsequent decision-making processes.

Overall:
  • Total deposits: 209
  • Total size: 3.6 TB
  • Total number of files: 5,187,920 

Averages:
  • Average size: 17.2 GB
  • Average number of files: 5683.8

Top ten deposits by size:
  1. 806.6 GB (21.6 %)
  2. 797.1 GB (21.4 %)
  3. 705.6 GB (18.9 %)
  4. 180.4 GB (4.8 %)
  5. 154.1 GB (4.1 %)
  6. 133.2 GB (3.6%)
  7. 128.2 GB (3.4%)
  8. 126.5 GB (3.4 %)
  9. 121.1 GB (3.2 %)
  10. 67.8 GB (1.8%)

For reference, only 6.7% of our deposits make up 99% of our total size; that means we're talking about a pretty long tail here.

Top ten deposits by number of files:
  1. 4,182,060
  2. 208,577
  3. 178,265
  4. 155,198
  5. 154,187
  6. 89,431
  7. 49,201
  8. 17,883
  9. 17,786
  10. 17,541

For reference, only 8.1% of our deposits have more than 5,000 files, so another really long tail.

Ensuring that Archivematica Could Handle this Backlog

Based on a couple of relevant conversations on the Archivematica listserv, we knew that Archivematica, like ArchivesSpace and Pokemon Go and just about every other piece of software out there, can have some scalability and performance issues when you start throwing a lot at it.

There's a couple in particular that we returned to again and again:
  • Archivematica Data Flow: This one gave us Justin Simpson's heuristic for processing space, and taught us that we'd need consider the largest transfer we'd want to process at once, and allocate 3 to 4 times as much space in our processing location (i.e., /var/archivematica/sharedDirectory). Note, this is not the same as storage space, this is just the amount you'll need as Archivematica runs through it's various micro-services.
  • Elasticsearch indexing AIPs with many files: These basically just taught us to keep an eye out on Elasticsearch. Indexing makes about "a zillion" queries (quoting our system administrator here), and with every query Archivematica has to look through all the files in a given transfer, once per file (still quoting). So it can take a while, especially for transfers with lots of files. In fact, this is our most persistent source of frustration during testing, this round and previous rounds.
  • Archivematica scalability on very large object: This one taught us that Archivematica can handle really large files (our biggest, if I had to guess, is probably a ~65 GB video file) and that number of files is really what we should be looking out for.
  • Also, we learned that transfers of 80,000 or 80,000 files just won't work, no matter what setup you've got. This we learned from an internal e-mail but apparently it's out there on the list somewhere.

In the end, there are basically two ways you can ensure that Archivematica doesn't run into scalability and performance issues. You can: 1) reduce the size or volume of your transfers; or 2) up the power of your Archivematica environment. After some back and forth with our system administrator, we ended up doing a little of both.

Pre-Transfer Manipulation

We've settled on somewhat arbitrary parameters for our transfers: no more than 50-60 GB or 50,000 files. For the handful of transfers we encounter that are over that amount (I'm looking at you, Michigan governors and congresspeople), I've manually broken them up and given them a sequential suffix (i.e., _001, _002, _003, etc.) that will become the "name" of the transfer in Archivematica. We'll keep them together with the same accession by ensuring that the accession number in Archivematica is the same. We've even worked this into our procedures going forward.

Archivematica Environment

Here's what we're working with so far. The VM has:
  • 16GB of RAM
  • 4 virtual CPUs
  • ~350GB of disk for processing

Our system administrator also set Elasticsearch's ES_HEAP_SIZE to 2G from the default of 640m. Finally, he made some tweaks to mysql, mostly guesswork based on mysqltuner:
  • key_buffer_size to 512M from the default of 16M
  • query_cache_size to 256M from default of 16M
  • query_cache_limit to 8M from default of 1M
  • innodb_buffer_pool_size to 512M from default of 128M
  • max_heap_table_size to 256M and tmp_table_size to 256M

Finally, he commented out the print statement in index_transfer_files in elasticSearchFunctions.py and added an index on currentLocation (this particular change will make it into Archivematica proper as you can see for yourself on this pull request). You can see the explanation on the Archivematica Tech List.

Aaron in awesome! These tweaks took our "Index transfer contents" step from 5 days on a 50,000 file transfer (when we got tired of waiting and quit) to about 2 hours. What!!!!?!?!?!!? Did I mention that Aaron is awesome!

Enter automation-tools


automation-tools are a set of Python scripts that are designed to automate the processing of transfers in an Archivematica pipeline. They are what's preventing me from manually starting and checking on all 209 (now 249 since I broke some of them up) transfers.

Installation


You won't need to if you're using the Ansible scripts, but if not, you can install automation-tools using the instructions on the README. We ended up doing it both ways here on different versions of Archivematica.

Setup and Configuration/Customization

There's a couple of different ways to customize the automation-tools. One is by specifying a default processing configuration, which you can do by using the Administration tab in Archivematica to set the default processing configuration, then copying the processing configuration file from /var/archivematica/sharedDirectory/sharedMicroServiceTasksConfigs/processingMCPConfigs/defaultProcessingMCP.xml to the transfers/ directory of your automation-tools installation location. Here's our default processing MCP if you're interested. That way, all transfers will get that processing configuration, even if you've changed the one in the dashboard to something different.

Another way to customize is by creating "hooks," which tweak the behavior of automation-tools. There are a couple of options here, but one we've found particularly useful is the get-accession-number hook. This automatically fills in the accession number for a given transfer. Our version of this script takes the folder name (this is the only variable you have to work with, which for us is currently based on a local, "Digital Deposit ID" convention), looks that up in a simple dictionary of Digital Deposit IDs and Accession numbers, which was exported from our local FileMaker Pro database. It even accounts for transfers that have been split up and those that (gasp!) don't have accession numbers for one reason or another. All this serves to ensure that when we do a search through the backlog, we get every transfer associated with a particular accession, even if there's more than one!

Note: It's easy to miss, but you must name this script "get-accession-number" (without an extension). Don't use "get-accession-id" (despite what the instructions say), "get-accession-number.py" or create a folder called "get-accession-number" with a script inside of it. Yes, we made all of those mistakes...

There's also an option for pre-transfer hooks. We haven't explored these yet, but I can see using this to, for example, ensure that an Accession record with a particular ID and/or extent is already in ArchivesSpace (or, if not, even creating one on the fly when a transfer to Archivematica is made).

Running

You can test the automation-tools by running them with the following command:
/usr/share/python/automation-tools/bin/python -m transfers.transfer --user <user> --api-key <apikey> --ss-user <user> --ss-api-key <apikey> --transfer-source <transfer_source_uuid> --config-file <config_file>

Here's what all this means:
  • /usr/share/python/automation-tools/bin/python
The first part of that command tells the automation-tools to use a particular version of Python (the one with particular libraries that you installed earlier). 

  •  transfers.transfer
You then tell it which script to run (for us, for now, the transfer script--you can also run ingest scripts). 

  •  --user <user> --api-key <apikey> --ss-user <user> --ss-api-key <apikey>
Next you give it your credentials for both Archivematica and the Storage Service, which you can find by navigating to your user profile in Archivematica and the Storage Service, respectively. 

  • --transfer-source <transfer_source_uuid>
Transfer source tells the automation-tools where to look for transfers (you can find the UUID in the Storage Services).

  • --config-file <config_file>
The configuration file tells automation-tools where to store the database, log files and a "-pid.lck" file which keeps the automation-tools from starting a new transfer when one is already going. By the way, every once in a while this seems to stick around when it should so you have to go there and delete it.

We also add the following options:
  • --transfer-type 'unzipped bag'
This simply lets the automation-tools know what kind of transfer you're doing. We have to specify this since we're not using the default of 'standard'.

  • --hide
This hides the transfer from the dashboard when we're through. As we learned from this conversation on Managing the dashboard (transfer and ingest), you've got to keep this cleaned up or else you'll run into some browser timeout issues.

  • --verbose
This increases the debugging output because, well, why not?
You could also write a shell script (don't forget to chmod +x transfer-script.sh to make it an executable first!) and when you're ready to go, you can set up a cron job (or in our case, three) to automatically run different versions of the script (that point to different source locations or have different parameters) at given intervals. Here's what our crontab entry looks like:

0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/lib/archivematica/automation-tools/transfer-script_legacy.sh
1,6,11,16,21,26,31,36,41,46,51,56 * * * * /usr/lib/archivematica/automation-tools/transfer-script_bhl-digitalarchive.sh
2,7,12,17,22,27,32,37,42,47,52,57 * * * * /usr/lib/archivematica/automation-tools/transfer-script_bhl-dropbox.sh


This runs the three automation-tools script we have (one for legacy transfers, two for current transfers from two different source locations to which we'll move transfers from separate staging locations) at regular intervals (every five minutes), but ensures that they never run at the same time. Be prepared for about "a zillion" emails (for real) from some guy named "Cron Daemon" :).

Once you have this going, sit back, relax, party like a partyparrot and never look at the Archivematica transfer dashboard again!


Errors Encountered Thus Far

That is, until you hit an error. And we've hit a few. Here's what our inboxes looked like after the first weekend of letting automation-tools go:




This, I'll admit, looks like a lot of failures, except when you remember that many more transfers went through without issue. If we can count what's gone through so far as a sample, we have about a 80% success rate. Excluding the errors we got due to some permissions issues we eventually worked out, here's a couple of common errors we got and how we're handling them:
  • Approve transfer micro-service, Verify bag, and restructure for compliance job: This is the most common. Generally, this has been because some thumb.db file has snuck in or out, and just requires a simple bag update to get things going again.



  • Approve transfer micro-service, Assign checksums and file sizes to objects job: The only commonality I could see in these is that many of them look like they have a weird character in the file name or path (e.g., "black body 1025.mov" or "FY2014/archival files/thach_diane_2014/Fireside_Preview_Panel7.svg", or look like a hidden file that starts with a period (like "._ 2002 - Mary Sue Coleman"). Actually, we're still looking into this error--let us know if you've got any ideas!
  • Scan for viruses micro-service, Scan for viruses job: Yep, this actually happened. It's an example of a time where our local virus scanner (we use System Center Endpoint Protection) didn't catch something that Archivematica's (ClamAV) did. Because of the discrepancy we thought it might be a false positive. In any case, this happened to be something that was mass produced (actually from the development branch of the university... hmm...), so I simply got another copy and got rid of the old one.
  • Create SIP from Transfer micro-service, Index transfer contents job: This one never actually failed, it just went very slowly. See the tweaks we made above.

Conclusion

And so begins our Archivematica journey, one transfer at a time! Recently we got the full demo, pre-SIP to AIP to DIP from Artefactual... we're super pumped! In a few short weeks, we'll be live "for real" with Archivematica here. Stay tuned for more!

No comments:

Post a Comment