So what's an archivist and/or researcher to do?
According to some, assigning subject headings and then browsing by them is the solution to this challenge. Before we all jump on that bandwagon, though (how do you even arrive at meaningful subject headings in the first place when regular born-digital accessions run to many GBs or TBs and contain tens of thousands, if not hundreds of thousands, of individual documents?), let's walk through a couple of exercises on methods and tools that can be used for characterizing and identifying records in the digital age:
- Analysis of Drives
- Digital Forensics
- Digital Humanities Methods and Tools
- Text Analysis with Python
- Named Entity Recognition
- Topic Modeling
Analysis of Drives
Exercise: Examine and consider the files in ES-Test-Files or on your computer.
- Download and install WinDirStat. Note: If you are looking for an alternative on Linux, try KDirStat; on a Mac, try Disk Inventory X or GrandPerspective.
- Open windirstat.exe.
- Examine and consider the files in ES-Test-Files (these were assembled for a guest lecture on this topic) or on your computer.
- When prompted to Select Drives, select All Local Drives, Individual Drives, or A Folder.
- Take a look at the Directory List.
- Take a look at the Extension List.
- Take a look at the Treemap.
- Try coupling the views:
- Directory List → Treemap.
- Treemap → Directory List.
- Directory List | Treemap → Extension List.
- Extension List → Treemap.
- Play around some!
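If you're curious what WinDirStat's Extension List is doing under the hood, it's essentially summing file sizes per extension across a directory tree. Here's a minimal Python sketch of that idea (the path you point it at is up to you; `ES-Test-Files` or any folder works):

```python
# A rough sketch of what WinDirStat's Extension List computes:
# total bytes grouped by file extension across a directory tree.
import os
from collections import Counter

def bytes_by_extension(root):
    """Walk a directory tree and sum file sizes per extension."""
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            try:
                totals[ext] += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # skip files we can't stat (permissions, broken links)
    return totals

if __name__ == "__main__":
    # Print the ten extensions taking up the most space.
    for ext, size in bytes_by_extension(".").most_common(10):
        print(f"{ext}\t{size:,} bytes")
```

It won't draw you a treemap, but it gets you the same "what's eating my disk?" numbers.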
|Bulk Extractor Viewer|
Exercise: Examine the contents of Identity_Finder_Test_Data and identify sensitive data, like Social Security numbers and credit card numbers. You can also use it to find out what e-mail addresses (correspondence series, anyone?), telephone numbers, and websites people visited (or that at least show up in the text somewhere), and how often.
- Download and install Bulk Extractor. Note: Windows users should use the development build located here.
- Open BEViewerLauncher.exe.
- Under the Tools menu, select Run bulk_extractor…, or hit Ctrl + R.
- When prompted to Run bulk_extractor, scan a Directory of Files and navigate to Identity_Finder_Test_Data.
- Create an Output Feature Directory such as Identify_Finder_Test_Data_BE_Output.
- Take a look at the General Options.
- Take a look at the Scanners. Note: More information on bulk_extractor scanners is located here.
- Submit Run.
- When it’s finished, open the Output Feature Directory and verify that files have been created. Some will be empty and others will be populated with data.
- In Bulk Extractor Viewer select the appropriate report from Reports.
- Take a look at the Feature Filter.
- Take a look at the Image.
- Play around some!
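Conceptually, bulk_extractor's scanners are pattern matchers run over raw data, with the hits tallied into feature files. This toy sketch shows the idea on plain text; the regexes here are simplified stand-ins of my own, not bulk_extractor's actual scanners, and the real tool works on raw bytes (including deleted space on disk images):

```python
# A toy version of what bulk_extractor's email/accts scanners do:
# run regular expressions over text and tally the "features" found.
import re
from collections import Counter

# Simplified illustrative patterns -- NOT bulk_extractor's real ones.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "url": re.compile(r"https?://[^\s\"'<>]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def extract_features(text):
    """Return {feature_type: Counter of matches}, loosely like
    the per-feature output files bulk_extractor writes."""
    return {name: Counter(rx.findall(text)) for name, rx in PATTERNS.items()}
```

The `Counter` per feature type is what gives you the "how often" part: a correspondence series starts to suggest itself when one address shows up four hundred times.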
Digital Humanities Methods and Tools
Text Analysis with Python
Method: Text analysis is an automated, computer-assisted method of extracting, organizing, and consuming knowledge from unstructured text.
- Download and install Python. Note: Mac users, Python will be pre-installed on your machines. You may also choose to install Python's NumPy and Matplotlib packages (and all their dependencies). This isn't strictly necessary, but you’ll need them in order to produce the graphical plots we'll be using.
- Download and install NLTK.
- Open IDLE (Python's Integrated Development and Learning Environment), or any command line or terminal.
- Type: import nltk
- Type: nltk.download()
- At the NLTK Downloader prompt, select all (or book, if you are concerned about size) and Download.
- Exit the NLTK Downloader.
- Type: from nltk.book import *
- Type: text1, text2, etc. to find out about these texts.
- Type: text1.collocations() to return frequent word combinations.
- Type: text1.concordance("monstrous")
- Play around some! For example, look for nation, terror, and god in text4 (Inaugural Address Corpus) and im, ur, and lol in text5 (NPS Chat Corpus).
- Type: text1.similar("monstrous")
- Type: text2.similar("monstrous")
- Observe that we get different results for different texts. Austen uses this word quite differently from Melville; for her, monstrous has positive connotations, and sometimes functions as an intensifier like the word very.
- Type: text2.common_contexts(["monstrous", "very"])
- Play around some! Pick another pair of words and compare their usage in two different texts, using the similar() and common_contexts() functions.
- Type: text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
|Lexical Dispersion Plot for Words in U.S. Presidential Inaugural Addresses: This can be used to investigate changes in language use over time.|
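At heart, concordance() is a windowed search: find each occurrence of a word and show its neighbors. Here's a minimal stdlib-only sketch of the idea (not NLTK's actual implementation, which handles alignment and display more gracefully):

```python
# A minimal concordance: show each occurrence of a word with a
# window of surrounding words. NLTK's text1.concordance() works
# on the same principle.
def concordance(tokens, word, window=4):
    """Return a context string for every occurrence of `word`."""
    word = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append(" ".join(left + [tok] + right))
    return lines
```

Functions like similar() and common_contexts() build on the same windows, comparing which words share the same neighborhoods.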
- Type: len(text3) to get the total number of words in the text.
- Type: set(text3) to get all unique words in a text.
- Type: len(set(text3)) to get the total number of unique words in the text. Note: capitalized and lowercase versions of a word count as different words here.
- Type: text3_lower = [word.lower() for word in text3] to make all words lowercase. Note: This is so that capitalized and lowercase versions of the same word don't get counted as two words.
- Type: from nltk.corpus import stopwords
- Type: stopwords = stopwords.words("english")
- Type: text3_clean = [word for word in text3_lower if word not in stopwords] to remove common words from text.
- Type: len(text3_clean) / len(set(text3_clean)) to get total words divided by set of unique words, or lexical diversity.
|This plot shows lexical diversity over time for the Michigan Technic, a collection we've digitized here at the Bentley, using Seaborn, a Python visualization library based on matplotlib. Not much of a trend either way, although I'm not sure what happened in 1980.|
|Not exactly an archival example, but one of my favorites nonetheless. The Largest Vocabulary in Hip Hop: Rappers, Ranked by the Number of Unique Words in their Lyrics. The Wu Tang Association is not to be trifled with.|
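The lexical diversity recipe above rolls up neatly into one function. This stdlib sketch uses a tiny stand-in stopword list of my own rather than NLTK's full English list:

```python
# Lexical diversity as computed in the steps above: lowercase,
# drop stopwords, then divide total words by unique words.
# STOPWORDS is a tiny stand-in for nltk.corpus.stopwords.words("english").
STOPWORDS = {"the", "and", "of", "a", "in", "to", "is", "it", "that"}

def lexical_diversity(tokens, stopwords=STOPWORDS):
    """Total words divided by unique words, after cleaning."""
    clean = [w.lower() for w in tokens if w.lower() not in stopwords]
    return len(clean) / len(set(clean))
```

A higher number means more repetition; a value near 1 means almost every word is used only once.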
- Type: fdist1 = nltk.FreqDist(text1)
- Type: print(fdist1)
- Type: fdist1.most_common(50)
- Type: fdist1["whale"] to get the count for a particular word.
- Type: fdist1.plot(50, cumulative=True)
- Play around some! Try the preceding frequency distribution example for yourself.
- Type: fdist1.hapaxes() to view the words in the text that occur only once.
|Cumulative Frequency Plot for the 50 Most Frequent Words in Moby Dick: these account for nearly half of the tokens.|
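If you want to see what FreqDist is doing, it behaves much like the standard library's collections.Counter over a token list. A small sketch, including a hapaxes() equivalent:

```python
# nltk.FreqDist is, at heart, a counting dictionary over tokens.
from collections import Counter

def freq_dist(tokens):
    """Count token occurrences, like nltk.FreqDist(text)."""
    return Counter(tokens)

def hapaxes(fdist):
    """Words occurring exactly once, like fdist.hapaxes()."""
    return [w for w, n in fdist.items() if n == 1]
```

most_common(50) on a Counter gives you the same top-50 list you'd feed to a cumulative frequency plot.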
Named Entity Recognition 
|Stanford Named Entity Recognizer|
- Download and unzip Stanford NER.
- Download and unzip looted_heritage_reports_txt.zip.
- Open stanford-ner.jar.
- Under Classifier click Load CRF From File.
- Navigate to stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz.
- Under the File menu, click Open File.
- Navigate to any file in the unzipped looted_heritage_reports_txt folder (or use any text file).
- Click Run NER.
- Observe that every entity that Stanford NER is able to identify is now tagged.
- Under the File menu, click Save Tagged File As …
|ePADD, a software package developed by Stanford University's Special Collections & University Archives, supports archival processes around the appraisal, ingest, processing, discovery, and delivery of email archives. This image previews its natural language processing functionality.|
|Those lists of places could be geo-coded and turned into something like this, a map of Children's Traffic Fatalities in Detroit 2004-2014 (notice how the size of the dot indicates frequency).|
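Stanford NER gets its accuracy from a trained CRF classifier; there's no way to reproduce that in a few lines. But to see the *shape* of the task, here's a deliberately naive stdlib sketch of my own that treats runs of capitalized words as candidate entities. It makes plenty of mistakes a real model wouldn't, which is rather the point:

```python
# A toy "NER" that grabs runs of capitalized words -- illustrating
# the input/output shape of entity extraction, with none of the
# CRF modeling that makes Stanford NER actually work.
import re

def naive_entities(text):
    """Return runs of capitalized words, skipping lone
    capitalized words that merely start a sentence."""
    pattern = re.compile(r"(?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)*")
    spans = []
    for m in pattern.finditer(text):
        start = m.start()
        at_sentence_start = start == 0 or text[start - 2:start] in {". ", "! ", "? "}
        if at_sentence_start and " " not in m.group():
            continue  # probably just sentence-initial capitalization
        spans.append(m.group())
    return spans
```

Compare its output on a few of the looted heritage reports with what Stanford NER tags, and you'll quickly see why the trained model (and its three classes: PERSON, LOCATION, ORGANIZATION) earns its keep.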
Topic Modeling
Topic models provide a simple way to analyze large volumes of unlabeled text. A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings.
|Topic Modeling Tool|
- Download Topic Modeling Tool.
- Open TopicModelingTool.jar.
- Click Select Input File or Dir and navigate to the unzipped looted_heritage_reports_txt folder (or use any folder of text files).
- Create a new folder on your computer to hold the topic model data you will create, such as looted_heritage_reports_text_output.
- Click Select Output Dir and select the new folder.
- For Number of topics: enter 15.
- Click Learn Topics.
- Take a look at the output_csv directory.
- Take a look at the output_html directory.
- Explore the Advanced… options.
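Under the hood, the Topic Modeling Tool runs MALLET's LDA implementation. To demystify the mechanics, here's a miniature collapsed Gibbs sampler for LDA in pure stdlib Python: each word in each document is repeatedly reassigned to a topic in proportion to how well it fits. This is a teaching sketch, not a substitute for MALLET:

```python
# A miniature collapsed Gibbs sampler for LDA. Words are repeatedly
# reassigned to topics in proportion to (a) how much their document
# already uses the topic and (b) how much the topic already uses
# the word. Toy scale only.
import random
from collections import Counter

def lda(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """docs: list of token lists. Returns the top 5 words per topic."""
    rng = random.Random(seed)
    v = len({w for d in docs for w in d})
    # z[d][i] = current topic assignment of word i in document d
    z = [[rng.randrange(k) for _ in d] for d in docs]
    doc_topic = [Counter(zd) for zd in z]
    topic_word = [Counter() for _ in range(k)]
    topic_total = [0] * k
    for d, zd in zip(docs, z):
        for w, t in zip(d, zd):
            topic_word[t][w] += 1
            topic_total[t] += 1
    for _ in range(iters):
        for di, d in enumerate(docs):
            for i, w in enumerate(d):
                t = z[di][i]  # remove current assignment...
                doc_topic[di][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                weights = [  # ...and resample proportionally
                    (doc_topic[di][t2] + alpha)
                    * (topic_word[t2][w] + beta) / (topic_total[t2] + v * beta)
                    for t2 in range(k)
                ]
                t = rng.choices(range(k), weights)[0]
                z[di][i] = t
                doc_topic[di][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return [[w for w, _ in topic_word[t].most_common(5)] for t in range(k)]
```

The alpha and beta smoothing parameters here correspond to the hyperparameters hiding behind the tool's Advanced… options; the "Number of topics" box is k.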
And there's more! Take a look at this example of topic modeling in action:
|Topic Model Stacked Bar Graphs from Quantifying Kissinger.|
 At least according to this article, and this one as well. Yes, I was very thorough...
 Maybe. Possibly. Not as much as we like to think, probably. According to our most recent server logs for September (with staff users and bots filtered out), only 0.25% of users of our finding aids looked at our subject headings. That says something, even if it is true that our current implementation of subject headings is rather static.
 The following two sections contain content developed by Thomas Padilla and Devin Higgins for a Mid-Michigan Digital Practitioners workshop, Leveraging Digital Humanities Methods and Tools in the Archive. Thanks, guys!