text mining – Emerging Tech in Libraries

Introducing myself to MALLET

Backstory

In my text mining class at GSLIS, we had a lot of ground to cover. It was easy enough to jump into Oracle SQL Developer and Data Miner and plug into the Oracle database that had been set up for us, and we moved on to processing and classifying merrily. But now, a year later, I’m totally removed from that infrastructure. I wanted to review my work from that class before heading to EMDA next (!) week, but reacquainting myself with Data Miner would require setting up the whole environment first. Not totally understanding the Oracle ecosystem, I thought it would be easy enough to set up a VirtualBox VM and implement the Linux setup as needed, but after several failures I gave up and decided to try something new. As it turns out, MALLET does not only classification but also topic modeling, which I’d never done before.

What is?

Here’s how I understand it: topic modeling, like other text mining techniques, treats text as a ‘bag of words,’ counting which words appear and mostly ignoring their order. It draws out clusters of words (topics) that appear to be related because, statistically, they tend to occur together in the same documents. We’ve all been subjected to wordles. Topic modeling is like DIY wordles that can get very specific and can seem to approach semantic understanding with statistics alone.
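
To make the ‘bag of words’ idea concrete, here is a minimal Python sketch. It only illustrates the intuition: the three tiny documents are made up, and the simple pair counting is a stand-in for the statistical model a real topic modeling tool fits, not a description of what MALLET does internally.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; each string stands in for one document.
docs = [
    "papa told mommy about the war",
    "the war changed papa",
    "betty took the taxi to the party",
]

def bag_of_words(text):
    """Count words, discarding word order entirely."""
    return Counter(text.lower().split())

# Count how many documents each pair of distinct words shares.
pair_counts = Counter()
for doc in docs:
    vocab = sorted(bag_of_words(doc))  # unique words in this document
    pair_counts.update(combinations(vocab, 2))

print(bag_of_words(docs[2]))
print(pair_counts[("papa", "war")])   # pair shared by the first two documents
print(pair_counts[("betty", "war")])  # pair that never shares a document
```

Pairs that keep turning up in the same documents are the raw signal; a topic model turns that kind of signal into clusters in a more principled way.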

One tool that DH folks mention often is MALLET, the MAchine Learning for LanguagE Toolkit, open-source software developed at UMass Amherst starting in 2002. I was pleased to see that it not only models topics, but does the things I’d wanted Oracle Data Miner to do, too — classify with decision trees, Naïve Bayes, and more. There are many tutorials and papers written on/about MALLET, but the one I picked was Getting Started with Topic Modeling and MALLET from The Programming Historian 2, a project out of CHNM. The tutorial is very easy to follow and approaches the subject with a DH-y literariness.

Exploration

One of my favorite small test texts is personally significant to me — my grandmother’s diary from 1943, which she kept as a 16-year-old girl in Hawaii. I transcribed it and TEI’d it a while ago. I split up my plain-text transcript by month, stripped month and day names from the text (so words wouldn’t necessarily cluster around ‘april’), and imported the 12 .txt files into MALLET. Following the tutorial’s instructions, I ran train-topics and came out with data like this:

January: home diary school ve today feel god parents war eyes hours friends make esther changed beauty class true man
February: dear girls thing taxi job find wouldn afraid filipino year american beauty live woman movies happened shoes family makes
March: papa don mommy asuna men americans nature realize simply told voice world bus skin ha ago japanese blood diary
April: dear diary town made white fun dressed learn sun hour days rest week blue soldiers navy kids straight pretty
May: dear girls thing taxi job find wouldn afraid filipino year american beauty live woman movies happened shoes family makes
June: papa don mommy asuna men americans nature realize simply told voice world bus skin ha ago japanese blood diary
July: red day leave dance min insular top idea half country lose realized servicemen lot breeze ahead appearance change lie
August: betty wahiawa taxi set show mr wanted party mama ve wrong insular helped played dinner food chapman fil hawaiian
September: betty wahiawa taxi set show mr wanted party mama ve wrong insular helped played dinner food chapman fil hawaiian
October: johnny rose nice supper breakfast tiquio lunch lydia office ll raymond theater tonight doesn tomorrow altar kim warm forget
November: didn left papa richard long met told house back felt sat gave hand don sweet called meeting dress miss
December: ray lydia dorm bus lovely couldn caught ramos asked kissed park waikiki close st arm loved xmas held world

Note that some clusters appear twice. MALLET considers the directory of .txt files as its whole corpus, then spits out which clusters each file is most closely associated with.

As you can see, I should really have taken out ‘dear’ and ‘diary.’ But I can see that these clusters make sense. She begins the diary in mid-January. It’s her first diary ever, so she tends first toward the grandiose, talking about changes in technology and what it means to be American, and later begins to write about the people in her life, like Betty, her roommate, and Tiquio, the creepy taxi driver. In almost all of the clusters, the War shows up somehow. But what I was really looking forward to was seeing how her entries’ topics changed in December, when she began dating Ray, the man who would be my grandfather. Aww.
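
A cleanup step like the one described above (stripping month and day names, plus ‘dear’ and ‘diary’) could be sketched in Python roughly as follows. This is a hypothetical reconstruction, not the script actually used on the diary: the function name, the word lists, and the sample sentence are all assumptions for illustration.

```python
import re

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]
DAYS = ["monday", "tuesday", "wednesday", "thursday", "friday",
        "saturday", "sunday"]
# Extra words that would otherwise dominate a diary's topics.
EXTRA = ["dear", "diary"]

# One case-insensitive pattern matching any listed word as a whole word.
PATTERN = re.compile(r"\b(?:" + "|".join(MONTHS + DAYS + EXTRA) + r")\b",
                     re.IGNORECASE)

def strip_words(text):
    """Remove the listed words, then collapse the leftover whitespace."""
    cleaned = PATTERN.sub("", text)
    return re.sub(r"\s+", " ", cleaned).strip()

print(strip_words("Dear diary today was a lovely Sunday in April"))
```

One caveat with this approach: ‘may’ is both a month and an ordinary verb, so stripping it removes both uses. Whole-word matching at least leaves words like ‘maybe’ alone.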

It’s a small text, in the grand scheme of things, clocking in at around 40,000 words. If you want to see what one historian did with MALLET and a diary kept for 27 years, Cameron Blevins has a very enthusiastic blog post peppered with very nice R visualizations.

Algorithms and academic research

Last Friday, I attended Computers and Crowds: Unexpected Authors and Their Impact on Scholarly Research at the CUNY Graduate School of Journalism, an excellent event organized by the LACUNY Emerging Tech Committee, LILAC, and OLS.

My notes are available as a very messy PDF of my scribbles made with Paper (iPad app). Another version of the presentation slides by Paul Zenke and Kate Peterson is also online under the title “Black Hats, Farms, and Bubbles.”

Impressions, connections, and resolutions:

What’s a filter bubble? As a web algorithm learns what you are more interested in, you are given more of what you tend to like. It’s a positive feedback loop. The downside of this is that you get less exposure to material that makes you uncomfortable or challenges your preconceptions/politics.
- You need to have a balanced information diet. It was hard enough before personalized filters became the norm — now it’s harder!
- See: Eli Pariser: Beware online “filter bubbles” (10-minute TED talk)
One of the library’s roles may be providing a place of neutrality. We can better provide neutral information for our users by installing tools that increase user privacy and decrease tracking, especially if these might be inconvenient or undesirable to use at home.
Some practices to protect yourself and your students from unwanted tracking:
- clear your history and cookies regularly
- use ad blocking software
- see who’s tracking you using Collusion (Chrome & Firefox plugin)
- use private browsing
- understand how to de-personalize your Google search results
- try out alternatives like DuckDuckGo
Challenge students to evaluate not just the resource, but also the algorithms that led them there.
- Why might one article rise to the top of the results list using Google or an academic database?
- How would they design a system to recommend material to a friend?
Challenge yourself to understand and compare these algorithms and filters. Do the legwork and the research to ensure you’re providing your students with acceptable platforms for information hunting, consumption, and creation.
- For example, if you use Primo, familiarize yourself with ScholarRank
Algorithm-created content is already here. Narrative Science is hugely successful. NLP and text mining are changing journalism and are on their way to changing academic writing as well.
- Algorithmically created essays might be the next cheating trend. I have heard of online education programs (MOOCs, probably) asking students for a portfolio of past writing so that stylometric analysis can check whether new work is really theirs.
- See also: “The Great Automatic Grammatizator,” a prescient short story by Roald Dahl
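
The stylometric check mentioned above can be illustrated with a toy Python sketch: compare how often two texts use common function words, which tend to reflect writing habit more than subject matter. Everything here is an assumption for illustration (the short word list, the sample sentences, the cosine measure); real authorship tools use much larger feature sets and far longer writing samples.

```python
import math
import re
from collections import Counter

# A tiny, hypothetical list of function words; real studies use many more.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it", "but", "i"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Made-up samples standing in for a student's portfolio and a new essay.
past = "I went to the park and it was cold, but I stayed in the sun."
new = "I ran to the store and it was closed, but I waited in the car."
print(round(cosine(profile(past), profile(new)), 3))
```

A score near 1 means the two texts use these words in similar proportions. On samples this short the number means very little; it only shows the shape of the comparison.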
