Category: Tool

Introducing myself to MALLET

Backstory

In my text mining class at GSLIS, we had a lot of ground to cover. It was easy enough to jump into Oracle SQL Developer and Data Miner and plug into the Oracle database that had been set up for us, and we moved on to processing and classifying merrily. But now, a year later, I’m totally removed from that infrastructure. I wanted to review my work from that class before heading to EMDA next (!) week, but reacquainting myself with Data Miner would require setting up the whole environment first. Not totally understanding the Oracle ecosystem, I thought it would be easy enough to set up a VirtualBox VM and install Linux as needed, but after several failures I gave up and decided to try something new. As it turns out, MALLET not only does classification, but topic modeling, too — something I’d never done before.

What is?

Here’s how I understand it: topic modeling, like other text mining techniques, considers text as a ‘bag of words’ that is more or less organized. It draws out clusters of words (topics) that appear to be related because they statistically occur near each other. We’ve all been subjected to wordles — this is like DIY wordles that can get very specific and can seem to approach semantic understanding with statistics alone.

One tool that DH folks mention often is MALLET, the MAchine Learning for LanguagE Toolkit, open-source software developed at UMass Amherst starting in 2002. I was pleased to see that it not only models topics, but does the things I’d wanted Oracle Data Miner to do, too — classify with decision trees, Naïve Bayes, and more. There are many tutorials and papers written on/about MALLET, but the one I picked was Getting Started with Topic Modeling and MALLET from The Programming Historian 2, a project out of CHNM. The tutorial is very easy to follow and approaches the subject with a DH-y literariness.

Exploration

One of my favorite small test texts is personally significant to me — my grandmother’s diary from 1943, which she kept as a 16-year-old girl in Hawaii. I transcribed it and TEI’d it a while ago. I split up my plain-text transcript by month, stripped month and day names from the text (so words wouldn’t necessarily cluster around ‘april’), and imported the 12 .txt files into MALLET. Following the tutorial’s instructions, I ran train-topics and came out with data like this:

January: home diary school ve today feel god parents war eyes hours friends make esther changed beauty class true man
February: dear girls thing taxi job find wouldn afraid filipino year american beauty live woman movies happened shoes family makes
March: papa don mommy asuna men americans nature realize simply told voice world bus skin ha ago japanese blood diary
April: dear diary town made white fun dressed learn sun hour days rest week blue soldiers navy kids straight pretty
May: dear girls thing taxi job find wouldn afraid filipino year american beauty live woman movies happened shoes family makes
June: papa don mommy asuna men americans nature realize simply told voice world bus skin ha ago japanese blood diary
July: red day leave dance min insular top idea half country lose realized servicemen lot breeze ahead appearance change lie
August: betty wahiawa taxi set show mr wanted party mama ve wrong insular helped played dinner food chapman fil hawaiian
September: betty wahiawa taxi set show mr wanted party mama ve wrong insular helped played dinner food chapman fil hawaiian
October: johnny rose nice supper breakfast tiquio lunch lydia office ll raymond theater tonight doesn tomorrow altar kim warm forget
November: didn left papa richard long met told house back felt sat gave hand don sweet called meeting dress miss
December: ray lydia dorm bus lovely couldn caught ramos asked kissed park waikiki close st arm loved xmas held world

Note that some clusters appear twice. MALLET treats the directory of .txt files as its whole corpus, then reports which clusters each file is most closely associated with, so two months can end up sharing the same top topic.
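For reference, the two MALLET commands the tutorial walks through look roughly like this (the file names and topic count here are just my choices, not anything canonical):

```shell
# Import the directory of monthly .txt files as a single MALLET corpus.
# --keep-sequence preserves word order, which topic training requires;
# --remove-stopwords drops common English words like 'the' and 'and'.
bin/mallet import-dir --input diary_months/ --output diary.mallet \
    --keep-sequence --remove-stopwords

# Train a topic model, writing the top words per topic to one file and
# each document's topic composition to another.
bin/mallet train-topics --input diary.mallet --num-topics 12 \
    --output-topic-keys diary_keys.txt --output-doc-topics diary_composition.txt
```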

As you can see, I should really have taken out ‘dear’ and ‘diary.’ But I can see that these clusters make sense. She begins the diary in mid-January. It’s her first diary ever, so she tends first toward the grandiose, talking about changes in technology and what it means to be American, and later begins to write about the people in her life, like Betty, her roommate, and Tiquio, the creepy taxi driver. In almost all of the clusters, the War shows up somehow. But what I was really looking forward to was seeing how her entries’ topics changed in December, when she began dating Ray, the man who would be my grandfather. Aww.
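Next time I’d strip those words out of the transcripts before importing them. A minimal Python sketch of the kind of preprocessing I mean — the function name and word list are just examples, not part of MALLET:

```python
import re

def strip_words(text, words):
    """Remove the given words from a text (case-insensitive, whole
    words only), collapsing any leftover runs of whitespace."""
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b",
                         re.IGNORECASE)
    return re.sub(r"\s+", " ", pattern.sub(" ", text)).strip()

# e.g. drop diary boilerplate plus month names before importing into MALLET
custom_stopwords = ["dear", "diary", "january", "february", "march"]
print(strip_words("Dear Diary, today in January I went to school.",
                  custom_stopwords))
```

(MALLET’s import-dir can also take an extra stopword file, which would do the same job without a preprocessing pass.)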

It’s a small text, in the grand scheme of things, clocking in at around 40,000 words. If you want to see what one historian did with MALLET and a diary kept for 27 years, Cameron Blevins has an enthusiastic blog post peppered with nice R visualizations.

Generating red link lists for Wikipedia

Red link lists

On Global Women Wikipedia Write-In Day, I was extremely impressed with user Dsp13’s lists of red links — lists of notable women who hadn’t yet been written about on Wikipedia. I used that page as a springboard to write about some notable women in American history, like the wonderful Agnes Surriage Frankland. Dsp13 took these lists of names from resources like Famous American Women: A Biographical Dictionary, signifying the notability of the listed names and giving editors a place to start their research.

I wanted to do something similar for printing history, a research interest of mine. Here is a red link list I cobbled together. As it turns out, there are tons of other red link lists, too! I’m not sure how other people are generating them — probably from databases or other lists already in digital form. (Any info?) But many good resources are in book form, sometimes keyboarded but likely scanned and OCR’d. To make my red link lists, I’m taking indexes from scanned books and generating lists in wiki format for my user page.

Index, wikified and listified
Messy OCR’d book index » cleaned up and put in wiki format » list on Wikipedia

Generating wiki lists from indexes

  1. Find an interesting, useful book with an index that’s been keyboarded or OCR’d. These will likely be on Gutenberg or Internet Archive. (Example.)
  2. Copy/paste the index into a plain text editing program like TextWrangler.
  3. Strip out the unnecessary stuff in the index (like page numbers), manually remove redundant/unimportant lines (optional), and format the list for Wikipedia (switch reversed names, put in *[[title]] formatting, split into columns). I wrote a couple of messy Python scripts for this step.
  4. Copy/paste the resulting text into your user page. 
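My scripts are messy, but the core of step 3 boils down to something like this. It’s a simplified sketch: the function name is mine, and the parsing assumes tidy "Last, First, pages" index lines, which real OCR rarely delivers without manual cleanup first.

```python
def index_line_to_wikilink(line):
    """Turn one index line like 'Franklin, Benjamin, 12, 98-101'
    into a wiki list item like '*[[Benjamin Franklin]]'.
    Returns None for lines that don't look like 'Last, First, pages'."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) < 2 or not parts[0] or not parts[1]:
        return None
    last, first = parts[0], parts[1]
    if not first[0].isalpha():   # second field is a page number, so skip
        return None
    return "*[[{} {}]]".format(first, last)   # reverse 'Last, First'

index = [
    "Franklin, Benjamin, 12, 98-101",
    "Surriage Frankland, Agnes, 55",
    "212",                       # stray OCR junk gets dropped
]
for line in index:
    link = index_line_to_wikilink(line)
    if link:
        print(link)
```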

Now you can get a quick visual of how many of these entities still need to be written up!

Note that some entities might be notable, and some might not be. And of course, some blue-linked wiki pages might not describe the right entity or might lead to a disambiguation page. Regardless, it’s a place to start!

New toy: Raspberry Pi

So, first, let me just say that this toy is new only in that it is newly in my possession. I know I’m behind the game by quite a while. When my software engineer dad excitedly told me he’d gotten on the Raspberry Pi waitlist many moons ago, I said “cool” and that was that.

But then I became an Emerging Technologies Librarian almost entirely by accident, through a series of improbable events. I was caught off-guard but am now diving enthusiastically into technologies both emerging and already emerged, including the Raspberry Pi. What’s nice about being late to the game is that others have blazed forth with inspiring projects.

Pi

After perusing the Adafruit Raspberry Pi tutorials, I got so excited that I bought myself the Raspberry Pi starter kit for a hundred bucks. I went through the easy Gmail LED notifier tutorial and am working on getting sound-making buttons operative. Both mini-projects are introducing me to how the RPi works and what the heck a breadboard is even used for and what resistors do. If there’s one thing I’m enjoying, it’s learning how utterly ignorant I am about such non-complexities as press button, make beep sound. Here I should mention that the RPi was originally designed for children.

And what do I plan to do with this? It usually takes me a while to get confident enough with a new skill set, but I hope to find a handful of fun uses within our Library. I’ve got my eye on a Little Printer, particularly after attending a delightful Library APIs workshop.

But first, to master fundamentals. Press button, make beep sound.

Device and browser testing tools

If you’re impatient, like me, you might agree that the worst part of making an online resource is testing the interface across devices and browsers. The golden rule that I tout is that all websites I make should look their best no matter how they are accessed: at any resolution, on any common mobile or desktop device, in all common browsers up to 7(ish) years old.

This is hard. Stupid hard. There are so many little quirks in how different devices/browsers choose to display even run-of-the-mill HTML, CSS, and JavaScript. So you either design super simple interfaces, or you do the legwork and make sure your code checks out across the spectrum. But how do you do that?

You can choose to own many devices (or virtual machines) and maintain a stable of old and new browser versions, which is tough but gives you the best taste of the user’s experience. Serious devs and tech companies usually have quite a collection of devices.

Your other option is to rely on emulation software. My favorites:

Testing an Android device (click to enlarge)

BrowserStack — 60(?) minutes of free testing, then $19/mo for individuals. Test your URLs in a ton of browsers and on the most common desktop/mobile platforms. Their emulator is interactive, so you can click, scroll, and type.

Testing IE8 (click to enlarge)

NetRenderer — old reliable here. Test for free in IE 5.5–10. Paste in a URL and get a screenshot of how it looks on the screen, with the option to offset the screenshot by a number of pixels to peek past the fold.

What device/browser testing tools do you use?

Wearable tech

My UP wristband

As sensors become cheaper and more prevalent, we’re beginning to collect more data from our own movements and the objects around us. Moreover, it’s becoming a less geeky thing to do! I’ve been thinking a bit about how we can use wearable tech for library-specific purposes.

Examples of current or upcoming wearable technology:

  • Google Glass, which others have talked about using in the library for various purposes
  • MYO armband: I’ve preordered this based on their first 3-minute video alone. So far my only library use cases are wowing people during presentations and confounding colleagues by turning their ebook pages from across the hall
  • Pebble Watch: Kickstarter-famous epaper watch connects to your smartphone; first reviews range from good to just okay. Primarily for notifications and activity-tracking, but other uses will reveal themselves when the SDK is released
  • Any others?

Other examples of wearable tech for personal use:

  • Jawbone UP wristband: sleep & activity tracking; I use this and like it
  • Nike FuelBand: activity tracking
  • Sound-activated T-shirts, available at any street fair, strictly for goofballs

Article of interest: 9 trends to watch for in wearable tech