Category: Tool

Introducing myself to MALLET

Backstory

In my text mining class at GSLIS, we had a lot of ground to cover. It was easy enough to jump into Oracle SQL Developer and Data Miner and plug into the Oracle database that had been set up for us, and we moved on to processing and classifying merrily. But now, a year later, I’m totally removed from that infrastructure. I wanted to review my work from that class before heading to EMDA next (!) week, but reacquainting myself with Data Miner would require setting up the whole environment first. Not totally understanding the Oracle ecosystem, I thought it would be easy enough to set up a VirtualBox VM and install Linux as needed, but after several failures I gave up and decided to try something new. As it turns out, MALLET not only does classification, but topic modeling, too — something I’d never done before.

What is?

Here’s how I understand it: topic modeling, like other text mining techniques, considers text as a ‘bag of words’ that is more or less organized. It draws out clusters of words (topics) that appear to be related because they statistically occur near each other. We’ve all been subjected to wordles — this is like DIY wordles that can get very specific and can seem to approach semantic understanding with statistics alone.

One tool that DH folks mention often is MALLET, the MAchine Learning for LanguagE Toolkit, open-source software developed at UMass Amherst starting in 2002. I was pleased to see that it not only models topics, but does the things I’d wanted Oracle Data Miner to do, too — classify with decision trees, Naïve Bayes, and more. There are many tutorials and papers written on/about MALLET, but the one I picked was Getting Started with Topic Modeling and MALLET from The Programming Historian 2, a project out of CHNM. The tutorial is very easy to follow and approaches the subject with a DH-y literariness.

Exploration

One of my favorite small test texts is personally significant to me — my grandmother’s diary from 1943, which she kept as a 16-year-old girl in Hawaii. I transcribed it and TEI’d it a while ago. I split up my plain-text transcript by month, stripped month and day names from the text (so words wouldn’t necessarily cluster around ‘april’), and imported the 12 .txt files into MALLET. Following the tutorial’s instructions, I ran train-topics and came out with data like this:

January: home diary school ve today feel god parents war eyes hours friends make esther changed beauty class true man
February: dear girls thing taxi job find wouldn afraid filipino year american beauty live woman movies happened shoes family makes
March: papa don mommy asuna men americans nature realize simply told voice world bus skin ha ago japanese blood diary
April: dear diary town made white fun dressed learn sun hour days rest week blue soldiers navy kids straight pretty
May: dear girls thing taxi job find wouldn afraid filipino year american beauty live woman movies happened shoes family makes
June: papa don mommy asuna men americans nature realize simply told voice world bus skin ha ago japanese blood diary
July: red day leave dance min insular top idea half country lose realized servicemen lot breeze ahead appearance change lie
August: betty wahiawa taxi set show mr wanted party mama ve wrong insular helped played dinner food chapman fil hawaiian
September: betty wahiawa taxi set show mr wanted party mama ve wrong insular helped played dinner food chapman fil hawaiian
October: johnny rose nice supper breakfast tiquio lunch lydia office ll raymond theater tonight doesn tomorrow altar kim warm forget
November: didn left papa richard long met told house back felt sat gave hand don sweet called meeting dress miss
December: ray lydia dorm bus lovely couldn caught ramos asked kissed park waikiki close st arm loved xmas held world

Note that some clusters appear twice. MALLET treats the directory of .txt files as its whole corpus, then reports which clusters each file is most closely associated with, so two months can end up sharing the same top topic.
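For reference, the two MALLET commands the tutorial walks through look roughly like this (the file names and topic count here are just my choices, not anything canonical):

```shell
# Import the directory of monthly .txt files as a single MALLET corpus.
# --keep-sequence preserves word order, which topic training requires;
# --remove-stopwords drops common English words like 'the' and 'and'.
bin/mallet import-dir --input diary_months/ --output diary.mallet \
    --keep-sequence --remove-stopwords

# Train a topic model, writing the top words per topic to one file and
# each document's topic composition to another.
bin/mallet train-topics --input diary.mallet --num-topics 12 \
    --output-topic-keys diary_keys.txt --output-doc-topics diary_composition.txt
```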

As you can see, I should really have taken out ‘dear’ and ‘diary.’ But I can see that these clusters make sense. She begins the diary in mid-January. It’s her first diary ever, so she tends first toward the grandiose, talking about changes in technology and what it means to be American, and later begins to write about the people in her life, like Betty, her roommate, and Tiquio, the creepy taxi driver. In almost all of the clusters, the War shows up somehow. But what I was really looking forward to was seeing how her entries’ topics changed in December, when she began dating Ray, the man who would be my grandfather. Aww.
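Next time I’d strip those words out of the transcripts before importing them. A minimal Python sketch of the kind of preprocessing I mean — the function name and word list are just examples, not part of MALLET:

```python
import re

def strip_words(text, words):
    """Remove the given words from a text (case-insensitive, whole
    words only), collapsing any leftover runs of whitespace."""
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b",
                         re.IGNORECASE)
    return re.sub(r"\s+", " ", pattern.sub(" ", text)).strip()

# e.g. drop diary boilerplate plus month names before importing into MALLET
custom_stopwords = ["dear", "diary", "january", "february", "march"]
print(strip_words("Dear Diary, today in January I went to school.",
                  custom_stopwords))
```

(MALLET’s import-dir can also take an extra stopword file, which would do the same job without a preprocessing pass.)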

It’s a small text, in the grand scheme of things, clocking in at around 40,000 words. If you want to see what one historian did with MALLET and a diary kept for 27 years, Cameron Blevins has an enthusiastic blog post peppered with nice R visualizations.

Generating red link lists for Wikipedia

Red link lists

On Global Women Wikipedia Write-In Day, I was extremely impressed with user Dsp13’s lists of red links — lists of notable women who hadn’t yet been written about on Wikipedia. I used that page as a springboard to write about some notable women in American history, like the wonderful Agnes Surriage Frankland. Dsp13 took these lists of names from resources like Famous American Women: A Biographical Dictionary, signifying the notability of the listed names and giving editors a place to start their research.

I wanted to do something similar for printing history, a research interest of mine. Here is a red link list I cobbled together. As it turns out, there are tons of other red link lists, too! I’m not sure how other people are generating them — probably from databases or other lists already in digital form. (Any info?) But many good resources are in book form, sometimes keyboarded but likely scanned and OCR’d. To make my red link lists, I’m taking indexes from scanned books and generating lists in wiki format for my user page.

Index, wikified and listified
Messy OCR’d book index » cleaned up and put in wiki format » list on Wikipedia

Generating wiki lists from indexes

  1. Find an interesting, useful book with an index that’s been keyboarded or OCR’d. These will likely be on Gutenberg or Internet Archive. (Example.)
  2. Copy/paste the index into a plain text editing program like TextWrangler.
  3. Strip out the unnecessary stuff in the index (like page numbers), manually remove redundant/unimportant lines (optional), and format the list for Wikipedia (switch reversed names, put in *[[title]] formatting, split into columns). I wrote a couple of messy Python scripts for this step.
  4. Copy/paste the resulting text into your user page. 
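My scripts are messy, but the core of step 3 boils down to something like this. It’s a simplified sketch: the function name is mine, and the parsing assumes tidy "Last, First, pages" index lines, which real OCR rarely delivers without manual cleanup first.

```python
def index_line_to_wikilink(line):
    """Turn one index line like 'Franklin, Benjamin, 12, 98-101'
    into a wiki list item like '*[[Benjamin Franklin]]'.
    Returns None for lines that don't look like 'Last, First, pages'."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) < 2 or not parts[0] or not parts[1]:
        return None
    last, first = parts[0], parts[1]
    if not first[0].isalpha():   # second field is a page number, so skip
        return None
    return "*[[{} {}]]".format(first, last)   # reverse 'Last, First'

index = [
    "Franklin, Benjamin, 12, 98-101",
    "Surriage Frankland, Agnes, 55",
    "212",                       # stray OCR junk gets dropped
]
for line in index:
    link = index_line_to_wikilink(line)
    if link:
        print(link)
```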

Now you can get a quick visual of how many of these entities still need to be written up!

Note that some entities might be notable, and some might not be. And of course, some blue-linked wiki pages might not describe the right entity or might lead to a disambiguation page. Regardless, it’s a place to start!

New toy: Raspberry Pi

So, first, let me just say that this toy is new only in that it is newly in my possession. I know I’m behind the game by quite a while. When my software engineer dad excitedly told me he’d gotten on the Raspberry Pi waitlist many moons ago, I said “cool” and that was that.

But then I became an Emerging Technologies Librarian almost entirely by accident, through a series of improbable events. I was caught off-guard but am now diving enthusiastically into technologies both emerging and already emerged, including the Raspberry Pi. What’s nice about being late to the game is that others have blazed forth with inspiring projects.

Pi

After perusing the Adafruit Raspberry Pi tutorials, I got so excited that I bought myself the Raspberry Pi starter kit for a hundred bucks. I went through the easy Gmail LED notifier tutorial and am working on getting sound-making buttons operative. Both mini-projects are introducing me to how the RPi works and what the heck a breadboard is even used for and what resistors do. If there’s one thing I’m enjoying, it’s learning how utterly ignorant I am about such non-complexities as press button, make beep sound. Here I should mention that the RPi was originally designed for children.

And what do I plan to do with this? It usually takes me a while to get confident enough with a new skill set, but I hope to find a handful of fun uses within our Library. I’ve got my eye on a Little Printer, particularly after attending a delightful Library APIs workshop.

But first, to master fundamentals. Press button, make beep sound.

Device and browser testing tools

If you’re impatient, like me, you might agree that the worst part of making an online resource is testing the interface across devices and browsers. The golden rule that I tout is that all websites I make should look their best no matter how they are accessed: at any resolution, on any common mobile or desktop device, in all common browsers up to 7(ish) years old.

This is hard. Stupid hard. There are so many little quirks in how different devices/browsers choose to display even run-of-the-mill HTML, CSS, and JavaScript. So you either design super simple interfaces, or you do the legwork and make sure your code checks out across the spectrum. But how do you do that?

You can choose to own many devices (or virtual machines) and maintain a stable of old and new browser versions, which is tough but gives you the best taste of the user’s experience. Serious devs and tech companies usually have quite a collection of devices.

Your other option is to rely on emulation software. My favorites:

Testing an Android device (click to enlarge)

BrowserStack — 60(?) minutes of free testing, then $19/mo for individuals. Test your URLs in a ton of browsers and on the most common desktop/mobile platforms. Their emulator is interactive, so you can click, scroll, and type.

Testing IE8 (click to enlarge)

NetRenderer — old reliable here. Test for free in IE 5.5–10. Paste in a URL and get a screenshot of how it looks on the screen, with the option to offset the screenshot by a number of pixels to peek past the fold.

What device/browser testing tools do you use?

Wearable tech

My UP wristband

As sensors become cheaper and more prevalent, we’re beginning to collect more data from our own movements and the objects around us. Moreover, it’s becoming a less geeky thing to do! I’ve been thinking a bit about how we can use wearable tech for library-specific purposes.

Examples of current or upcoming wearable technology:

  • Google Glass, which others have talked about using in the library for various purposes
  • MYO armband: I’ve preordered this based on their first 3-minute video alone. So far my only library use cases are wowing people during presentations and confounding colleagues by turning their ebook pages from across the hall
  • Pebble Watch: Kickstarter-famous epaper watch connects to your smartphone; first reviews range from good to just okay. Primarily for notifications and activity-tracking, but other uses will reveal themselves when the SDK is released
  • Any others?

Other examples of wearable tech for personal use:

  • Jawbone UP wristband: sleep & activity tracking; I use this and like it
  • Nike FuelBand: activity tracking
  • Sound-activated T-shirts, available at any street fair, strictly for goofballs

Article of interest: 9 trends to watch for in wearable tech