Category: How to

Downloading all the items in an Internet Archive collection using Python

The library where I work and play, Lloyd Sealy Library at John Jay College of Criminal Justice, has had the privilege of having 130+ items scanned and put online by the Internet Archive (thanks METRO! thanks marketing dept at John Jay!). These range from John Jay yearbooks to Alger Hiss trial documents to my favorites, the NYPD Annual Reports (great images and early data viz).

For each scanned book, IA generates master and derivative JPEG2000 files, a PDF, Kindle/Epub/Daisy ebooks, OCR’d text, GIFs, and a DjVu document (see example file list). IA does a great job scanning and letting us do QA, but because they load the content en masse onto the internet, there’s no real reason for them to give us hard copies or a disk drive full of the files. But we do want them, because we want offline access to these digital derivatives of items we own.

The Programming Historian published another fantastic post this month: Data Mining the Internet Archive Collection. In it, Caleb McDaniel walks us through the internetarchive Python library and how to explore and download items in a collection.

I adapted some of his example Python scripts to download all 133 items in John Jay’s IA collection at once, without having to write lots of code myself or visit each page. Awesome! I’ve posted the code to my GitHub (sorry in advance for having a ‘miscellaneous’ folder, I know that is very bad) and copied it below.

Note that:

  • it will take HOURS to download all items, like an hour each, since the files (especially the master JP2s) can be quite large, plus IA probably throttles download requests to avoid overloading their servers.
  • before running, you’ll need to sudo pip install internetarchive in Terminal (if using a Mac) or do the Windows equivalent to get the internetarchive Python library.
  • your files will download into their own folders, each named with the item’s IA identifier, in the same directory where you save this .py file

## downloads all items in a given Internet Archive collection
## See http://programminghistorian.org/lessons/data-mining-the-internet-archive for more detailed info

import internetarchive as ia

coll = ia.Search('collection:xxxxxxxx') #fill this in -- searches for the ID of a collection in IA
     ## example of collection page: https://archive.org/details/johnjaycollegeofcriminaljustice
     ## the collection ID for that page is johnjaycollegeofcriminaljustice
     ## you can tell a page is a collection if it has a 'Spotlight Item' on the left

num = 0

for result in coll.results(): #for all items in a collection
     num = num + 1 #item count
     itemid = result['identifier']
     print 'Downloading: #' + str(num) + '\t' + itemid

     item = ia.Item(itemid)
     item.download() #download all associated files (large!)
     print '\t\t Download success.'
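A caveat: the internetarchive library’s interface has changed since I wrote this. On newer versions of the library, I believe the equivalent looks something like the sketch below (using search_items and download; check the library’s current docs before relying on it):

## Untested sketch for newer versions of the internetarchive library
from internetarchive import search_items, download

num = 0
for result in search_items('collection:xxxxxxxx'): #fill in your collection ID
    num = num + 1
    itemid = result['identifier']
    print('Downloading: #' + str(num) + '\t' + itemid)
    download(itemid) #downloads all files into a folder named after the identifier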

LibGuides customization tutorial + CSS template

I posted before about the custom CSS I made for John Jay’s libguides. Here’s a more explicit breakdown of how to create and implement your own CSS (although you’re free to use the chunk of code I provided, too).

At the LACUNY Emerging Tech HTML + CSS workshop that @alevtina and I led this week, we walked through familiarizing ourselves with CSS while experimenting with customizing LibGuides. We used Adobe Dreamweaver, since it was installed on the classroom computers, but I don’t really like Dreamweaver. Typically I use oXygen XML editor ($99 academic price) to edit web code, which is useful because it color-codes tags and tells you if you’ve tried to write something invalid. TextWrangler is also a great plain text editor suitable for HTML/CSS editing (and free!).

How to use custom CSS in LibGuides

‘Edit my account’ page with admin status
  1. Obtain admin status from whoever’s the god of libguides in your library
  2. Head to Dashboard » Admin Stuff » Look & Feel
  3. Scroll down to Code Customizations – Public Pages
  4. Plant your custom CSS code in here.
  5. You can also use custom header HTML in the next pane, if you’d rather go with that than use a Public Banner in the pane above. (I don’t recommend using both a header and banner.)

CSS template

You’ll have to get familiar with the behind-the-scenes structure of LibGuides, particularly the identifiers and classes they use to style specific parts of the page. I made a template below (and on GitHub). It’s a fill-in-the-blank CSS template with identifiers and classes you’ll probably want to change, commented with descriptions of what they are. Many aren’t listed, but you can find them using Inspect Element in Chrome/Firefox or by examining a libguide’s source code.

You can test out this template by using a locally saved libguide. I made one for the workshop — you can find it in the folder 3_libguide when you download the workshop materials at lacuny.site44.com. There are a few other good examples in there, too.

<style type="text/css">

.guidetitle { /* title of the guide */
}

#guideattr { /* "Last updated" "URL" etc. */
}

#bc_libguides_login { /* link to Admin Sign In */
}

#tabsI, #tabs12 { /*tabs at top of page, below guide title, above page content*/
}

#tabsI a, #tabs12 a { /* background = outside of tabs (default border on left) */
}

#tabsI a span, #tabs12 a span { /* inside of tabs (default gradient) */
}

#current a span { /*tab of the page you're on*/
}

.dropmenudiv { /*for sub-pages (none on this sample page) */
}

.stitle { /*band below tabs, with title of page and search box*/
}

#guide_tab_title_bar_page_name { /* title of the page you're looking at (same as the tab you're on) */
}

#guide_header_title { /* contains title of guide */
}

.box_comments, .icomments { /*all references to comments */
}

.headerbox { /* headers for boxes of content*/
}

.headerbox h2 { /* title inside headers for boxes of content*/
}

.headerbox h2 a { /* specify colors of header titles, which are links */
}

.innerbox { /* content in boxes */
}

.linkdesc { /* captions beneath links */
}

</style>
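To make this concrete, here’s what a couple of filled-in rules might look like (the colors and fonts are placeholders; swap in your own school’s palette):

.guidetitle { /* title of the guide */
    font-family: Georgia, serif;
    color: #ffffff;
    background-color: #333366; /* placeholder navy */
}

#current a span { /*tab of the page you're on*/
    background: #ffffff;
    color: #333366;
    font-weight: bold;
}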

Note: Springshare has said they’re rolling out a Bootstrappy LibGuides update soon. No one seems to be sure how deep the overhaul is, or if it’s just another theme. All that to say that these customizations might not last forevs.

Making a Twitter bot in Python (tutorial)

Updated Dec. 2015 to reflect changes on the Twitter Apps page. See bottom of post for even more Twitter bot scripts!

If there’s one thing this budding computational linguist finds delightful, it’s computers that talk to us. From SmarterChild to horse_ebooks to Beetlejuice, I love the weirdness of machines that seem to have a voice, especially when it’s a Twitter bot that adds its murmur to a tweetstream of accounts mostly run by other humans.

@cdarwin bot tweets lines from Darwin’s ship log “according to the current date and time so that the Tweets shadow the real world. When it’s the 5th of August here, it’s the 5th August on board ship, albeit 176 years in the past.”

As a fun midnight project a few weeks ago, I cobbled together @MechanicalPoe, a Twitter bot that tweets Poe works line by line on the hour from a long .txt file. This slow-tweeting of text is by no means new—@SlowDante is pretty popular, and so is @CDarwin, among many others. In case you want to make your own, here are the quick ‘n’ easy steps I took. This is just one way of doing it—shop around and see what others have done, too.

Step 1. Choose your text & chunk it. (Look, I hate the word chunk as much as the next person, but it’s like, what else are we going to say, nuggetize?) In any case, I chose some texts from Project Gutenberg and copied them into separate .txt files. (Maybe don’t choose a long-winded writer.) I ran a script over them to split them up by sentence and mark sentences longer than 140 characters. (Link to chunking script.) There are other scripts that break up long sentences intelligently, but I wanted to exert some editorial control over where the splits occurred in the texts, so the script I wrote puts ‘SPLIT’ next to long sentences so I could spot them as I went over the ~600 lines by hand. I copied my chunked texts into one .txt file and marked the beginnings and ends of each individual text. (Link to the finalized .txt file.)
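The chunking script I used is linked above; if you just want the gist, a bare-bones version of the same idea (a sketch, with example filenames like poe.txt that you’d adjust to taste) looks something like this:

# Bare-bones sentence chunker: writes one sentence per line, prefixing
# 'SPLIT' to anything over 140 characters so it can be fixed by hand
import re

text = open('poe.txt', 'r').read().replace('\n', ' ')
sentences = re.split(r'(?<=[.!?]) +', text) #split at sentence-ending punctuation

out = open('lines.txt', 'w')
for s in sentences:
    if len(s) > 140:
        out.write('SPLIT ' + s + '\n') #too long to tweet; edit manually
    else:
        out.write(s + '\n')
out.close()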

Baby’s first Twitter bot. Tweets Poe hourly, except when it doesn’t.

Step 2. Set up your Twitter developer credentials. Set up your bot’s account, then get into the Applications manager and create a new app. Click the Keys and Access Tokens tab. You’ll see it already gave you a Consumer Key and Consumer Secret right off the bat. Scroll down to create a new Access Token.

Step 3. Configure script. You’ll have to install Tweepy, a Python library for working with the Twitter API. Now take a look at this super-simple 27-line script I wrote based on a few other scripts elsewhere. This script is also on my GitHub:


#!/usr/bin/env python
# -*- coding: utf-8 -*-

# by robincamille - for mechanicalpoe

# Tweets a .txt file line by line, waiting an hour between each tweet.
# Must be running all the time, e.g. on a Raspberry Pi, but would be better
# if rewritten to run as a cron task.

import tweepy, time

#Twitter credentials
CONSUMER_KEY = 'xxxxxxxxxxxxxxx'
CONSUMER_SECRET = 'xxxxxxxxxxxxxxx'
ACCESS_KEY = 'xxxxxxxxxxxxxxx'
ACCESS_SECRET = 'xxxxxxxxxxxxxxx'
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)
api = tweepy.API(auth)

#File the bot will tweet from
tweetfile = open('lines.txt', 'r')
f = tweetfile.readlines()
tweetfile.close()

#Tweet a line every hour
for line in f:
     api.update_status(line)
     print line
     time.sleep(3600) #Sleep for 1 hour

You’ll see that it takes a line from my .txt file, tweets it, and then waits for 3600 seconds (one hour). Fill in your developer credentials, and make any changes to the filename and anything else your heart desires.

Step 4. Run script! You’ll notice that this script must always be running—that is, an IDLE window must always be open running it, or a command line window (to run in Terminal, simply write python twitterbot.py, or whatever your filename is). A smarter way would be to run a cron task every hour, and you should probably do that instead, but that requires rewriting the last part of the script (see the sketch below). For me, MechanicalPoe runs on my Raspberry Pi, and it’s pretty much the only thing it’s doing now, so it’s fine for it to be running that script 24/7.

This is how Edgar Allan Poe lives on… Note the lovely 3D-printed case made for me by pal Jeff Ginger
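If you do want the cron version, here’s one way it could work (a sketch, not what MechanicalPoe actually runs): tweet a single line per run and stash your place in a file between runs. The position.txt filename is just something I made up.

#!/usr/bin/env python
# Cron-friendly sketch: tweets ONE line per run and remembers its place
# in position.txt between runs, so cron handles the hourly schedule
# Fill in the same Twitter credentials as in the script above
import tweepy

CONSUMER_KEY = 'xxxxxxxxxxxxxxx'
CONSUMER_SECRET = 'xxxxxxxxxxxxxxx'
ACCESS_KEY = 'xxxxxxxxxxxxxxx'
ACCESS_SECRET = 'xxxxxxxxxxxxxxx'
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)
api = tweepy.API(auth)

lines = open('lines.txt', 'r').readlines()

try:
    pos = int(open('position.txt', 'r').read()) #index of the next line to tweet
except IOError:
    pos = 0 #first run: no position file yet

if pos < len(lines):
    api.update_status(lines[pos])
    open('position.txt', 'w').write(str(pos + 1))

Then a crontab entry like 0 * * * * python /home/pi/twitterbot-cron.py would fire it off every hour.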

Gotchas. So you might encounter some silly text formatting stuff, like encoding errors for quotation marks (but probably not, since the script declares itself UTF-8). You might also make a boo-boo like I did and miss a SPLIT (below) or try to tweet an empty line (you’ll get an error message, “Missing stats”). Also, if you choose a poet like Poe whose lines repeat themselves, Twitter will give you a “Status is a duplicate” error message. I don’t know how long you have to wait before Twitter will accept a repeated line, but that’s why there are gaps in Mechanical Poe’s Twitter record. The script I wrote is too simple to handle this error elegantly. It just crashes, and when you restart it, you’ll have to specify for line in f[125:]: (whatever line you stopped on in your text file, minus 1) to start there instead.

Twitter bot mistake
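If you’d rather have the script skip the problem lines and keep going instead of crashing, you could wrap the tweet in a try/except. A sketch (older tweepy versions raise tweepy.TweepError; tweepy 4.x renamed it to tweepy.errors.TweepyException):

#Same loop as above, but skipping any line Twitter rejects
for line in f:
    try:
        api.update_status(line)
        print line
    except tweepy.TweepError as e:
        print 'Skipping a line: ' + str(e) #e.g. "Status is a duplicate"
    time.sleep(3600)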

Further reading:

Update Dec. 2015: My colleague Mark Eaton and I led a one-day Build Your Own Twitter Bot workshop. We built five ready-made Twitter bots. See the tutorial and get the Python scripts on my GitHub. I updated the above tutorial to reflect a different Apps panel in Twitter, too.

Gentle SEO reminders

I know this dead horse has been beaten. But here are some reminders about things that slip through the cracks.

Every once in a while, google the name and alternate names of your organization and check the universal (not personal) results.

Google results page: before (Note: this is my best approximation. Was too distressed to take a screenshot)
I did this a while ago and was shocked to discover that the one image that showed up next to the results was of someone injecting heroin into their arm! Oh my god! As it turned out, one of our librarians had written a blog post about drug abuse research and that was a book cover or illustration or something. None of us knew about it because why would we google ourselves? Well, now we google ourselves.

Claim your location on Google+.

Click the “Are you the business owner?” link (pink in screenshot at right). You’ll have to verify before you can make a basic page. But in doing so, you will have some control over the photos that show up next to the place name. For example, I posted some of my better library photographs to our Google+ page, and they soon replaced the heroin arm.

Demote sitelinks as necessary.

Sitelinks are the sub-categories that show up beneath the top search result. In our case, it’s things like ‘Databases’ and ‘How to find books’ — appropriate for a library. But there were also some others, like ‘Useful internet links’ (circa 2003) that were no longer being updated, so once verified as webmasters, we demoted them.

Check out your reviews.

Since place-based search is the thing now, you’d better keep tabs on your Foursquare, Google, and other review pages. For one thing, it’s great for identifying pain points in your user experience, since we are now trained to leave passive-aggressive complaints online rather than speak to humans. Example: our Foursquare page has a handful of grievances about staplers and people being loud. Okay, so no surprise there, but we’re trying to leave more positive tips as the place owners so that people see “The library offers Library 101 workshops every fall” when they check in, not “Get off the damn phone!” (verbatim).

Add to your front-page results.

If there are irrelevant or unsatisfactory search results when you look up your organization, remember that you have some form of control. Google loves sites like Wikipedia, Twitter, YouTube, etc., so establishing at least minimal presences on those sites can go far.

Meta tags.

Okay, so this is SEO 101. But I surprised myself this morning when I realized, oh dear, we don’t have a meta description. The text of our search result was our menu options. Turns out Drupal (and WordPress) don’t generate meta tags by default. You’ll have to stick them in there manually or install a module/plug-in. Also, you’ll want to use Open Graph meta tags now. These give social sites more info about what to display. They look like this:

<meta property="og:title" content="Lloyd Sealy Library at John Jay College of Criminal Justice"/>
<meta property="og:type" content="website"/>
<meta property="og:locale" content="en_US"/>
<meta property="og:site_name" content="Lloyd Sealy Library at John Jay College of Criminal Justice"/>
<meta property="og:description" content="The Lloyd Sealy Library is central to the educational mission of John Jay College of Criminal Justice. With over 500,000 holdings in social sciences, criminal justice, law, public administration, and related fields, the Library's extensive collection supports the research needs of students, faculty, and criminal justice agency personnel."/>

All right, good luck. Here’s hoping you don’t have photos of explicit drug use to combat in your SEO strategy.

P.S. If you use the CUNY Commons, try the Yoast WordPress SEO plugin. It is really configurable, down to the post level.

Python + BeautifulSoup + Twitter + Raspberry Pi

In my ongoing experiments with my Raspberry Pi, I’ve been looking for small ways it can be useful for the library. I’ve been controlling my Pi remotely using SSH in Terminal (tutorial — though you’ll have to note your Pi’s IP address first). As I noted yesterday, I’ve been making it tweet, but was looking to have it share information more interesting than a temperature or light reading. So now I have the Pi tweeting our library’s hours on my test account:

Tweeting library hours

To do this, I installed BeautifulSoup, a Python library for working with HTML. My Python script uses BeautifulSoup to parse the library’s homepage and find two spans with the classes date-display-start and date-display-end. (This is looking specifically at a view in Drupal that displays our daily hours.) Then it grabs the content of those spans and plunks it into a string to tweet. Here’s the script:

#!/usr/bin/env python
import tweepy
from bs4 import BeautifulSoup
import urllib3

CONSUMER_KEY = '********************' #You'll have to make an application for your Twitter account
CONSUMER_SECRET = '********************' #Configure your app to have read-write access and sign in capability
ACCESS_KEY = '********************'
ACCESS_SECRET = '********************'

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)
api = tweepy.API(auth)

http = urllib3.PoolManager()

web_page = http.request('GET','http://www.lib.jjay.cuny.edu/')
web_page_data = web_page.data

soup = BeautifulSoup(web_page_data)
openh = soup.find('span','date-display-start') #spans as defined in Drupal view
closedh = soup.find('span','date-display-end')
other = soup.find('span','date-display-single')

if openh: #if library is open today, tweet and print hours
    openh = openh.get_text() + ' to '
    closedh = closedh.get_text()
    api.update_status("Today's Library hours: " + openh + closedh + '.')
    print "Today's Library hours: " + openh + closedh + '.'
elif other: #if other message (e.g. Closed), tweet and print
    other = other.get_text()
    api.update_status("Today's Library hours: " + other + '.')
    print "Today's Library hours: " + other + '.'
else:
    print "I don't know what to do."

Python libraries used: tweepy, BeautifulSoup 4 (bs4), and urllib3.

I’ve configured cron to post at 8am every morning:

sudo crontab -e
[I added this line:]
00 8 * * * python /home/pi/Projects/Twitter/libhours-johnjaylibrary.py

Notes: I looked at setting up an RSS feed based on the Drupal view, since web scraping is clunky, but no dice. Also, there’s no real reason why automated tweeting has to be done on the Pi rather than a regular ol’ computer, other than that I’d rather not have my iMac on all the time. And it’s fun.