Downloading all the items in an Internet Archive collection using Python

The library where I work and play, Lloyd Sealy Library at John Jay College of Criminal Justice, has had the privilege to have 130+ items scanned and put online by the Internet Archive (thanks METRO! thanks marketing dept at John Jay!). These range from John Jay yearbooks to Alger Hiss trial documents to my favorites, the NYPD Annual Reports (great images and early data viz).

For each scanned book, IA generates master and derivative JPEG2000 files, a PDF, Kindle/EPUB/DAISY ebooks, OCR’d text, GIFs, and a DjVu document (see example file list). IA does a great job scanning and letting us do QA, but because they load the content en masse to the internet, there’s no real reason for them to give us hard copies or a disk drive full of the files. But we do want them, because we want offline access to these digital derivatives of items we own.

The Programming Historian published another fantastic post this month: Data Mining the Internet Archive Collection. In it, Caleb McDaniel walks us through the internetarchive Python library and how to explore and download items in a collection.

I adapted some of his example Python scripts to download all 133 items in John Jay’s IA collection at once, without having to write lots of code myself or visit each page. Awesome! I’ve posted the code to my GitHub (sorry in advance for having a ‘miscellaneous’ folder, I know that is very bad) and copied it below.

Note that:

  • it will take HOURS to download all items, like an hour each, since the files (especially the master JP2s) can be quite large, plus IA probably controls download requests to avoid overloading their servers.
  • before running, you’ll need to sudo pip install internetarchive in Terminal (if using a Mac) or do whatever is the equivalent with Windows for the internetarchive Python library.
  • your files will download into their own folders, under the IA identifier, wherever you save this .py file

## downloads all items in a given Internet Archive collection
## See http://programminghistorian.org/lessons/data-mining-the-internet-archive for more detailed info
## Updated for Python 3 and recent releases of the internetarchive library,
## where search_items() and get_item() replace the older Search and Item calls

import internetarchive as ia

coll = ia.search_items('collection:xxxxxxxx') #fill this in -- searches for the ID of a collection in IA
     ## example of collection page: https://archive.org/details/johnjaycollegeofcriminaljustice
     ## the collection ID for that page is johnjaycollegeofcriminaljustice
     ## you can tell a page is a collection if it has a 'Spotlight Item' on the left

num = 0

for result in coll: #for all items in a collection
     num = num + 1 #item count
     itemid = result['identifier']
     print('Downloading: #' + str(num) + '\t' + itemid)

     item = ia.get_item(itemid)
     item.download() #download all associated files (large!)
     print('\t\t Download finished.')
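Since a full run takes hours, it can help to make the script safe to stop and restart. Here is a minimal sketch of one way to do that, assuming Python 3 and a recent release of the internetarchive library (where search_items and download are module-level functions and download accepts ignore_existing). The log file name downloaded.txt and the helper names load_done, mark_done, and download_collection are made up for illustration; they are not part of the library.

```python
import os

DONE_LOG = 'downloaded.txt'  # hypothetical log file, one finished identifier per line


def load_done(path=DONE_LOG):
    """Return the set of identifiers already recorded as finished."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}


def mark_done(itemid, path=DONE_LOG):
    """Append one identifier to the log once its download call has returned."""
    with open(path, 'a') as f:
        f.write(itemid + '\n')


def download_collection(collection_id, path=DONE_LOG):
    """Download every item in a collection, skipping items already in the log."""
    # imported here so the two helpers above can be used without the library
    import internetarchive as ia

    done = load_done(path)
    for result in ia.search_items('collection:' + collection_id):
        itemid = result['identifier']
        if itemid in done:
            print('Skipping (already logged): ' + itemid)
            continue
        print('Downloading: ' + itemid)
        # ignore_existing=True skips files that are already on disk
        ia.download(itemid, ignore_existing=True)
        mark_done(itemid, path)

# To run it, fill in your collection ID:
# download_collection('xxxxxxxx')
```

If the script gets interrupted, running it again should pick up at the first item that is not yet in the log, rather than starting from item #1.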

Data Viz Hack Day Resources

LACUNY Em Tech Committee:

Data Viz Hack Day!

February 18, 2014
John Jay College of Criminal Justice
Shortlink to this page: bit.ly/emtviz

Resources for beginning & intermediate data visualizers:


Abstract visualization of John Jay’s research network


Inspiration

Tutorials

Data sources

Tools

Find a book by call number (bookmark template)

I’ve designed a bookmark for my library to help undergrads find books by call number. It’s a complex concept, so a handheld guide is useful. Our main use case is explaining call numbers to students at the Reference Desk using this bookmark as a visual aid. Our stacks include floor maps and (soon) posters explaining call numbers in a more visual way.

If you’d like to modify the bookmark for your institution, here’s the template for Adobe InDesign. This template is free to use and modify without attribution by anybody in the universe (CC0). Requires Adobe InDesign and the Helvetica font. I’d appreciate any feedback or suggestions!

bookmark_call-number_template.indd (4 MB)

Or if you just want to grab the graphic and you have some editing software, here’s a 300ppi PNG (click for full image):

bookmark_find-book

The bookmark is somewhat CUNY-specific: in step one, I’ve made a mock of what a book record looks like in our catalog, CUNY+. The template helpfully points out what to change when modifying it for your library.

And! It’s a two-fer! You also get the How do I find a call number? bookmark to the left, which is very CUNY-specific but might be a good template to follow. (You’ll get a “missing links” error for the screenshots in this one.)

If you don’t have InDesign, you can grab the text of the bookmark below.


How do I find a book on the shelf?

Step 1. First, find the book’s general location and call number in the catalog. Example:

Library | Location | Call number | Item type | Item status
John Jay College | Stacks | PQ7797.B635 1984 | Regular loan (book can be borrowed) | Look on shelf (book is available)

Step 2. Then find the book on the shelves by its call number.

Stacks | See floor map to find shelf section.
PQ | Find Ps, then find PQ alphabetically.
7797 | In the PQs, find 7,797. Read as a whole number.
.B635 | Find the Bs in the PQ7797 area, then 635 in digit order. The number is a decimal: .B6 occurs after .B599. May be two-part.
1984 | Years are arranged chronologically.

Call number: the “address” that tells you where in the Library a book is located. It’s ordered general → specific.

Can’t find it? Have questions? Ask at the Reference Desk!
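For the programmers in the room, the shelving order from Step 2 can be sketched as a Python sort key. This is only a rough sketch, not a full call number parser: it assumes the simple shape of the example above (class letters, a whole number, one letter-plus-digits part, and an optional year), and the names CALL_NUMBER and shelf_key are made up for illustration. Two-part numbers and other variations are not handled.

```python
import re

# Hypothetical pattern for simple call numbers shaped like "PQ7797.B635 1984"
CALL_NUMBER = re.compile(r'^([A-Z]+)(\d+)\.([A-Z])(\d+)(?:\s+(\d{4}))?$')


def shelf_key(call_number):
    """Turn a simple call number into a tuple that sorts in shelf order."""
    match = CALL_NUMBER.match(call_number.strip())
    if match is None:
        raise ValueError('unsupported call number: ' + call_number)
    letters, number, part_letter, part_digits, year = match.groups()
    return (
        letters,                    # PQ: alphabetical, so P comes before PQ
        int(number),                # 7797: read as a whole number
        part_letter,                # B: alphabetical
        float('0.' + part_digits),  # 635: read as a decimal, so .B599 < .B6
        int(year) if year else 0,   # 1984: chronological
    )


books = ['PQ7797.B635 1984', 'PQ7797.B6 1990', 'PQ7797.B599 2001', 'PQ800.A1 1970']
for book in sorted(books, key=shelf_key):
    print(book)
```

In this sketch PQ800 lands before PQ7797 because 800 is the smaller whole number, and within PQ7797 the order is .B599, then .B6, then .B635, because that part is compared as a decimal.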

Shoutout to all the helpful feedback I got on Twitter and from my colleagues at John Jay! More suggestions welcome in the comments.

Successful social media in our library + using Bit.ly

We’ve upped our social game this academic year since an inspiring LACUNY talk in September 2013. On our library’s Facebook, Twitter, and Instagram, we follow a schedule of Mug Shot Mondays and Throwback Thursdays (#tbt), with other posts peppered in between. #tbt has been super successful on Facebook, in terms of views and clicks, especially since the main college account often re-shares our posts.

Facebook insights December 2013 to January 2014 (I took a 3-week break, hence sporadic posts).
Blue bar = clicks on content; pink bar = Likes, comments, and shares

Our posts have been genuine geek-outs (how cool are these old photos!), but they’ve also been diagnostics and test runs. The students don’t know it yet, but we’ll be leveraging the popularity of our weekly posts to promote our upcoming Digital Collections site and next year’s 50th Anniversary Exhibit. What works? What doesn’t work?

We’re realists — we know that our visual posts are probably one “oh, that’s cool” blip in our students’ Facebook feeds. But as optimists, we always include a relevant link (often in a subscription database) and a source link (to our Special Collections pages), with the hope that we’ll serendipitously inspire further research and interest in our unique materials.

A successful Throwback Thursday post

Facebook’s insights page can give us a pretty good idea of whether people are clicking through to the links we provide. If the link goes to a page on our servers, Google Analytics will also record that click-through. But there’s one more way that I like to track the effectiveness of our links.

Using bit.ly to track success of social media posts

Bit.ly admin page

You can’t see it in the screenshot above, but Colonel Sandusky’s bio from the Facebook photo post got 5 clicks. The shortlink to our Archives page has 42 clicks total, from all of our Archives-related Facebook posts.

Three advantages of Bit.ly:

  • The shortlinks (e.g., bit.ly/jjpexp) look nice in short posts, especially compared to our enormous EZproxy links
  • If you need to include a long link on a poster or slide, a shortlink will make your viewers happy
  • With an account, you can see how often a bit.ly link has been clicked

Three drawbacks to Bit.ly:

  • You can’t export a spreadsheet, to my knowledge, so you’d have to cobble together data if you want a big-picture view. But for a quick peek, it works great
  • You can’t submit a link more than once, so clicks from every post that reuses it get lumped together. Our Archives link has 40+ clicks spread over 5+ posts
  • If you click on the link yourself, even from the admin view, that adds a click to your stats, giving you a distorted view

Two tips for using Bit.ly:

  • See the pencil next to the short link? That means you can customize the link! As you can see, ours in the above image are jjnewslet, jjdcpeek, jjhamby, and mapcrime. Much more human-friendly than something like 1Xoj5nW. (Please customize your shortlink if you’re putting it on a slide or poster!)
    bitly edit  Yikes! »»»  bitly edited  Much better.
  • Edit the link’s title and/or add a note on your admin view to remind yourself where/why each link is listed. Do this especially if your link has an EZproxy prefix, otherwise every link will be titled “Log in with your xxxx username…”

A preliminary conclusion: even our most popular Facebook posts don’t bring in many click-throughs. A little disappointing, but that’s to be expected. People use Facebook when they want to be distracted and scroll quickly through brief diversions, not necessarily when they want to dive deeply into a topic.

Views and clicks are only one measure of success in social media. These numbers are the easiest to track and give the quickest gratification after the effort you put in. But true outreach means increased use and improved perception of the library, which is much harder to quantify at a granular level. (Suggestions?)

I’ll keep updating with other tales and tips for success in social media in our library. Other tips and examples are welcome!

What are emerging tech librarians into this year?

This week, I attended my favorite committee meeting, the LACUNY Emerging Technologies Committee, which I co-chair with Allie Verbovetskaya and Steve Zweibel. We planned out a great semester of workshops and hackathon-style work days by referring to a long list of topics we’ve been compiling. While we wish we could cover everything in a semester, we could only pick a few. But perhaps you’d find it interesting to see this list! What kinds of emerging (or emerged) technologies are librarians into?

Bold italic = we did this last semester.
Bold = we’re doing it this semester.

  • 3D printing
  • Augmented & virtual reality: Oculus Rift, Google Glass
  • AutoHotkey
  • Backup best practices
  • Clojure
  • CMS tours: behind the scenes of Drupal, Omeka, &c
  • Collaborative tools (e.g., Google Docs, Editorially)
  • Data structures, normalization
  • Data viz hackathon (ft. Gephi, R, D3)
  • eResource mgmt: SFX, Serials Solutions
  • Gaming software
  • GIS
  • Google Analytics, beyond SEO
  • HTML & CSS for library web services
  • LaTeX
  • Legacy computing/computers
  • LibGuides API
  • Makerspace tour + happy hour
  • Mapping your library
  • MarcEdit
  • Microcomputing: RPi, Arduino, Makey Makey!
  • MySQL / XAMPP
  • Pedagogical design software for teaching critical information literacy skills
  • PHP
  • Preparing to accept digital archival materials
  • Python & MARC
  • Python hackathon (ft. CSVs, regexes)
  • R
  • Raspberry Pi
  • Regular expressions
  • Responsive web design
  • RFID
  • Ruby
  • Semantic Web/Linked Data
  • SPSS
  • Tacit knowledge (e.g., keyboard shortcuts)
  • Twitter Bootstrap implementation
  • Usability testing
  • Version control (Git, SVN)
  • Video tutorial creation & editing
  • Web frameworks: Node.js, Twitter Bootstrap, HTML5 Boilerplate
  • Wikipedia: sponsoring edit-a-thons and/or generating traffic to your library’s resources
  • WorldCat API
  • XML (simple editing)

We also held a popular Demo Day last semester where any CUNY librarian could share a digital project they’ve been working on, big or small, finished or in progress. We’ll be doing that each semester.

What’s missing from this list, readers?

Suggested by Rob Sanderson: linked data