CollectiveAccess workflow

I’ve gotten a few emails lately from other library/archive organizations asking about how we use CollectiveAccess, open-source software for digital collections at museums, archives, and libraries. Our Digital Collections at John Jay launched earlier this year and runs on CollectiveAccess. We’re really happy with it! Since it’s designed for archival-style content from the get-go, there are a lot of really nice library-friendly touches.

For those considering CollectiveAccess, it might be helpful to see what it looks like to use the software. CollectiveAccess takes a good amount of elbow grease to get up and going (more than Omeka, for instance), but the workflows are pretty straightforward once your installation is stable.

Uploading objects to CollectiveAccess

So how exactly do you populate your CollectiveAccess site? First, I’ll define a few special words that CollectiveAccess uses:

object: the thing you digitized. E.g., a photograph, a book, a document. Our rule of thumb is that one physical object = one digital object. Each object is of one type…

object type: what category is the thing? This will affect what metadata fields you’ll fill in. For instance, our object type “Trial transcript” has fields for “court” and “stenographer’s number,” which only apply to this object type.

media representation: an uploaded file. One object can have multiple representations. A photograph-type object might have two media representations: scans of the front and back. Or an oral history might have a PDF and several audio clips.

collection: the conceptual group that contains objects. A collection can have multiple objects. Again, our rule of thumb is one physical collection in the archives = one digital collection. Makes it easy! Makes total sense! (Okay, sometimes we fudge a little.) See our list of collections in our Digital Collections.
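If it helps to see how those four terms relate, here is a minimal Python sketch. It is not CollectiveAccess's actual database schema; every class and field name below is made up purely to mirror the vocabulary above.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical names throughout: this mirrors the vocabulary above,
# not the real CollectiveAccess schema.

@dataclass
class MediaRepresentation:
    filename: str  # one uploaded file, e.g. a scan or an audio clip

@dataclass
class ArchiveObject:
    identifier: str
    object_type: str  # decides which metadata fields apply
    representations: List[MediaRepresentation] = field(default_factory=list)

@dataclass
class Collection:
    name: str
    objects: List[ArchiveObject] = field(default_factory=list)

# One photograph object with two representations: front and back scans
photo = ArchiveObject("photo_001", "Photograph")
photo.representations.append(MediaRepresentation("photo_001_front.tif"))
photo.representations.append(MediaRepresentation("photo_001_back.tif"))

archives = Collection("John Jay College Archives", [photo])
print(len(archives.objects), len(photo.representations))
```

The point is only the shape: a collection holds objects, an object has exactly one type, and an object can carry several uploaded files.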

Note: the workflows below are just how we use the software. Other places may differ. But it’s useful to see examples. This also assumes that you’re logged into the back end and your metadata schema are good to go.

Screenshot of CollectiveAccess, editing a single object

Our workflow for uploading objects one at a time

Example: we had student workers create the John Jay College Archives collection by scanning and inputting metadata, one thing at a time (reviewed later by librarians)

  1. Click “New object” in CollectiveAccess, choosing appropriate object type 
  2. Write in metadata, either basic or complete, following your organization’s conventions 
  3. Upload object (can’t be done first, as uploaded item must have identifier to latch on to, assigned in step 2) 
  4. Review, then make publicly accessible 

Template for data import in CollectiveAccess. This works in conjunction with another spreadsheet that has metadata related to cases on it. (Click to see larger image, or email me for more example templates)

Our workflow for batch uploads, when we already have all metadata and media files

Example: migrating files and metadata out of an old database, which is what we’re currently doing for our trial transcripts collection

  1. Batch-upload metadata, using the filename as identifier 
    • data import for CA is complicated to understand at first, but once you get your spreadsheets and templates in order, it’s amazing and fast
    • this step creates a bunch of objects that don’t have media files attached to them (they’re just records) 
    • you might have to do multiple data imports, to split up big data or because you have complicated data (e.g., we have lots of overlapping person data: defendants, judges, etc.)
  2. Batch-upload files, matching on filename to existing objects. Takes a while
  3. Review, then make publicly accessible
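
Because step 2 depends on filenames lining up with the identifiers created in step 1, it can be worth checking for strays before starting a long upload. Here is a rough sketch of such a pre-flight check. It is not how CollectiveAccess does its own matching, and the identifiers and filenames are invented for illustration.

```python
import os

def match_files_to_records(record_ids, filenames):
    """Pair each media file with the record whose identifier equals the
    filename minus its extension. Returns (matches, unmatched_files)."""
    known = set(record_ids)
    matches = {}
    unmatched = []
    for name in filenames:
        stem, _ext = os.path.splitext(name)
        if stem in known:
            matches.setdefault(stem, []).append(name)
        else:
            unmatched.append(name)
    return matches, unmatched

# Hypothetical identifiers and filenames, for illustration only
records = ["transcript_0001", "transcript_0002"]
files = ["transcript_0001.pdf", "transcript_0002.pdf", "stray_scan.pdf"]

matches, unmatched = match_files_to_records(records, files)
print(matches)
print(unmatched)
```

Anything that lands in the unmatched list would have no record to attach to, so it is a good candidate for renaming or for another metadata import.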

When you upload a file to CollectiveAccess, it can take a while because it creates a lot of derivatives. For example, one uploaded photo generates all these files:

Screenshot of derivative filenames from CollectiveAccess

It also stores the original file, though it’s up to you to decide which derivative you allow users to download, if any. Our users can view objects in high resolution (in a special image viewing frame) and download full PDFs, but can only download medium-size JPGs for images. For print quality-size images, a user must contact our Special Collections librarian. This ensures accurate citations.

NYC-area CollectiveAccess events

The CollectiveAccess software is made right here in the city! In September, the friendly CollectiveAccess developers led a workshop at METRO that walked us through configuring new CA installations and importing sample data. The workshop materials are still online and are incredibly useful in piecing together the data import process.

I’m the convener of the CollectiveAccess User Group here in NYC. Our next meeting is Monday, December 1, 2014 at 10am at METRO. We’ll get behind-the-scenes tours of CollectiveAccess installations at Brooklyn Navy Yard, Roundabout Theatre Company, Booklyn, and New York Society Library. The CA team attends User Group meetings, too, and is as helpful and responsive in person as they are in the support forums. If you’re interested in CollectiveAccess, register for free & join us at METRO!

What did I do this year? 2013–14 edition

Voyant word cloud of 2013-14 activities

It’s self-evaluation time again! It’s the second year I’ve had to do this for my current job. Last year, I found it enormously helpful to quantify and visualize the activities I’d done in the given time period. I use the daily “Done today” entries I write in Evernote, a Python script I wrote last year, the BeautifulSoup Python library, and Voyant Tools to get a holistic look at what I did this year.

Voyant Tools allow different views of the data. One is Cirrus word clouds. (When used in combo with other data tools, word clouds are useful.) The image at the top of this entry is the word cloud that ignores common stop words, my colleagues’ names, and the words ref, desk, email/s/ed, met, meeting, talked, hr (hour), and sent.
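
Voyant handles the stop-word filtering itself, but the idea is simple enough to sketch in plain Python. This is not the script mentioned above, just a minimal illustration; the short stop-word set is a stand-in for a real list, and the sample entries are made up.

```python
import re
from collections import Counter

# A tiny stand-in for a real stop-word list, plus the extra words named above
COMMON_STOP_WORDS = {"the", "a", "an", "and", "to", "of", "with", "about", "for"}
EXTRA_STOP_WORDS = {"ref", "desk", "email", "emails", "emailed", "met",
                    "meeting", "talked", "hr", "sent"}

def word_counts(text, stop_words):
    """Count lowercase words in text, skipping anything in stop_words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)

# Made-up "Done today" entries, for illustration only
entries = "Ref desk 2 hr. Met with systems about the proxy. Sent email about proxy."

counts = word_counts(entries, COMMON_STOP_WORDS | EXTRA_STOP_WORDS)
print(counts.most_common(3))
```

Passing only the common stop words instead of the combined set gives the second kind of cloud, where routine words like "ref" and "desk" dominate.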

Here’s the word cloud that only ignores common stop words:

Analyzing EZproxy logs with Python

We use EZproxy to provide off-campus users with access to subscription resources that require a campus-specific login. Every time a user visits an EZproxy-linked page (mostly by clicking on a link in our list of databases), that activity is logged. The logs are broken up monthly as either complete (~1 GB for us) or abridged (~10 MB). The complete logs look something like this:

EZproxy log snippet example — click to enlarge

The complete logs log almost everything, including all the JavaScript and favicons loaded onto the page the user signs into. Hence why they are a gig large. The abridged logs have the same format as the illustration above, but keep only the starting point URLs (SPUs) and are much easier to handle. (Note that your configuration of EZproxy may differ from mine — see OCLC’s log format guide.)
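
As a rough idea of what reading one of these lines looks like in Python, here is a minimal sketch. It assumes an NCSA common-log-style layout (host, two identity fields, bracketed timestamp, quoted request, status, bytes); your LogFormat may carry extra fields such as a session ID, so treat the pattern as a starting point. The sample line is invented.

```python
import re
from datetime import datetime

# Assumes an NCSA common-log-style line: host, two identity fields,
# [timestamp], "request", status, bytes. Adjust to match your LogFormat.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Return a dict of fields, or None if the line doesn't fit the pattern."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    return fields

# Invented example line, not taken from a real log
sample = ('10.11.0.5 - - [22/Apr/2014:11:46:10 -0400] '
          '"GET http://example.com:80/login HTTP/1.1" 200 1234')

parsed = parse_line(sample)
print(parsed["ip"], parsed["time"].month, parsed["status"])
```

Returning None for lines that don't match makes it easy to count how many lines were skipped, which is a quick sanity check that the pattern fits your configuration.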

We can get pretty good usage stats from the individual database vendors, but with monthly logs like these, why not analyze them yourself? You could do this in Excel, but Python is much more flexible, and much faster, and also, I’ve already written the script for you. It very hackily analyzes on- vs. off-campus vs. in-library use, as well as student vs. faculty use.

Use it on the command line like so:
python [directory to analyze] [desired output filename.csv]

Run it over the SPU logs, as that’ll take much less time and will give you a more useful connection count — that is, it will only count the “starting point URL” connections, rather than every single connection (javascript, .asp, favicon, etc.), which may not tell you much.

The script will spit out a CSV that looks like this:

ezproxy analysis script output

With which you can then do as you please.


  • “Sessions” are different from “connections.” Sessions are when someone logs into EZproxy and does several things; a connection is a single HTTP request. Sessions can only be tracked if they’re off-campus, as they rely on a session ID. On-campus EZproxy use doesn’t get a session ID and so can only be tracked with connections, which are less useful. On-campus use doesn’t tell us anything about student vs. faculty use, for instance.
  • Make sure to change the IP address specifications within the script. As it is, it counts “on campus” as IP addresses beginning with “10.” and in-library as beginning with “10.11.” or “10.12.”
  • This is a pretty hacky script. I make no guarantees as to the accuracy of this script. Go over it with a fine-toothed comb and make sure your output lines up with what you see in your other data sources.
  • Please take a good look at the logs you’re analyzing and familiarize yourself with them — otherwise you may get the wrong idea about the script’s output!
  • Things you could add to the script: analysis of SPUs; time/date patterns; …
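
To make the IP note concrete, here is a small sketch of the prefix check, using the prefixes mentioned above. It keeps in-library, on-campus, and off-campus as three separate buckets, which may differ from how the script itself tallies them. The addresses are invented.

```python
from collections import Counter

# Prefixes taken from the note above; change them to your own network's ranges.
LIBRARY_PREFIXES = ("10.11.", "10.12.")
CAMPUS_PREFIX = "10."

def classify_ip(ip):
    # Check the library ranges first, since they also start with "10."
    if ip.startswith(LIBRARY_PREFIXES):
        return "in-library"
    if ip.startswith(CAMPUS_PREFIX):
        return "on-campus"
    return "off-campus"

# Invented addresses, one per connection
ips = ["10.11.3.7", "10.40.1.2", "203.0.113.9", "10.12.0.1", "100.1.2.3"]
tally = Counter(classify_ip(ip) for ip in ips)
print(tally)
```

Note the trailing dot in each prefix: without it, an off-campus address like 100.1.2.3 would be counted as on campus.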

Preliminary findings at John Jay

Here’s one output of the data I made, with the counts of on-campus, off-campus, and in-library connections pegged by month from July 2008 to present, overlaid with lines of best fit:

Click for larger

Off-campus connection increase: Between 2008 and 2014, database use off-campus saw an increase of ballpark 20%. Meanwhile, on-campus use has stayed mostly the same, and library use has dropped by ballpark 15%, although I think I must not be including a big enough IP range, since we’ve seen higher gate counts since 2008. Hm.
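
One way to turn squiggly monthly counts into a ballpark trend is to fit a straight line and compare its two ends. The sketch below does that with ordinary least squares in plain Python. It is not necessarily how the percentages above were worked out, and the monthly numbers are toy values, not John Jay data.

```python
def best_fit(ys):
    """Ordinary least-squares line through (0, ys[0]), (1, ys[1]), ...
    Returns (slope, intercept)."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Toy monthly counts, NOT real John Jay numbers
monthly = [100, 120, 90, 130, 110, 140]

slope, intercept = best_fit(monthly)
start = intercept                               # fitted value at the first month
end = intercept + slope * (len(monthly) - 1)    # fitted value at the last month
print("change along the fitted line: %.1f%%" % (100 * (end - start) / start))
```

Comparing the ends of the fitted line, rather than the first and last raw months, keeps a single unusually quiet or busy month from skewing the estimate.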

Variance: As you can see by the squigglies in the wild ups and downs of the pale lines above, library resource use via EZproxy varies widely month to month. Extreme troughs are obviously when school is not in session. Compared to January, we usually get over 3x the use of library resources in November. The data follows the flow of the school year.

Students vs. faculty: When school is in session, EZproxy use is 90% students and 10% faculty. When school is not in session, those percentages pretty much flip around. (Graph not shown, but it’s boring.) By the numbers, students do almost no research when class is not in session. Faculty are constantly doing research, sometimes doing more when class is not in session.

Data issues: The log for December 2012 is blank. Boo. Throws off some analyses.

If you have suggestions or questions about the script, please do leave a comment!

Downloading all the items in an Internet Archive collection using Python

The library where I work and play, Lloyd Sealy Library at John Jay College of Criminal Justice, has had the privilege to have 130+ items scanned and put online by the Internet Archive (thanks METRO! thanks marketing dept at John Jay!). These range from John Jay yearbooks to Alger Hiss trial documents to my favorites, the NYPD Annual Reports (great images and early data viz).

For each scanned book, IA generates master and derivative JPEG2000 files, a PDF, Kindle/Epub/Daisy ebooks, OCR’d text, GIFs, and a DjVu document (see example file list). IA does a great job scanning and letting us do QA, but because they load the content en masse to the internet, there’s no real reason to give us hard copies or a disk drive full of the files. But we do want them, because we want offline access to these digital derivatives of items we own.

The Programming Historian published another fantastic post this month: Data Mining the Internet Archive Collection. In it, Caleb McDaniel walks us through the internetarchive Python library and how to explore and download items in a collection.

I adapted some of his example Python scripts to download all 133 items in John Jay’s IA collection at once, without having to write lots of code myself or visit each page. Awesome! I’ve posted the code to my Github (sorry in advance for having a ‘miscellaneous’ folder, I know that is very bad) and copied it below.

Note that:

  • it will take HOURS to download all items, like an hour each, since the files (especially the master JP2s) can be quite large, plus IA probably controls download requests to avoid overloading their servers.
  • before running, you’ll need to sudo pip install internetarchive in Terminal (if using a Mac) or do whatever is the equivalent with Windows for the internetarchive Python library.
  • your files will download into their own folders, under the IA identifier, wherever you save this .py file

## downloads all items in a given Internet Archive collection
## See for more detailed info

import internetarchive as ia

coll = ia.search_items('collection:xxxxxxxx') #fill this in -- searches for the ID of a collection in IA
     ## example of collection page:
     ## the collection ID for that page is johnjaycollegeofcriminaljustice
     ## you can tell a page is a collection if it has a 'Spotlight Item' on the left

num = 0

for result in coll: #for all items in a collection
     num = num + 1 #item count
     itemid = result['identifier']
     print('Downloading: #' + str(num) + '\t' + itemid)

     item = ia.get_item(itemid)
     item.download() #download all associated files (large!)
     print('\t\t Download success.')
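
Since a full run takes hours, it may get interrupted partway. Here is a small sketch of one way to work out which items still need fetching, relying only on the fact that each item downloads into a folder named after its identifier. The function name is made up, and a folder that exists may still be incomplete, so treat the result as a hint rather than proof.

```python
import os
import tempfile

def not_yet_downloaded(item_ids, base_dir="."):
    """Return the identifiers that have no folder under base_dir yet."""
    return [i for i in item_ids
            if not os.path.isdir(os.path.join(base_dir, i))]

# Demo in a throwaway directory with made-up identifiers
demo_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_dir, "item_a"))  # pretend item_a was already fetched

pending = not_yet_downloaded(["item_a", "item_b"], demo_dir)
print(pending)
```

In practice you would collect the identifiers from the search results first, then loop over only the pending ones.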

Data Viz Hack Day Resources

LACUNY Em Tech Committee:

Data Viz Hack Day!

February 18, 2014
John Jay College of Criminal Justice
Shortlink to this page:

Resources for beginning & intermediate data visualizers:

Abstract visualization of John Jay’s research network



Data sources