
August 29, 2016

CollectiveAccess importing workflow

This step-by-step workflow illustrates how I import objects (metadata + files) into CollectiveAccess. I’m writing this post partly to give others an idea of how to import content into CollectiveAccess — but mainly it’s for my future self, who will likely have forgotten!

Caveats: Our CollectiveAccess instance is version 1.4, so some steps or options might not be the same for other versions. This is also just a record of what we at John Jay do when migrating/importing collections, so the steps might have to be different at your institution.

Refer to the official CollectiveAccess documentation for much more info: metadata importing and batch-uploading media. These are helpful and quite technical.

CollectiveAccess importing steps

Do all of these steps in a dev environment first to make sure everything is working, then repeat them on your live site.

Create Excel spreadsheet of metadata to migrate
- Here’s our example (.xlsx) from when we migrated some digitized photos from an old repo to CA
- This can be organized however you want, though it may be easiest for each column to be a Dublin Core field. In ours, we have different fields for creators that are individuals vs. organizations.
Create another Excel spreadsheet that will be the “mapping template” aka “importer”
- Download the starter template (.xlsx) from the CA wiki. This whole step is hard to understand, by the way, so set aside some time.
- Here’s our example (.xlsx), which refers to the metadata spreadsheet above.
- Every number in the “Source” column refers to the metadata spreadsheet: 1 is column A, 2 is B, …
- Most of these will be Mapping rules, e.g. if Column A is the title of the object, the rule type would be Mapping, Source would be 1, and CA table element would be ca_objects.preferred_labels
  - Get the table elements from within CA (requires admin account): see Manage → Administration → User interfaces → Your Object Editor [click page icon] → Main Editor [click page icon] → Elements to display on this screen
  - Example row:
    
    Rule type    Source    CA table.element    Options
    Mapping      9         ca_objects.lcsh     {"delimiter": "||"}
- Don’t forget to fill out the Settings section below with importer title, etc.
On your local machine, make a folder of the files you want to import
- Filenames should be the same as identifiers in metadata sheet. This is how CA knows which files to attach to which metadata records
- Only the primary media representations should be in this folder. Put secondary files (e.g., a scan of the back of a photograph) in a different folder. These must be added manually, as far as I know.
Upload the folder of items to import to pawtucket/admin/import.
- Perform chmod 744 to all items inside the folder once you’ve done this, otherwise you’ll get an “unknown media type” error later.
(Metadata import) In CA, go to Import → Data, upload the mapping template, and click the green arrow button. Select the metadata spreadsheet as the data format
- “Dry run” may actually import (bug in v. 1.4, resolved in later version?). So again, try this in dev first.
- Select “Debugging output” so if there’s an error, you’ll see what’s wrong
- This step creates objects that have their metadata all filled out, but no media representations.
- Imported successfully? Look everything over.
(Connect uploaded media to metadata records) In CA, go to Import → Select the directory you uploaded to pawtucket/admin/import (step 4).
- “Import all media, matching with existing records where possible.”
- “Create set ____ with imported media.”
- Put object status as inaccessible, media representation access as accessible — so that you have a chance to look everything over before it’s public. (As far as I know, it’s easy to batch-edit object access, but hard to batch-edit media access)
- On the next screen, CA will slowly import your items. Guesstimate 1.5 minutes for every item. Don’t navigate away from this screen.
Navigate to the set you just created and spot-check all items.
- Batch-edit all objects to accessible to public when satisfied
Add secondary representations manually where needed.
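
To make the mapping template step a little more concrete, here is a minimal Python sketch of how a Source number and the delimiter option can be read. It only illustrates the idea and is not CollectiveAccess's importer code: the function names and the sample cell value are made up for this example.

```python
import json


def source_to_column_letter(source: int) -> str:
    """Convert a 1-based Source number to a spreadsheet column letter (1 -> A, 27 -> AA)."""
    if source < 1:
        raise ValueError("Source numbers start at 1")
    letters = ""
    while source > 0:
        source, remainder = divmod(source - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def split_cell(cell: str, options_json: str) -> list[str]:
    """Split one cell into several values using the "delimiter" option, if present."""
    options = json.loads(options_json) if options_json.strip() else {}
    delimiter = options.get("delimiter")
    if not delimiter:
        return [cell.strip()]
    return [part.strip() for part in cell.split(delimiter) if part.strip()]


# Hypothetical mapping row, modeled on the example row in the template step.
row = {
    "rule": "Mapping",
    "source": 9,
    "element": "ca_objects.lcsh",
    "options": '{"delimiter": "||"}',
}

print(source_to_column_letter(row["source"]))  # I
print(split_cell("Police||Crime--New York (State)", row["options"]))
```

Note that the Options value has to be valid JSON with straight quotes; curly quotes pasted from a word processor will not parse.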

You may need to create multiple metadata spreadsheets and mapping templates if you’re importing a complex database. For instance, for trial transcripts that had multiple kinds of relationships with multiple entities, we just did 5 different metadata imports that tacked more metadata onto existing objects, rather than creating one monster metadata import.
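
Before the media import, it can help to confirm that every identifier in the metadata sheet has a matching file in the folder, since that match is how files get attached to records. The sketch below is one way to check this locally. It assumes a file matches when its name without the extension equals the identifier, and the function name is hypothetical.

```python
from pathlib import Path


def compare_identifiers_to_files(identifiers, media_dir):
    """Return (identifiers with no file, files with no identifier).

    Assumption: a file matches an identifier when its name without the
    extension equals the identifier exactly. Hidden files are ignored.
    """
    stems = {
        path.stem
        for path in Path(media_dir).iterdir()
        if path.is_file() and not path.name.startswith(".")
    }
    wanted = {str(identifier).strip() for identifier in identifiers}
    return sorted(wanted - stems), sorted(stems - wanted)
```

For example, `compare_identifiers_to_files(["photo_001", "photo_002"], "to_import")` would return two lists: identifiers that still need a file, and files that have no record. Both the identifiers and the folder name here are placeholders.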

You can switch steps 5 and 6 (the metadata import and the media import) if you want, I believe, though since 5 is easy to look over and 6 takes a long time to do, I prefer my order.
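
The chmod 744 step after uploading the folder can also be scripted. This is a rough Python equivalent, assuming it runs on the server as a user who owns the uploaded files; the function name is made up, and the folder path you pass in is whatever your upload directory is.

```python
import os
from pathlib import Path


def chmod_tree(folder, mode=0o744):
    """Apply `mode` to every file and subfolder inside `folder`.

    The top-level folder itself is left unchanged. Returns the number
    of items whose permissions were set.
    """
    changed = 0
    for root, dirs, files in os.walk(folder):
        for name in dirs + files:
            os.chmod(Path(root) / name, mode)
            changed += 1
    return changed
```

Running plain `chmod` on the command line does the same job; the script is only handy if you are already automating the upload.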

Again, I urge you to try this on your dev instance of CA first (you should totally have a dev/test instance). And let me know if you want to know how to batch-delete items.

Good luck!

June 15, 2015

CollectiveAccess work environment

I wrote earlier about our CollectiveAccess workflow for uploading objects one-by-one and in a batch. Now I’ll share our CollectiveAccess work environment. We use two Ubuntu servers, development (test) and production (live), both with CollectiveAccess installed on them. We also use a private GitHub repository.

This is only one example of a CollectiveAccess workflow! See the user-created documentation for more.

Any changes to code (usually tweaking the layout of front end, Pawtucket) are made first on the dev instance. Once we’re happy with the changes and have tested out the site in different browsers, we commit & push the code to our private GitHub repository using Git commands on the command line. Then we pull it down to our production server, where the changes are now publicly viewable.
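
As a sketch, the Git part of that cycle looks roughly like the following, wrapped in Python so it could be scripted. The remote name `origin`, the branch name `master`, and the `git add --all` step are assumptions for illustration, not a record of our exact commands.

```python
import subprocess


def dev_commands(message, remote="origin", branch="master"):
    """Commands run on the dev server once the changes have been tested."""
    return [
        ["git", "add", "--all"],  # assumption: stage everything that changed
        ["git", "commit", "-m", message],
        ["git", "push", remote, branch],
    ]


def production_commands(remote="origin", branch="master"):
    """Command run on the production server to make the changes live."""
    return [["git", "pull", remote, branch]]


def run_all(commands, cwd):
    """Run each command inside the repository at `cwd`, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)
```

On the dev server you would call `run_all(dev_commands("Tweak layout"), cwd=...)` with the path to your own checkout, then `run_all(production_commands(), cwd=...)` on production.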

Any changes to objects (uploading or updating objects, collections, etc.) are made directly in the production instance. We never touch the database directly, only through the admin dashboard (Providence). These data changes aren’t done in the dev instance; we only have ~300 objects in the dev server, as more would take up too much room, and there’s no real reason why we should have all our objects on the dev instance. But if there’s a new filetype we’re uploading for the first time, or another reason an object might be funky, we add the object as a test object to the dev server.

Any changes to metadata display (adding a new field in the records) are made through the admin dashboard. I might first try the change on the dev instance, but not necessarily.

Pros of this configuration:

code changes aren’t live immediately and there is a structure for testing
all code changes can be reverted if they break the site
code change documentation is built into the workflow (Git)
objects and metadata are immediately visible to the public
faculty/staff who only work on the collections don’t need to know anything about Git

Cons:

increasing mismatch between the dev and production instances’ objects and metadata display (in the future, we might do a batch import/upload if we need to)
this workflow has no contact with the CollectiveAccess GitHub, so updates aren’t simply pulled, but rather manually downloaded

Not pictured or mentioned above: our servers are backed up on a regular basis, on- and off-site; and anytime there’s a big code update, a snapshot is taken of the database.

CollectiveAccess super user? Add your workflow to the Sample Workflows page!

November 17, 2014

CollectiveAccess workflow

I’ve gotten a few emails lately from other library/archive organizations asking about how we use CollectiveAccess, open-source software for digital collections at museums, archives, and libraries. Our Digital Collections at John Jay launched earlier this year and runs on CollectiveAccess. We’re really happy with it! Since it’s designed for archival-style content from the get-go, there are a lot of really nice library-friendly touches.

For those considering CollectiveAccess, it might be helpful to see what it looks like to use the software. CollectiveAccess takes a good amount of elbow grease to get up and going (more than Omeka, for instance), but the workflows are pretty straightforward once your installation is stable.

Uploading objects to CollectiveAccess

So how exactly do you populate your CollectiveAccess site? First, I’ll define a few special words that CollectiveAccess uses:

object: the thing you digitized. E.g., a photograph, a book, a document. Our rule of thumb is that one physical object = one digital object. Each object is of one type…

object type: what category is the thing? This will affect what metadata fields you’ll fill in. For instance, our object type “Trial transcript” has fields for “court” and “stenographer’s number,” which only apply to this object type.

media representation: an uploaded file. One object can have multiple representations. A photograph-type object might have two media representations: scans of the front and back. Or an oral history might have a PDF and several audio clips.

collection: the conceptual group that contains objects. A collection can have multiple objects. Again, our rule of thumb is one physical collection in the archives = one digital collection. Makes it easy! Makes total sense! (Okay, sometimes we fudge a little.) See our list of collections in our Digital Collections.
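
If it helps to see how these terms relate, here is a small Python sketch of the relationships described above. It is only a mental model, not CollectiveAccess's actual schema: the class names, field names, and sample values are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MediaRepresentation:
    filename: str  # an uploaded file, e.g. a scan or an audio clip


@dataclass
class DigitalObject:
    identifier: str
    title: str
    object_type: str  # e.g. "Photograph" or "Trial transcript"
    representations: list[MediaRepresentation] = field(default_factory=list)


@dataclass
class Collection:
    name: str
    objects: list[DigitalObject] = field(default_factory=list)


# One photograph with two representations: scans of the front and back.
photo = DigitalObject("photo_001", "Example photograph", "Photograph")
photo.representations.append(MediaRepresentation("photo_001_front.tif"))
photo.representations.append(MediaRepresentation("photo_001_back.tif"))

archives = Collection("Example collection", [photo])
print(len(archives.objects), len(photo.representations))  # 1 2
```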

Note: the workflows below are just how we use the software. Other places may differ. But it’s useful to see examples. This also assumes that you’re logged into the back end and your metadata schema are good to go.

Screenshot of CollectiveAccess, editing a single object

Our workflow for uploading objects one at a time

Example: we had student workers create the John Jay College Archives collection by scanning and inputting metadata, one thing at a time (reviewed later by librarians)

Click “New object” in CollectiveAccess, choosing appropriate object type
Write in metadata, either basic or complete, following your organization’s conventions
Upload object (can’t be done first, as the uploaded item must have an identifier to latch on to, which is assigned in step 2)
Review, then make publicly accessible

Template for data import in CollectiveAccess. This works in conjunction with another spreadsheet that has metadata related to cases on it. (Email me for more example templates.)

Our workflow for batch uploads, when we already have all metadata and media files

Example: migrating files and metadata out of an old database, which is what we’re currently doing for our trial transcripts collection

Batch-upload metadata, using the filename as identifier
- data import for CA is complicated to understand at first, but once you get your spreadsheets and templates in order, it’s amazing and fast
- this step creates a bunch of objects that don’t have media files attached to them (they’re just records)
- you might have to do multiple data imports, to split up big data or because you have complicated data (e.g., we have lots of overlapping person data: defendants, judges, etc.)
Batch-upload files, matching on filename to existing objects. Takes a while
Review, then make publicly accessible

When you upload a file to CollectiveAccess, it can take a while because it creates a lot of derivatives. For example, one uploaded photo generates all these files:

Screenshot of derivative filenames from CollectiveAccess

It also stores the original file, though it’s up to you to decide which derivative you allow users to download, if any. Our users can view objects in high resolution (in a special image viewing frame) and download full PDFs, but can only download medium-size JPGs for images. For print quality-size images, a user must contact our Special Collections librarian. This ensures accurate citations.

NYC-area CollectiveAccess events

The CollectiveAccess software is made right here in the city! In September, the friendly CollectiveAccess developers led a workshop at METRO that walked us through configuring new CA installations and importing sample data. The workshop materials are still online and are incredibly useful in piecing together the data import process.

I’m the convener of the CollectiveAccess User Group here in NYC. Our next meeting is Monday, December 1, 2014 at 10am at METRO. We’ll get behind-the-scenes tours of CollectiveAccess installations at Brooklyn Navy Yard, Roundabout Theatre Company, Booklyn, and New York Society Library. The CA team attends User Group meetings, too, and is as helpful and responsive in person as they are in the support forums. If you’re interested in CollectiveAccess, register for free & join us at METRO!

December 13, 2013

Coming soon: Lloyd Sealy Library Digital Collections

From the 1922 NYPD Annual Report: Narcotics, firearms, and ammunition seized (by NYPD) in raid on headquarters of Hip Sing Tongs (Chinese-American criminal organization).

A version of this article was published in Classified Information (fall 2013), Lloyd Sealy Library’s biannual newsletter.

For the first time, the John Jay Library is consolidating its unique digital resources into one online, publicly-accessible collection. The Lloyd Sealy Library Digital Collections will launch in the spring 2014 semester as a premier repository for digitized criminal justice history materials. Researchers will find audio clips of Ed Koch speaking about subway crime, mug shots of notorious Murder, Inc. criminals, trial transcripts from 1920s New York murder cases, and much more in the coming collections.

Research value

A sneak peek of our beta front page (likely to change quite a bit)

The Lloyd Sealy Library is well known for the strength of its criminal justice and social sciences collections. Under the leadership of Chief Librarian Larry Sullivan, formerly the Chief of the Rare Book and Special Collections Division at the Library of Congress, the Special Collections has grown particularly robust, providing valuable material for researchers of criminal justice history in New York City and around the world.

Since the turn of this century, the Library has put a great deal of effort into making these collections accessible online. The Crime in New York 1850-1950 project made available selected photographs from the Burton Turkus Papers and Lewis Lawes Papers, as well as hundreds of trial transcripts from the County of New York. The Library has also digitized nearly 100 rare books with the Internet Archive. In-house, we have made high-quality scans of items from the John Jay College Archives. For the first time, these digital materials will all be browsable, searchable, and downloadable in one place—in addition to brand new material.

Prof. Jeffrey Kroessler, our Circulation Librarian, is contributing his in-progress project, Justice in New York: An Oral History. With the generous support of John Jay supporter Jules Kroll, Prof. Kroessler, sometimes accompanied by Prof. Sullivan, has interviewed dozens of New York City’s leading figures in criminal justice, including former mayor Ed Koch and former police commissioner Patrick V. Murphy. These interviews, rich as both historical reference and anecdote, are a vibrant resource for researchers and passersby alike. In the spring, the full interview transcripts, along with audio clips, will be available online for the first time in the Digital Collections.

More digital research materials are also on the way, the most timely of which are selections from the John Jay College Archives. As the College nears its 50th anniversary in 2014–15, the Library will digitize and catalog more materials from the College’s history. The Archives measure 400 linear feet of records containing images of student life, news clippings, yearbooks, and more. Under the guidance of Interim Special Collections Librarian Ellen Sexton and Special Collections Librarian Ellen Belcher, and with support from other departments and offices at John Jay, a curated selection of materials from the Archives will be available in the Digital Collections.

Teaching with the Digital Collections

With more material available, the Digital Collections will be of high interest to researchers and fans of history—and also for teaching faculty. These rich online resources are an engaging and relevant gateway for students learning how to conduct research using primary sources. As the Library saw recently in the Murder Mystery Challenge, students can find great satisfaction diving into historical materials both gruesome (murder scene photographs) and enlightening (court case records). These materials give students the chance to grapple with the complexity and ambiguity of the historical record. Moreover, research today requires advanced digital literacy skills, and the Library strongly supports incorporating digital research in classroom assignments.

Technical details

The chosen content management system is CollectiveAccess, installed and customized in consultation with Openflows. CollectiveAccess provides robust search and browsing functionalities with a focus on thorough metadata. The Digital Collections will mirror the Special Collections, with each physical collection manifested as one digital collection. Many items will be freely downloadable, following the Library’s commitment to public knowledge.

Stay tuned

The Library is working daily to improve the system and load in more material. We plan to launch next semester, so keep an eye out for the launch announcement! (Librarians & archivists in NYC, I’ll be presenting the Digital Collections in beta at METROcon in January 2014.)

April 25, 2013

Great digital collections site from Duke

My favorite digital collections site is from Duke University Libraries: library.duke.edu/digitalcollections/ It has a beautiful design, and it’s so easy to dive into the content.

Home page

Item detail page

I usually hate seeing social media on digital collections pages, because how disheartening to see 0 likes, 0 tweets, 0 comments on most of your items. But for promoted collections, like their Historic American Sheet Music, I was surprised and impressed to see some community participation.

I wonder how they assign subject headings. Some look like LCSH (abundance of dashes) and some seem more like tags. The subjects link to other items in their catalog, which is useful, although it takes you out of the Digital Collections site experience.

According to their About page, the application was built in-house using Django and a good number of tools and widgets. (I usually use BuiltWith to get a behind-the-scenes peek at a site’s architecture.)

Well done, Duke!

Emerging Tech in Libraries