What did I do this year? 2015–16 edition

word cloud

I jump-start my annual self-evaluation process with a low-level text analysis of my work log, which is essentially composed of “done” and “to do” bullet points. I normalized the text (e.g. emailed to email and Digital Collections to digitalcollections), removed personal names, and ran all the “done” items through Wordle.
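
The normalization step is just a handful of find-and-replace rules. Here’s a minimal Python sketch of the idea; the file name, the “done” prefix convention, and all but the first two substitutions are made-up placeholders:

```python
import re

# Collapse variant phrasings into a single token so Wordle counts them
# together. The first two pairs come from the post; the third is a
# hypothetical example.
REPLACEMENTS = [
    (r"\bemailed\b", "email"),
    (r"\bdigital collections\b", "digitalcollections"),
    (r"\breference desk\b", "refdesk"),  # hypothetical
]

# Personal names to strip out (placeholders).
NAMES = ["Alice", "Bob"]

def normalize(line: str) -> str:
    for pattern, replacement in REPLACEMENTS:
        line = re.sub(pattern, replacement, line, flags=re.IGNORECASE)
    for name in NAMES:
        line = line.replace(name, "")
    return line

# Keep only the "done" items, normalized, ready to paste into Wordle.
with open("worklog.txt") as f:
    done = [normalize(line) for line in f if line.lstrip().startswith("done")]

print("".join(done))
```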

2015–16 was my fourth year in my job and the fourth time I did this. (See 2012–13, 2013–14, and 2014–15). I do this because it can be difficult to remember what I was up to many months ago. It’s also a basic visualization of where my time is spent. The more an item is mentioned, the more days I worked on it for at least an hour or so. (Which may be misleading — I think I spent more hours on teaching and prep for teaching, but because I staffed the ref desk for a short shift on more days, “refdesk” appears larger.)

What did I do at my job this year?

  • teaching: I taught a library instruction session (“one-shot”) in 16 on-campus classes and 5 online classes. I taught twice as many classes as I did last year.
  • embedded: For each online class, I was embedded in the Blackboard course for a full week, posting material and answering questions in the forum. (See my post about this.)
  • socialmedia: I post things on Twitter and Facebook, along with several colleagues. I run the Instagram account all by me onesy.
  • bssl: I published four columns in Behavioral & Social Sciences Librarian. (See all.)
  • murdermystery: I ran this super-fun activity in the spring semester and began preparing for a summer session.
  • refdesk: I staffed the Reference Desk for 105 hours.
  • chat: I staffed chat and SMS for 77 hours. (I replaced “chatted with X” with “meeting” to disambiguate.)
  • drupal, digitalcollections, and onesearch: Worked on web stuff that runs the library websites.

I also emailed a lot and had a lot of meetings. What’s also interesting is how often the word prep appears. In the work log it’s usually tied to teaching, and I did teach double the number of classes I taught last year. But I think it also reflects an improvement in my time management skills!

I am also trying really hard to carve out more time for reading. At work, I mostly read articles and blogs related to the intersection of technology and library practices, with a healthy dose of DH and privacy activism.

Invisible spam pages on our website: how we locked out a hacker

TL;DR: A hacker uploaded a fake JPG file containing PHP code that generated “invisible” spam blog posts on our website. To avoid this happening to you, block inactive accounts in Drupal and monitor Google Search Console reports.

I noticed something odd on the library website the other day: a search of our site displayed a ton of spam in the Google Custom Search Engine (CSE) results.

google CSE spam

But when I clicked on the links for those supposed blog posts, I’d get a 404 Page Not Found error. These spammy blog posts didn’t seem to exist anywhere except in search results. At first I thought this was some kind of fake-URL generation visible only in the CSE (similar to fake referral URLs in Analytics), but regular Google also showed these spammy blog posts as being on our site when I searched for an exact title.

spam results on google after searching for exact spam title

Still, Google was “seeing” these blog posts that kept returning 404 errors. When I looked at the cached page, however, I saw that Google had indexed what looked like an actual page on our site, complete with the menu options.

cached page displaying spam text next to actual site text

Cloaked URLs

Not knowing much more, I had to assume that there were two versions of these spam blog posts: the ones humans saw when they clicked on a link, and the ones that Google saw when its bots indexed the page. After some light research, I found that this is called “cloaking.” Google does not like this, and I eventually received an email from Webmaster Tools with the subject “Hacked content detected.”
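
In its simplest form, cloaking just branches on the User-Agent header of the request. We hadn’t yet seen what our hacker’s code looked like (more on that below), but the logic amounts to something like this sketch, written in Python for readability rather than the PHP it actually ran in; the paths and bot names are illustrative:

```python
SPAM_HTML = "<html>...spam post pulled from another site...</html>"

def handle_request(path: str, user_agent: str) -> tuple[int, str]:
    """Serve spam to search-engine crawlers, a 404 to everyone else."""
    crawlers = ("Googlebot", "bingbot", "Slurp")
    if path.startswith("/blogs/"):  # the fake URLs used /blogs/, not our real /blog/
        if any(bot in user_agent for bot in crawlers):
            return 200, SPAM_HTML      # crawler: here's a "real" page to index
        return 404, "Page Not Found"   # human visitor: nothing to see here
    return 200, "normal site content"

# The same URL, requested by a crawler and then by a person:
print(handle_request("/blogs/spam-post", "Googlebot/2.1"))  # (200, spam)
print(handle_request("/blogs/spam-post", "Mozilla/5.0"))    # (404, 'Page Not Found')
```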

It was at this point that we alerted the IT department at our college to let them know there was a problem and that we were working on it (we run our own servers).

Finding the point of entry

Now I had to figure out whether content was actually being injected into our site. Nothing about the website looked different, and Drupal did not list any new pages, but someone was posting invisible content, purely to show up in Google’s search results and build some kind of network of spam content. Another suspicious thing: these URLs contained /blogs/, but our actual blog posts have URLs with /blog/, suggesting spoofed content. In Drupal, I looked at all the reports and logs I could find. Under the People menu, I noticed that a week earlier, someone had signed in to the site with the username of a former consultant who hadn’t worked on the site in two years.

Inactive account had signed in 1 week, 4 days ago

Yikes. So it looks like someone had hacked into an old, inactive admin account. I emailed our consultant and asked if they’d happened to sign in, and they replied Nope, and added that they didn’t even like Nikes. Hmm.

So I blocked that account, as well as accounts that hadn’t been used within the past year. I also reset everyone’s passwords and recommended they follow my tips for building a memorable and hard-to-hack password.

Clues from Google Search Console

The spammy content was still online. Just as I was investigating the problem, I got this mysterious message in my inbox from Google Search Console (SC). Background: in SC, site owners can set preferences for how their site appears in Google search results and track things like how many other websites link to their site. There’s no ability to change the site’s content; it’s mostly a monitoring tool.

reconsideration request from google

I didn’t write that reconsideration request. Neither did our webmaster, Mandy, or anybody who would have access to the Search Console. Lo and behold, the hacker had claimed site ownership in the Search Console:

madlife520 is listed as a site owner in google search console

Now our hacker had a name: Madlife520. (Cool username, bro!) And they’d signed up for SC, probably because they wanted stats for how well their spam posts were doing and to reassure Google that the content was legit.

But Search Console wouldn’t let me un-verify Madlife520 as a site owner. To become a verified site owner, you upload a special HTML file Google provides to your website, the idea being that only a true site owner would be able to do that.

google alert: cannot un-verify as HTML file is still there. FTP client window: HTML file is NOT there.

But here’s where I felt truly crazy. Google said Madlife520’s verification file was still online. But we couldn’t find it! The only verification file was mine (ending in c12.html, not fd1.html). Another invisible file. What was going on? Why couldn’t we see what Google could see?

Finding malicious code

Geng, our whip-smart systems manager, did a full-text search of the files on our server and found the text string google4a4…fd1.html in the contents of a JPG file in …/private/default_images/. Yep: not the actual HTML file itself, but a line inside a JPG file. Files in /private/ are usually images uploaded to our slideshow, or syllabi that professors send through our schedule-a-class webform — files submitted through Drupal, not uploaded directly to the server.

So it looks like this: Madlife520 had logged into Drupal with an inactive account and uploaded a text file with a .JPG extension to a module or form (not sure where yet). This text file contained PHP code dictating that if Google or another search engine requested the URL of one of these spam blog posts, the site would serve up spammy content from another website; if a person clicked on that URL, it would display a 404 Page Not Found page. Moreover, this PHP code spoofed the Google Search Console verification file, making Google think it was there when it actually wasn’t. All of this was done very subtly — aside from the weird search results, nothing on the site looked or felt different, probably in the hope that we wouldn’t notice anything unusual and the spam could stay up for as long as possible.
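
One practical takeaway: a file’s extension says nothing about its contents. A crude version of the scan Geng ran, sketched in Python (the upload directory is a placeholder), just walks the upload directories and flags any “image” containing a PHP opening tag:

```python
import os

# Image files have no business containing a PHP opening tag.
IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".gif")
UPLOAD_DIR = "/var/www/sites/default/files/private"  # placeholder path

for root, _dirs, files in os.walk(UPLOAD_DIR):
    for name in files:
        if name.lower().endswith(IMAGE_EXTS):
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                if b"<?php" in f.read():
                    print("suspicious:", path)
```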

Steps taken to lock out the hacker

Geng saved a local copy of the PHP code, then deleted it from the server. He also made the subdirectory the files were in read-only. Mandy, our webmaster, installed the Honeypot module in Drupal, which adds an invisible “URL: ___” field to all webforms; humans never see it, but bots keep filling it in, so their submissions are rejected before they can ever successfully log in or submit a form. That should also help foil password-cracking software. On my end, I blocked all inactive Drupal accounts, reset all passwords, unverified Madlife520 from Search Console, and blocked IPs that had attempted to access our site a suspiciously high number of times (oddly, these IPs all fell within a block located in the Netherlands).
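
The honeypot trick itself is simple enough to describe in a few lines. This is not the module’s actual code, just a sketch of the principle: hide a field from humans with CSS, then reject any submission where it arrives filled in.

```python
def is_bot_submission(form_data: dict) -> bool:
    """Flag submissions where the invisible "url" field was filled in.

    Humans never see the field (it's hidden with CSS), so any value in it
    almost certainly came from an automated form-filler.
    """
    return bool(form_data.get("url", "").strip())

# A bot dutifully fills in every field it finds; a person leaves it blank.
print(is_bot_submission({"name": "Madlife520", "url": "http://spam.example"}))  # True
print(is_bot_submission({"name": "A. Student", "url": ""}))                     # False
```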

At this point, Google is still suspicious of our site:

"This site may be hacked" warning beneath Lloyd Sealy Library search result

But I submitted a Reconsideration Request through Search Console — this time, actually written by me.

And it seems that the spammy content is no longer accessible; we’re also seeing far fewer clicks on links to our site than before we took these actions.

marked increase, then decrease in clicked links to our site

I’m happy that we were able to curb the spam and (we hope) lock out the hacker in just over a week, all during winter break when our legitimate traffic is low. We’re continuing to monitor all the pulse points of our site, since we don’t know for sure there isn’t other malicious code somewhere.

I posted this in case someone, somewhere, is in their office on a Friday at 5pm, frantically googling invisible posts drupal spam urls 404??? like I was. If you are, good luck!

Update to ‘Using Instagram for your library’

library instagram

Heads up: I revised ‘Using Instagram for your library’ to add 4 more tips. We’re heavy Instagram users now at Lloyd Sealy Library: we post 2+ times a week when school is in session; we geotag and hashtag each post; we know the other IGers on campus; and we take the time to like/comment on other organizations’ posts, and even students’ posts, with the result of gaining followers and goodwill.

Since that post was originally written, we gained 5x the number of followers we originally had. (From 40 to 200. Small potatoes, compared to our students, but still.)

Moreover, I informally surveyed freshmen throughout the last semester. All of them are on Instagram all the time. And all of them laughed when I asked if they used Facebook. (They don’t.) So IG is where it’s at.

Rethinking what I do as an emerging tech librarian

I get asked often what it is I do, exactly, and I still don’t have my elevator pitch down pat. I usually choose to view that as a good thing, because I value the freedom to explore new territories and direct my own projects. However, a few recent reads have given me pause, and I’m rethinking my approach to my job as the new semester looms.

But first, a general look: what are other emerging tech librarians doing?

Duties & skills

At the end of April, IFLA published a paper by Tara Radniecki titled “Study on emerging technologies librarians: how a new library position and its competencies are evolving to meet the technology and information needs of libraries and their patrons.” It’s a quantitative attempt to answer the question of what people with that title do, know, and wish they knew. Radniecki compares job ads to survey responses with interesting results. A few insights from her paper, which is certainly worth a read, interspersed with my unsolicited personal opinions:

  • 41.8% of job ads cite reference as a duty; 72% of librarians reported doing reference work. Similarly, 38.8% of job ads cite info literacy and instruction as a duty; 61% of librarians reported doing instructional work.
    • This aligns with my own experience as an ETL. I never took a reference or instruction class for my MLIS, assuming that I wouldn’t be doing traditional librarianship. Although my job description didn’t mention reference or instruction, I’m at the reference desk a few hours a week, and in fact I really value it — otherwise I might not work with students face-to-face at all, a terrible position to be in when designing systems and interfaces for them.
  • 23.9% of job ads required skills/experience in social media, web 2.0, and outreach.
    • I’m surprised it’s not a higher percentage. I’m of the opinion that all librarians ought to be familiar with social media. Look, this time last year, I would have rolled my eyes at this — it’s such an old buzzword by now — and yet I’ve seen too many academic people/departments fail to grasp online social etiquette.
  • 94% of librarians surveyed said they needed additional skills in computer programming and coding, versus only 16.4% of job ads requiring those skills and 10.4% preferring them.
    • Yes. I came into my position with years of web experience but minimal programming skills. One thing I’ve enjoyed is the impetus to get up to speed on web tech, Python, and code libraries I might not have otherwise had reason to, but it can be rough playing catch-up so much of the time.

Still, looking at numbers and general duties, it’s hard to see what ETLs do. As for me, some of my projects are the usual deliverables — design/upgrade/maintain library website, for example — and others are more playful, like tinkering with Arduino. (You can see a quick summary of my first year’s projects here.) So what exactly are others up to? I suppose that’s what discussion groups, online networks, and excitable conversations are for. And in fact, CUNY librarians, the first Emerging Tech Committee meeting of the year is coming up! 

Tech criticism

“Librarian Shipwreck” wrote a long-ish response to Radniecki’s paper in a July blog post entitled, “Will Technological Critique Emerge with Emerging Technology Librarians?” (h/t Patrick Williams). The post is pretty prickly, and sometimes unfair, and its metaphors abound — but here’s one choice quote that I’m 100% behind (after the first clause):

But for the ETL to have any value other than as another soon-to-crumble-levy against the technological tide, what is essential is that the position develop a critical stance towards technology. A stance that will benefit libraries, those libraries serve, and hopefully have an effect on the larger societal conversation about technology. […] A critical stance towards technology allows for librarians to act as good stewards not only of public resources but also of public trust.

This summer has thrown the imperative of technology criticism into sharp relief for me. At the EMDA Institute, we spent a few days examining how EEBO came to be, perhaps best described in an excellent upcoming JASIST article by Bonnie Mak, “Archaeology of a Digitization.” As a room well-populated with early modern scholars, we thought we knew what it was, but after talking about its invisible labor (TCP), obscured histories (war, microfilm), and problematic remediations, we were struck by how EEBO is just one relatively benign instance of a black box we might use daily. What does it take to really know where your resource comes from, and why it is the way it is?

Moreover, this summer of surveillance leaks has exposed our collective ignorance of, and heightened our anxieties about, the technologies we rely on every day. I suspect few of us are changing our web habits at all, despite the furor. I talk about it daily, yet I struggle to understand what’s really going on — let alone how I should respond professionally. Originally, I viewed my function at John Jay as a maker, creator, and connector of technologies. But technology is never neutral, and I’m starting to see pause and critique as part of my charge, too.

Digital literacy

When our digital world works, it’s beautiful. When it doesn’t, it’s a black box — a black hole — that can frustrate conspicuously or break invisibly. How can we, for instance, impart to students the level of digital literacy required to understand how a personalized search engine works and how it might fail them, when it so frictionlessly serves up top results they’ll use without scrolling below the fold? Marc Scott penned a popular July blog post titled “Kids can’t use computers… and this is why it should worry you.” I won’t summarize — you should read it. Twice.

Librarians have taught students information literacy for a long time. While knowing how to tell a scholarly resource from a dud is essential, understanding systems is now equally necessary, but harder to teach — and harder to grasp, too.

______________________

So there’s where I am: trying to understand the ethics of being an emerging technologies librarian; balancing making things and delivering on projects; prioritizing instructing students, colleagues, and peers about things I struggle to comprehend or explain; simultaneously expecting myself to embody a defensive paranoia and a wild exploratory spirit.

Algorithms and academic research

Last Friday, I attended Computers and Crowds: Unexpected Authors and Their Impact on Scholarly Research at the CUNY Graduate School of Journalism, an excellent event organized by the LACUNY Emerging Tech Committee, LILAC, and OLS.

My notes are available as a very messy PDF of my scribbles made with Paper (iPad app). Another version of the presentation slides by Paul Zenke and Kate Peterson is also online under the title “Black Hats, Farms, and Bubbles.”

Impressions, connections, and resolutions:

  • What’s a filter bubble? As a web algorithm learns what you are more interested in, you are given more of what you tend to like. It’s a positive feedback loop. The downside of this is that you get less exposure to material that makes you uncomfortable or challenges your preconceptions/politics.
  • One of the library’s roles may be providing a place of neutrality. We can better provide neutral information for our users by installing tools that increase user privacy and decrease tracking, especially if these might be inconvenient or undesirable to use at home.
  • Some practices to protect yourself and your students from unwanted tracking:
    • clear your history and cookies regularly
    • use ad blocking software
    • see who’s tracking you using Collusion (Chrome & Firefox plugin)
    • use private browsing
    • understand how to de-personalize your Google search results
    • try out alternatives like Duck Duck Go
  • Challenge students to evaluate not just the resource, but also the algorithms that led them there.
    • Why might one article rise to the top of the results list using Google or an academic database?
    • How would they design a system to recommend material to a friend?
  • Challenge yourself to understand and compare these algorithms and filters. Do the leg work and the research to ensure you’re providing your students with acceptable platforms for information hunting, consumption, and creation.
    • For example, if you use Primo, familiarize yourself with ScholarRank
  • Algorithm-created content is already here. Narrative Science is hugely successful. NLP and text mining are changing journalism and are on their way to changing academic writing as well.
    • Algorithmically created essays might be the next cheating trend. I have heard of online education programs (MOOCs, probably) asking students for a portfolio of past writing so that stylometric analysis can ascertain whether later submissions are really theirs. (A toy version of this idea is sketched after this list.)
    • See also: “The Great Automatic Grammatizator,” a prescient short story by Roald Dahl
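
On that stylometry point, here is a toy illustration of how such verification could work, assuming the common approach of comparing function-word frequencies (a sketch, not any program’s actual method):

```python
from collections import Counter
import math

# Function words are frequent, topic-independent, and hard to fake,
# which makes them a classic stylometric fingerprint.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "was", "it", "for"]

def profile(text: str) -> list[float]:
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    total = len(words) or 1
    counts = Counter(words)
    return [counts[w] / total for w in FUNCTION_WORDS]

def similarity(a: str, b: str) -> float:
    """Cosine similarity between two profiles (1.0 = identical style)."""
    pa, pb = profile(a), profile(b)
    dot = sum(x * y for x, y in zip(pa, pb))
    norm = math.sqrt(sum(x * x for x in pa)) * math.sqrt(sum(x * x for x in pb))
    return dot / norm if norm else 0.0

# Compare a student's past portfolio against a new submission; a score
# much lower than the portfolio's internal consistency raises a flag.
portfolio = "the results of the study suggest that it was the case that ..."
submission = "in this paper it is argued that the evidence points to ..."
print(similarity(portfolio, submission))
```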