Monday, March 21, 2016

Reimagining Libraries In The Digital Era: Lessons From Data Mining The Internet Archive

As the digital revolution fundamentally reshapes how we live our lives, libraries are grappling with how to reinvent themselves in a world in which they are no longer the primary gatekeepers of knowledge. As I wrote in 2014 for the Knight Foundation’s blog, “perhaps the future of libraries lies in a return to their roots, not as museums of physical artifacts for rental, but as conveners of information and those who can understand and translate that information to the needs of an innovative world.” As the Knight Foundation wraps up its most recent Challenge on reinventing libraries for the 21st century (which has attracted over 225 submissions to date) and as the nation prepares for a new Librarian of Congress to shepherd the organization into the digital era, what might the future of libraries look like?

In terms of physical space, libraries are increasingly shifting away from being physical repositories of knowledge artifacts, replacing endless rows of shelving units and open stacks with open floor plans designed for collaboration and work spaces. Free Internet access, maker spaces, robotics, technical classes and an increased focus on event programming are transforming libraries into 21st century community centers that bring people together in a data-rich world. Yet, at the same time, libraries still hold the vast wealth of knowledge built up by human civilization over the millennia, much of which has not yet been digitized and is now being locked away in cold storage, more inaccessible than ever. In an era when a growing fraction of digital information is commercially owned and controlled, libraries play a critical role in democratizing access to knowledge and ensuring a vibrant community of practitioners capable of bringing this knowledge to bear on societal needs.

In 2014 I posed the question, “What if we could bring scholars, citizens and journalists together, along with computers, digitization and ‘big data’ to reimagine libraries as centers of information innovation that help us make sense of the oceans of data confronting society today?” Looking back on three years of collaborating with the Internet Archive, my experiences working with such a digital-first library offer a number of insights into the future of libraries as data-driven centers of innovation.

I first interacted with the Archive in 2008, when researching the landscape of book digitization to compare the commercial efforts of Google Books with the library-driven Open Content Alliance. At the time I pushed the Archive to publish more of its code and technical designs openly and to collaborate with the open source community, both so that others could better understand the nuances of the Archive’s designs and so that they could help it improve and evolve those designs. Two years later, at a Library of Congress meeting on archiving the then-growing world of citizen journalism, a key question facing the archival community was how to tractably archive the vast blogosphere given its incredible growth and the difficulty of tracking new blogs as they come and go. I noted that a small set of companies accounted for the majority of the hosting platforms and tools used to publish blogs, and suggested collaborating with those companies to receive a streaming API of the URLs of new blog posts on those platforms as they are published. Today that feed provides a key data stream to the Archive’s web crawlers.
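To make the idea concrete, here is a minimal sketch of how such a feed might be consumed on the archive side. The endpoint and JSON fields are hypothetical placeholders, not the Archive's or any platform's actual interface; the point is simply that a publisher-provided stream of new URLs can be polled and handed straight to a crawl queue.

```python
# Minimal sketch: consume a (hypothetical) stream of newly published blog post
# URLs and hand each one to a crawl queue. The endpoint and JSON fields are
# illustrative placeholders, not a real platform API.
import json
import queue
import urllib.request

FEED_URL = "https://example-blog-platform.test/api/new-posts"  # hypothetical

def poll_new_posts(crawl_queue: "queue.Queue[str]") -> int:
    """Fetch the latest batch of post URLs and enqueue them for crawling."""
    with urllib.request.urlopen(FEED_URL) as resp:
        posts = json.load(resp)          # assumed: a JSON list of {"url": ...}
    for post in posts:
        crawl_queue.put(post["url"])
    return len(posts)

if __name__ == "__main__":
    q: "queue.Queue[str]" = queue.Queue()
    print(f"Enqueued {poll_new_posts(q)} new post URLs for archiving")
```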

In 2012 I gave the opening keynote address of the General Assembly of the International Internet Preservation Consortium (IIPC) at the Library of Congress where I made the case that web archiving initiatives should find ways of making their content available to scholars for academic research. Beyond simply preserving the web for future generations, library-based web archives like the Internet Archive’s Wayback Machine offer researchers one of the few places they can work with large collections of web content. While not as extensive as the collections held by commercial search engines, library archives are accessible to scholars who lack the resources to build their own massive web-scale crawling infrastructures and uniquely allow the exploration of change over time at the scale of the web itself.

In 2013 I began my first major analytic collaboration with the Internet Archive, creating an interactive visualization of the geography of the Archive’s Knight Foundation-supported Television News Archive to explore what parts of the world Americans hear about when they turn on their televisions, followed by an interactive search tool comparing coverage of different keywords. This involved applying sophisticated data mining algorithms to the closed captioning stream of each broadcast. Anticipating the needs of data miners, the Archive had built its systems in such a way that television was treated not as a collection of MPEG files gathering digital dust on a file server, but as a machine-friendly analytic environment, with closed captioning available as standard ASCII text and audio and video streams accessible in similar analytic-friendly formats.
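The general idea behind mining captioning for geography is easy to sketch, even if the real pipeline is far more sophisticated. The toy example below, which is purely illustrative and not GDELT's or the Archive's actual geocoder, scans the plain-text captioning of a broadcast against a small gazetteer of place names and tallies how often each location is mentioned.

```python
# Minimal sketch: count place-name mentions in a broadcast's closed captioning.
# The gazetteer and caption text are illustrative; real geocoding requires far
# more sophisticated disambiguation than simple string matching.
import re
from collections import Counter

GAZETTEER = {"nepal", "kathmandu", "syria", "ukraine", "baghdad"}  # toy list

def place_mentions(caption_text: str) -> Counter:
    """Tally mentions of known place names in plain-text closed captioning."""
    tokens = re.findall(r"[a-z]+", caption_text.lower())
    return Counter(tok for tok in tokens if tok in GAZETTEER)

if __name__ == "__main__":
    sample = "Rescue teams reached Kathmandu today as Nepal assessed the damage."
    print(place_mentions(sample))  # Counter({'kathmandu': 1, 'nepal': 1})
```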

Enabling secure access to this material required the development of what is today called the Archive’s “Virtual Reading Room,” which allows secure cloud-based data mining of the Archive’s collections. Much like a traditional library reading room, in which patrons arrive, consult material of interest, and leave only with their notes, the Virtual Reading Room allows trusted researchers to bring data mining code to execute directly on the Archive’s servers, with only computed metadata leaving the premises; the original television content never leaves the Archive’s servers. Such a model offers a powerful template for how commercial data providers could offer secure data mining access to their own vast holdings.
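The essential pattern is simple to illustrate, even though the sketch below is a conceptual stand-in rather than the Archive's actual implementation: the researcher's analysis code runs where the content lives, and only small, derived metadata records are written to an export area permitted to leave.

```python
# Minimal sketch of the "reading room" pattern: analysis code runs on the
# server that holds the content; only derived, aggregate metadata is written
# to an export area that may leave the premises. Paths are illustrative.
import csv
import pathlib

CONTENT_DIR = pathlib.Path("/archive/tv/captions")   # never leaves the server
EXPORT_PATH = pathlib.Path("/export/metadata.csv")   # the only thing that does

def word_count(path: pathlib.Path) -> int:
    """A stand-in for the researcher's analysis of one broadcast."""
    return len(path.read_text(errors="ignore").split())

def run_reading_room_job() -> None:
    with EXPORT_PATH.open("w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["broadcast", "word_count"])       # metadata only
        for caption_file in sorted(CONTENT_DIR.glob("*.txt")):
            writer.writerow([caption_file.stem, word_count(caption_file)])

if __name__ == "__main__":
    run_reading_room_job()
```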

This was followed by an effort to reimagine the concept of the book. Instead of thinking of books as containers of words, what if books were thought of as the world’s largest collection of imagery? Using the Internet Archive’s public collection of more than 600 million pages of public domain books dating back 500 years and contributed by over 1,000 libraries worldwide, the OCR of each book was used to extract every image from every page and gradually upload them all, along with their surrounding text, to Flickr.

The Archive’s machine-friendly books search engine, which can output CSV lists of all matching books optimized for bulk download, its ZIP API that makes it possible to instantly extract single page images, its inclusion of the raw Abbyy OCR XML file for each book, and its public interface that allows bulk download of its holdings made it possible to do all of this as a side project in my personal time. Specifically, the Archive’s decision to maximize the openness of its holdings and optimize all of its pages for machine access, through JSON output, standardized interfaces, and machine-friendly search results and APIs, makes it possible for data miners like myself to use its holdings in creative ways. Indeed, often the best thing a library can do is make its holdings as accessible as possible and then step out of the way to let its community find new applications for all of that data.
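As one small example of this machine friendliness, the Archive's advanced search endpoint can return results as JSON. The sketch below, whose query and field choices are my own illustration rather than the exact workflow used for the Flickr project, lists the identifiers of a few matching book items; consult the Archive's API documentation for current parameters and usage etiquette.

```python
# Minimal sketch: query the Internet Archive's advanced search endpoint for
# book identifiers as JSON. The query and field list are illustrative only.
import json
import urllib.parse
import urllib.request

def search_books(keyword: str, rows: int = 10) -> list[str]:
    params = urllib.parse.urlencode({
        "q": f"{keyword} AND mediatype:texts",  # restrict to book-like items
        "fl[]": "identifier",
        "rows": rows,
        "page": 1,
        "output": "json",
    })
    url = f"https://archive.org/advancedsearch.php?{params}"
    with urllib.request.urlopen(url) as resp:
        docs = json.load(resp)["response"]["docs"]
    return [doc["identifier"] for doc in docs]

if __name__ == "__main__":
    for identifier in search_books("ornithology"):
        print(f"https://archive.org/details/{identifier}")
```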

Following in the footsteps of the television mapping project, in 2014 I assessed more than 2,200 emotional dimensions in each broadcast, measuring complex emotions like the intensity of “anxiety.” Such dimensions offer a fascinating look at American society, capturing events like the 2013 US Government shutdown in stark detail. Last month this was upgraded to a live feed that updates each day, making it possible to track changes in geography, topical distribution, emotional undercurrents and more on a daily basis.

In 2014 I also worked with the Archive to explore what it would look like to enable mass data mining over its more than 460 billion archived web pages. This involved overcoming a number of technical hurdles in dealing with such a large dataset and creating a technical blueprint and workflow for others to follow. The final analysis examined the more than 1.7 billion PDF files preserved by the Archive since 1996, coupling them with JSTOR, DTIC and several other collections to data mine more than 21 billion words of academic literature codifying scholarly knowledge of Africa and the Middle East over the last 70 years. Beyond being the first large-scale socio-cultural analysis of a web archive, it has also had a very real-world impact, pioneering the application of large-scale data mining to socio-cultural research and helping incorporate an understanding of non-Western cultures into development efforts.

As more and more academic institutions and non-profit organizations operate their own specialized crawling infrastructures focused on specific subsets of the web, library-based archives are in a unique position to collaboratively leverage the feeds of URLs generated by these projects and massively extend their reach. In August 2014 the GDELT Project joined the Internet Archive’s “No More 404” program, providing it with a daily list of all the global online news articles it monitors. Over just the last ten months of 2015, GDELT provided the Archive with the URLs of more than 200 million news articles, many of them representing the journalism of the non-Western world. More than 667,000 articles from across the world about Nepal and the 2015 earthquake were preserved through February of this year, including a quarter-million published in a language other than English (the most common being Nepali), archiving local perspectives and news as the country rebuilds.
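At its core, such a partnership is a disciplined exchange of URL lists. The sketch below is hypothetical in its file names and layout, and is not the actual GDELT or "No More 404" exchange format; it simply shows the kind of deduplicated daily list a monitoring project might hand to an archive's crawlers.

```python
# Minimal sketch: build a deduplicated daily URL list from a monitoring
# project's output for hand-off to an archive's crawlers. File names and
# layout are hypothetical, not the actual GDELT / "No More 404" exchange.
import datetime
import pathlib

def write_daily_url_list(seen_urls: set[str], out_dir: pathlib.Path) -> pathlib.Path:
    """Write one URL per line, sorted, stamped with today's date."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"urls-{datetime.date.today():%Y%m%d}.txt"
    out_path.write_text("\n".join(sorted(seen_urls)) + "\n", encoding="utf-8")
    return out_path

if __name__ == "__main__":
    urls = {
        "http://example-news.test/nepal-earthquake-anniversary",
        "http://example-news.test/kathmandu-rebuilding",
    }
    print("Wrote", write_daily_url_list(urls, pathlib.Path("daily_feeds")))
```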

As mass-scale web crawling becomes increasingly accessible, such partnerships could provide a powerful opportunity for web archives to extend their reach beyond their own in-house crawling infrastructures, especially deepening their collection of highly specialized content.

Last year I used Google’s cloud to process more than 3.5 million books comprising the complete English-language public domain holdings of the Internet Archive and HathiTrust dating from 1800 to 2015. This was used to publish the first comprehensive comparison of the two collections, with a number of fundamental findings regarding how the data we use influences the findings we derive. Perhaps most powerfully, the accompanying animation shows the impact of copyright on data mining: in the United States, most books published after 1922 are still protected by copyright and thus cannot be digitized, leaving them behind in the digital revolution and creating a paradox whereby we have more information about what happened in 1891 than about 1951.

Finally, given the intense interest in media coverage of the 2016 election cycle, I have been collaborating extensively with the Internet Archive to visualize how the more than 100 American television stations it monitors are covering the various candidates. The Candidate Tracker visualization, which counts how many times each candidate is mentioned on television each day, has become a de facto standard, powering analyses from the Atlantic to the Daily Mail, the LA Times to the New York Times, and the Washington Post to Wired. Over time this has been supplemented with a range of additional visualizations, such as word clouds and ngram CSV files. Working with the Archive’s own staff, who have been applying audio fingerprinting to their television streams, I created a series of visualizations of the early debates, allowing viewers to see which sound bites went viral in the television sphere, and most recently applied Google’s Cloud Vision API deep learning technology to analyze the imagery of the campaign advertisements monitored by the Archive.
These campaign visualizations are a fascinating example of how libraries can collaborate with scholars to use their holdings to provide open-access analyses, for the public good, of highly timely and nationally important issues.
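As a rough sketch of what that last step looks like, the snippet below sends a single video key frame to the Cloud Vision API for label detection using the google-cloud-vision Python client. The frame path is a placeholder and the actual advertisement analysis involved considerably more steps; this is only meant to show the basic API call.

```python
# Minimal sketch: run Cloud Vision label detection on a single video key frame.
# Requires the google-cloud-vision client library and Google Cloud credentials;
# the frame path is a placeholder, and the real ad analysis involved more steps.
from google.cloud import vision

def label_frame(frame_path: str) -> list[tuple[str, float]]:
    client = vision.ImageAnnotatorClient()
    with open(frame_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.label_detection(image=image)
    return [(label.description, label.score) for label in response.label_annotations]

if __name__ == "__main__":
    for description, score in label_frame("campaign_ad_frame.jpg"):
        print(f"{description}: {score:.2f}")
```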

Looking ahead, as the Internet Archive begins looking towards creating full text search of portions of its web holdings, there are countless opportunities for it to collaborate with the research and visualization communities to think beyond simple keyword search towards the unique user interface problems of how to present all of that information in a meaningful way. Offering basic keyword search over very large text archives is a well understood problem in an era where Google alone maintains a 100 petabyte search index comprising over 30 trillion rows and 200 variables, growing at a rate of tens of billions of rows per day and returning searches in less than an eighth of a second. The real challenge is how to assess relevance in a temporal index and how to visualize search results that reflect how a page changes over time. Rather than trying to build such systems entirely in-house, as libraries increasingly open their digital collections to access, they should reach out to the open source and research communities to think outside the box for creative solutions to these challenges.
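As a toy illustration only, and not a description of anything the Archive has built, one could imagine weighting an ordinary keyword match by how much a page changed between successive captures, so that a temporal index surfaces the moments when a page was actually evolving rather than every identical snapshot.

```python
# Toy sketch of a temporal relevance heuristic: weight a capture's keyword
# match count by how much its text differs from the previous capture. Purely
# illustrative; not a description of any deployed archive search system.
import difflib

def temporal_score(query: str, current_text: str, previous_text: str) -> float:
    matches = current_text.lower().count(query.lower())
    # 0.0 = identical to the previous capture, 1.0 = completely rewritten
    change = 1.0 - difflib.SequenceMatcher(None, previous_text, current_text).ratio()
    return matches * (0.25 + change)   # never zero out stable pages entirely

if __name__ == "__main__":
    old = "The library announces new weekend hours."
    new = "The library announces new weekend hours and a maker space opening."
    print(round(temporal_score("maker space", new, old), 3))
```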

As my collaborations with the Internet Archive over the last three years illustrate, mining its web, books and television holdings, the digital era offers an incredible opportunity for libraries to reinvent themselves as data-rich nexuses of innovation, unlocking the vast wealth of societal knowledge they hold and bringing together scholars, citizens and journalists to use all of this data to reimagine how we understand our global world.

Source | http://www.forbes.com/

Regards

Pralhad Jadhav
Senior Librarian
Khaitan & Co

Upcoming Event | National Conference on Future Librarianship: Innovation for Excellence (NCFL 2016), April 22-23, 2016.

Note | If anybody uses these posts for forwarding on social media or for inclusion in a newsletter, please give due credit to those who are taking the effort to compile them.
