Reimagining Libraries In The Digital Era: Lessons From Data Mining The Internet Archive
As the digital revolution fundamentally
reshapes how we live our lives, libraries are grappling with how to reinvent
themselves in a world in which they are no longer the primary gatekeepers to
knowledge. As I wrote in 2014 for the Knight Foundation’s blog, “perhaps the future
of libraries lies in a return to their roots, not as museums of physical
artifacts for rental, but as conveners of information and those who can
understand and translate that information to the needs of an innovative world.”
As the Knight Foundation wraps up its most recent Challenge on reinventing
libraries for the 21st century (which has attracted over 225 submissions to date) and as
the nation prepares for a new Librarian of Congress to shepherd the
organization into the digital era, what might the future of libraries look
like?
In terms of physical space, libraries are increasingly shifting away from serving as physical repositories of knowledge artifacts, replacing endless rows of shelving units and open stacks with open floor plans designed
for collaboration and work spaces. Free Internet access, maker spaces,
robotics, technical classes and an increased focus on event programming are
transforming libraries into 21st century community centers bringing people
together in a data rich world. Yet, at the same time, libraries still hold the
vast wealth of knowledge built up by human civilization over the millennia,
much of which has not yet been digitized and is now being locked away in cold
storage, more inaccessible than ever. In an era where a growing fraction of digital information is commercially owned and controlled, libraries play a critical role in democratizing access to knowledge and in ensuring a vibrant community of practitioners capable of bringing this knowledge to bear on societal needs.
In 2014 I posed the question “What if we could
bring scholars, citizens and journalists together, along with computers,
digitization and ‘big data’ to reimagine libraries as centers of information
innovation that help us make sense of the oceans of data confronting society
today?” Reflecting back on three years of collaborating with the Internet Archive, my own experiences working with such a digital-first library offer a number of insights into the future of libraries as data-driven centers of innovation.
I first interacted with the Archive in 2008 while researching the landscape of book digitization to compare the commercial efforts of Google Books with the library-driven Open Content Alliance. In the process, I pushed the Archive to publish more of its code and technical designs openly and to collaborate with the open source community, both so that others could better understand the nuances of the Archive’s designs and so that they could help it improve and evolve those designs. Two years
later at a Library of Congress meeting on archiving the then-growing world of
citizen journalism, a key question facing the archival community was how to
tractably archive the vast blogosphere given its incredible growth and the
difficulty of tracking new blogs as they come and go. At the time I noted that a small set of companies accounted for the majority of the hosting platforms and tools used to publish blogs, and I suggested collaborating with those companies to receive a streaming feed of the URLs of new blog posts on those platforms as they are published. Today such feeds provide a key data stream to the Archive’s web crawlers.
In 2012 I gave the opening keynote address of the General
Assembly of the International Internet Preservation Consortium (IIPC) at the
Library of Congress where I made the case that web archiving initiatives should
find ways of making their content available to scholars for academic research.
Beyond simply preserving the web for future generations, library-based web
archives like the Internet Archive’s Wayback Machine
offer researchers one of the few places they can work with large collections of
web content. While not as extensive as the collections held by commercial
search engines, library archives are accessible to scholars who lack the
resources to build their own massive web-scale crawling infrastructures and
uniquely allow the exploration of change over time at the scale of the web
itself.
In 2013 I began my first major analytic
collaboration with the Internet Archive, creating an interactive visualization of the
geography of the Archive’s Knight Foundation-supported Television News Archive
to explore what parts of the world Americans hear about when they turn on their
televisions, followed by an interactive search tool
comparing coverage of different keywords. This involved applying sophisticated
data mining algorithms to the closed captioning stream of each broadcast.
Anticipating the needs of data miners, the Archive had built its systems in
such a way that television was treated not as a collection of MPEG files gathering
digital dust on a file server, but as a machine-friendly analytic environment,
with closed captioning available as standard ASCII text and audio and video streams accessible in similarly analysis-friendly formats.
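To make that concrete, here is a minimal sketch of the kind of processing such machine-friendly captioning enables: tallying mentions of a handful of country names across a day’s worth of caption text files. The directory layout and keyword list are illustrative stand-ins, not the Archive’s actual formats or the production pipeline behind the visualization.

```python
# A minimal sketch: counting country mentions in plain-text caption files.
# The directory layout and keyword list are hypothetical stand-ins.
import glob
import re
from collections import Counter

COUNTRIES = ["Iraq", "Syria", "China", "Russia", "Mexico"]

def country_mentions(caption_dir):
    counts = Counter()
    for path in glob.glob(f"{caption_dir}/*.txt"):
        with open(path, encoding="ascii", errors="replace") as f:
            text = f.read()
        for country in COUNTRIES:
            # Whole-word, case-insensitive matching of each country name.
            counts[country] += len(re.findall(rf"\b{country}\b", text, re.IGNORECASE))
    return counts

print(country_mentions("captions/2013-10-01"))
```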
Enabling secure access to this
material required the development of what is today called the Archive’s “Virtual Reading Room” that
allows secure cloud-based data mining of the Archive’s collections. Much like a
traditional library archive reading room, in which patrons arrive, consult
material of interest, and leave only with their notes, the Virtual Reading Room
allows trusted researchers to bring data mining code to execute directly on the Archive’s servers, with only computed metadata leaving the premises; the original television content itself never leaves the Archive’s infrastructure. Such a model offers a powerful vision for how commercial data providers could afford secure data mining access to their vast holdings.
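The essence of that contract is easy to sketch: the analysis runs where the content lives, and only small derived statistics are written to a results file that the researcher takes away. The code below is a hypothetical illustration of that pattern, not the Archive’s actual implementation; the paths and metadata fields are invented.

```python
# A hypothetical illustration of the "only metadata leaves" contract:
# the analysis runs against content stored on the archive's own servers
# and writes out only small derived statistics, never the content itself.
import csv
import glob
import os

def summarize_broadcasts(caption_dir, results_path):
    with open(results_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["broadcast_id", "word_count", "mentions_of_congress"])
        for path in glob.glob(f"{caption_dir}/*.txt"):
            with open(path, encoding="utf-8", errors="replace") as f:
                text = f.read()
            writer.writerow([
                os.path.basename(path),           # an identifier, not the content
                len(text.split()),                # derived statistic
                text.lower().count("congress"),   # derived statistic
            ])

# Only the results file of derived statistics would leave the reading room.
summarize_broadcasts("/archive/tv/captions", "results.csv")
```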
This was followed by an effort to reimagine
the concept of the book. Instead of thinking of books as containers of words,
what if books were thought of as the world’s largest collection of imagery? Using
the Internet Archive’s public collection of more than 600 million pages of
public domain books dating back 500 years and contributed by over 1,000
libraries worldwide, the OCR output of each book was used to locate and extract every image on every page, and those images were gradually uploaded, along with their surrounding text, to Flickr.
The Archive’s machine-friendly books search
engine, which can output CSV lists of all matching books optimized for
bulk download, its ZIP API that makes it possible to instantly extract single
page images, its inclusion of the raw Abbyy OCR XML file for each book, and its
public interface that allows bulk download of its holdings, made it possible to
do all this as a side project in my personal time. Specifically, the Archive’s
decision to maximize the openness of its holdings and optimize all of its pages
for machine access through JSON output, standardized interfaces, and friendly
search engine output and APIs makes it possible for data miners like me to use its holdings in creative ways. Indeed, often the best thing a library can do is to make its holdings as accessible as possible and then step out of the way, letting its community find new applications for all of that data.
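To give a flavor of what that machine friendliness looks like in practice, here is a minimal sketch of querying the Archive’s public search API for book identifiers. The endpoint and parameters reflect the public advancedsearch interface as I understand it; the query string and field choices are simply illustrative.

```python
# A minimal sketch of querying the Internet Archive's public search API
# for item identifiers. The query and fields shown are only examples.
import requests

def list_books(query, rows=50):
    params = {
        "q": query,                      # e.g. a collection or keyword query
        "fl[]": ["identifier", "year"],  # fields to return for each item
        "rows": rows,
        "page": 1,
        "output": "json",
    }
    resp = requests.get("https://archive.org/advancedsearch.php", params=params)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

for doc in list_books("collection:americana AND mediatype:texts"):
    print(doc.get("identifier"), doc.get("year"))
```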
Following in the footsteps of the television
mapping project, in 2014 I assessed more than 2,200
emotions from each broadcast, measuring complex emotions like the intensity of
“anxiety.” Such dimensions offer a fascinating look at American society, capturing events like the 2013 US Government shutdown in stark detail. Last month this was upgraded to a live feed that updates daily, making it possible to track changes in geography, topical distribution, emotional undercurrents and more.
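Conceptually the underlying computation is straightforward, even though the real analysis draws on thousands of published dictionaries: score each broadcast’s caption text against an emotion lexicon. The tiny word list below is a made-up stand-in used only to show the shape of the calculation.

```python
# A toy illustration of lexicon-based emotion scoring over caption text.
# The word list is a made-up stand-in for the large published dictionaries
# that real analyses rely on; it only shows the shape of the computation.
import re

ANXIETY_WORDS = {"afraid", "worried", "nervous", "panic", "fear", "anxious"}

def anxiety_intensity(caption_text):
    """Return the fraction of caption words that match the lexicon."""
    words = re.findall(r"[a-z']+", caption_text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in ANXIETY_WORDS)
    return hits / len(words)

print(anxiety_intensity("Markets are worried and investors fear another shutdown."))
```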
In 2014 I also worked with the Archive to
explore what it would look like to enable mass data mining over its more than
460 billion archived web pages. This involved overcoming a number of technical hurdles in dealing with such a large dataset and creating a technical blueprint and workflow for others to follow. The final analysis examined the more than 1.7
billion PDF files preserved by the Archive since 1996, coupling them with
JSTOR, DTIC and several other collections to data mine more than 21 billion
words of academic literature codifying the scholarly
knowledge of Africa and the Middle East over the last 70 years. Beyond being the first large-scale socio-cultural analysis of a web archive, it has had a very real-world impact, pioneering the use of large-scale data mining for socio-cultural research and helping incorporate an understanding of non-Western cultures into development efforts.
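For a sense of how one even begins to locate PDFs inside a web archive, the sketch below queries the Wayback Machine’s public CDX index API for archived PDF captures under a single domain. The endpoint and parameters are as I understand the public API; the example domain, date range and result limit are arbitrary, and a bulk analysis of billions of files of course works from the underlying index and archive files rather than one domain at a time.

```python
# A small sketch of enumerating archived PDF captures via the Wayback
# Machine's public CDX index API. The example domain and limits are arbitrary.
import requests

def archived_pdfs(domain, limit=100):
    params = {
        "url": f"{domain}/*",                  # all captures under this domain
        "output": "json",
        "filter": "mimetype:application/pdf",  # keep only PDF captures
        "from": "1996",
        "to": "2014",
        "limit": limit,
    }
    resp = requests.get("http://web.archive.org/cdx/search/cdx", params=params)
    resp.raise_for_status()
    rows = resp.json()
    if not rows:
        return []
    header, captures = rows[0], rows[1:]       # first row lists the field names
    return [dict(zip(header, row)) for row in captures]

for capture in archived_pdfs("un.org")[:10]:
    print(capture["timestamp"], capture["original"])
```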
As more and more academic institutions and
non-profit organizations operate their own specialized crawling infrastructures
focused on specific subsets of the web, library-based archives are in a unique position to collaboratively leverage the feeds of URLs generated by these projects to massively extend their reach. In August 2014 the GDELT Project joined the Internet Archive’s “No More 404” program, providing the Archive with a daily list of all of the global online news articles GDELT monitors. Over just the last ten months of 2015, GDELT provided the Archive with the URLs of more than 200 million news articles, many of them representing the journalism of the non-Western world. More than 667,000 articles from across the world about Nepal and the 2015 earthquake were preserved through February of this year, including a quarter-million published in a language other than English, with the most common language being Nepali, archiving local perspectives and news as the country rebuilds.
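The mechanics of such a handoff are simple enough to sketch: each day’s list of article URLs is deduplicated against what has already been submitted before being passed along for crawling. The file names and formats below are invented for illustration and are not GDELT’s or the Archive’s actual interfaces.

```python
# A hypothetical sketch of a daily URL handoff: deduplicate today's list of
# article URLs against those already submitted before queueing the remainder.
# File names and formats here are invented for illustration.

def new_urls(todays_list_path, already_submitted_path):
    with open(already_submitted_path, encoding="utf-8") as f:
        seen = {line.strip() for line in f if line.strip()}
    with open(todays_list_path, encoding="utf-8") as f:
        todays = [line.strip() for line in f if line.strip()]
    return [url for url in todays if url not in seen]

fresh = new_urls("gdelt_urls_today.txt", "urls_already_submitted.txt")
print(f"{len(fresh)} new URLs to hand off to the crawler")
```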
As mass-scale web crawling becomes
increasingly accessible, such partnerships could provide a powerful opportunity
for web archives to extend their reach beyond their own in-house crawling
infrastructures, especially deepening their collection of highly specialized
content.
Last year I used Google’s cloud to process more than 3.5
million books totaling the complete English-language public domain holdings of
the Internet Archive and HathiTrust dating from 1800 to 2015. This was used to
publish the first comprehensive comparison of the two
collections, with a number of fundamental findings regarding how the
data we use influence the findings we derive. Perhaps most powerfully, an accompanying animation shows the impact of copyright on data mining. In the United
States, most books published after 1922 are still protected by copyright and
thus cannot be digitized, leaving them behind the digital revolution and
creating a paradox whereby we have more information on what happened in 1891
than we do about 1951.
Most recently, given the intense interest in media
coverage of the 2016 election cycle, I have been collaborating extensively with
the Internet Archive around visualizing how the more than 100 American
television stations it monitors are covering the various candidates. The Candidate Tracker
visualization, which counts how many times each candidate is mentioned on
television daily, has become a de facto standard, powering analyses from the
Atlantic to the Daily Mail, the LA Times to the New York Times and the
Washington Post to Wired. Over time this has been supplemented with a range of
additional visualizations like word clouds and ngram CSV
files. Working with the Archive’s own staff, who have been applying audio fingerprinting to their television streams, I created a series of visualizations of the early debates, allowing viewers to see which sound bites went viral in the television sphere, and then applied Google’s Cloud Vision API deep learning technology to analyze the imagery of campaign advertisements monitored by the Archive.
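At its core, the Candidate Tracker rests on a very simple computation: count how often each candidate’s name appears in each day’s caption text and append the daily tallies to a CSV for visualization. The sketch below shows that shape; the caption text, candidate list and output file are illustrative stand-ins rather than the production pipeline.

```python
# A simplified sketch of the mention counting behind a candidate tracker:
# tally how many times each candidate's name appears in a day's caption text
# and append the counts to a CSV. Inputs and outputs here are illustrative.
import csv
from collections import Counter

CANDIDATES = ["Clinton", "Trump", "Sanders", "Cruz"]

def count_mentions(caption_text):
    counts = Counter()
    for name in CANDIDATES:
        counts[name] = caption_text.count(name)
    return counts

def write_daily_csv(date, caption_text, path="candidate_mentions.csv"):
    counts = count_mentions(caption_text)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for name, n in counts.items():
            writer.writerow([date, name, n])

write_daily_csv("2016-03-01", "Trump and Clinton traded attacks as Sanders rallied supporters.")
```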
These campaign visualizations are a fascinating example of how libraries can collaborate with scholars to use their holdings to provide open-access analyses of highly timely and nationally important issues for the public good.
Finally, as the Internet Archive begins
looking towards creating full text search of portions of its web holdings,
there are countless opportunities for it to collaborate with the research and
visualization communities to think beyond simple keyword search towards the
unique user interface problems of how to present all of that information in a
meaningful way. Offering basic keyword search of very large text archives is a well-understood problem in an era where Google alone maintains a 100-petabyte search index comprising over 30 trillion rows and 200 variables, growing at a rate of tens of billions of rows per day and returning searches in less than an eighth of a second. The real challenge is how to assess relevance in a temporal index and how to visualize search results that reflect how a page changes over time. Rather than try to build such systems entirely in-house, as
libraries increasingly open their digital collections to access, they should
reach out to the open source and research communities to think outside the box
for creative solutions to these challenges.
As my collaborations with the Internet Archive over the last three years illustrate, mining its web, books and television holdings, the digital era offers an incredible opportunity for libraries to reinvent themselves as data-rich centers of innovation that unlock the vast wealth of societal knowledge they hold and that bring together scholars, citizens and journalists to use all of this data to reimagine how we understand our global world.
Source | http://www.forbes.com/
Regards
Pralhad Jadhav
Senior Librarian
Khaitan & Co
Upcoming Event | National Conference on Future Librarianship: Innovation for Excellence (NCFL 2016), April 22-23, 2016.
Note | If anybody uses this post for forwarding on social media or for coverage in a newsletter, please give due credit to those who are taking efforts for the same.