Unlocking the Potential of Digital Archives
Last night Jim Mussell posted an excellent review of the British Newspaper Archive on his blog. He makes a number of really important points that I skated over in my own review. I recommend reading Jim’s post in its entirety. However, one of his arguments is particularly worth emphasizing:
This leads me to my second point: the way brightsolid have digitized this material also restricts possible uses. This is a resource for finding articles, not reading newspapers, and this is done by brightsolid’s search engine and database on the user’s behalf. There is no scope here for data mining, for analysis of textual transcripts, or for the interrogation of metadata. This actually runs counter to the dominant trend within both the digital humanities and commercial digital publishing, making BNA seem a little old fashioned. Gale Cengage’s NCCO, for instance, allows users to carry out rudimentary data mining. This is no mere moan about the way the project was executed. Taking advantage of the digital properties of digitized materials is the way in which we learn new things about them. Locking the data away means that users are stuck with old methodologies, treating the articles as if they were printed paper even though they clearly aren’t….
… There is no chance for any of this content to enter digital culture, becoming recontextualized as it interacts with other content; instead, it is trapped within the interface, pretending that it is paper, so users can read articles, one after the other. On these terms, it must be said, the BNA is excellent (and let me repeat, the page viewer is one of the best I have seen); but as a resource that contributes to the UK economy, scholarship, or even one that helps us learn more about nineteenth-century print culture, it is limited.
I can’t even begin to stress how important this is. The practical benefits of digitisation are well recognised. Improvements in speed, access, volume, and convenience are routinely celebrated. When asked to describe how digital archives have changed their lives, many historians highlight the fact that they no longer have to visit the British Library whenever they want to consult a newspaper. Others rejoice that their lives are no longer blighted by malfunctioning microfilm readers. Keyword search engines are widely recognised as a time-saving device; a handy tool which helps researchers to find material more quickly than they could by hand. So far, in other words, digitisation has largely been treated as a practical revolution – it has made research faster, easier, more convenient, and more productive.
These practical improvements are welcome, but digitisation is capable of so much more. It has the potential not just to change the day-to-day practice of research, but to fundamentally alter the kind of research that we are able to do. Used creatively, it allows us to access and explore past cultures and societies in powerful new ways; to ask new questions, make new connections, construct new arguments, explore new topics, and re-examine old ones from new perspectives. It allows us to imagine new kinds of research.
In order to unlock these new methodological possibilities we need to be able to take full advantage of what Jim terms the “digital properties of digitized materials”. Researchers in the digital humanities have already started to do this with other archives of nineteenth-century print culture. Dan Cohen and Fred Gibbs have been text mining millions of titles of nineteenth-century books in order to explore changes in the Victorian frame of mind. A team of Harvard scientists have recently given this particular brand of the Digital Humanities the name of ‘culturomics’. In their study, they text-mined a corpus of 5 million digitised books and quantified the evolution of grammar, the speed at which society forgets its past, the adoption of new technologies, the effects of censorship, and the changing nature of fame. Best of all, this project inspired the creation of Google’s Ngram Viewer – a publicly accessible tool for plotting the frequency of words in the Google Books archive.
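To give a flavour of what this kind of text mining involves, here is a minimal sketch of the idea behind plotting word frequencies over time. It is not how the Ngram Viewer or any of the projects above are actually implemented; the `yearly_frequency` function, the crude tokeniser, and the three-sentence sample corpus are all hypothetical, invented purely for illustration.

```python
import re
from collections import Counter


def yearly_frequency(documents, term):
    """Return {year: share of that year's words that match `term`}.

    `documents` is an iterable of (year, text) pairs. Tokenisation is
    deliberately crude: lowercase runs of letters and apostrophes.
    """
    term = term.lower()
    hits = Counter()    # occurrences of the term per year
    totals = Counter()  # total word count per year
    for year, text in documents:
        tokens = re.findall(r"[a-z']+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(term)
    # Relative frequency, so years with more text are not over-weighted.
    return {
        year: hits[year] / totals[year]
        for year in sorted(totals)
        if totals[year]
    }


# Hypothetical sample corpus, invented for this example.
sample = [
    (1850, "The railway opened today and the railway company rejoiced"),
    (1850, "A quiet day in the market"),
    (1880, "The telephone arrived while the railway ran late"),
]

freqs = yearly_frequency(sample, "railway")
for year, share in freqs.items():
    print(year, round(share, 3))
```

Dividing by the total word count for each year matters: without it, a rise in the raw count of a word might simply reflect the fact that more text survives from later decades. The important point is that even this toy version needs the underlying transcripts, not just a search box.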
This research is still in its embryonic stage, but it hints at future possibilities. Unfortunately, we are currently unable to interrogate nineteenth-century newspaper archives with the same freedom and creativity. The raw materials are all in place – sources have been digitised and marked up with usable metadata – but the interfaces don’t allow us to ask the right questions. They’re designed for one, very basic form of digital research: keyword searches that lead to close reading.
If we want to do anything more ambitious, we need to design new interfaces. Recent projects like Connected Histories and Locating London’s Past are great examples of how this can work. Both websites allow researchers to explore existing archives in new ways. It is now possible, for example, to plot cases from the Old Bailey Online archive onto an eighteenth-century map of London.
This is where the key problem with the BNA arises. By giving control of the archive to brightsolid and allowing them to put it behind a paywall, the British Library have prevented researchers from developing similarly innovative new ways of exploring its data. Without the freedom to develop new interfaces, we lose the power to frame new questions. Without the power to frame new questions, we won’t be able to find new answers. The potential of digitisation to reveal new insights into the past will be squandered.
The good news is that it’s not too late to fix these problems. The data is there to be reused, if its ‘owners’ will allow us. As Jim argues:
One can only hope that the British Library does not now consider this material ‘done’, It is essential that they recognize that this is one possible implementation, one possible representation of this content amongst many others, and so should be open to other uses of the data – whether transcripts, page images, or metadata – that might come along in the future.