Jun 6 2012

First Look: Nineteenth Century Collections Online

It’s been nearly ten years since the launch of Eighteenth Century Collections Online [ECCO]. This ambitious project aimed to digitise “every significant English-language and foreign-language title printed in Great Britain during the eighteenth century, along with thousands of important works from the Americas.” The definition of a ‘significant’ text remains open to interpretation, but the contents of the archive are undeniably impressive – in its present form it contains more than 180,000 titles. The unparalleled breadth of its coverage – along with the number of university libraries that took up subscriptions – quickly established it as a key focal point for the researching and teaching of eighteenth-century history. In other words, it’s a tough act to follow.

Enter Nineteenth Century Collections Online [NCCO]. This recently launched project follows in the footsteps of its eighteenth-century predecessor and, in the words of its publisher Gale Cengage, aims to be “the most ambitious scholarly digitisation and publication program ever undertaken.” The archive will contain millions of pages of nineteenth-century books, periodicals, diaries, letters, manuscripts, photographs, government records, pamphlets, and maps. More interestingly, it promises researchers the opportunity to subject these sources to new forms of qualitative and quantitative analysis. I’ve spent the last few days playing around with a trial version and, whilst it’s too soon to write a full review, I have a few preliminary thoughts on how it’s shaping up.

Content

 

NCCO contains such an eclectic range of sources that it’s difficult at first to get a handle on all of its contents. In fact, it makes more sense to think of NCCO as a customisable research platform that houses a series of themed archives. By the looks of things, it’ll be possible for libraries to select which archives they want to subscribe to. At present, three archives are available, each of which contains a series of sub-collections:

  1. Asia and the West: Diplomacy and Cultural Exchange
    British Foreign Office correspondence on Japan; dispatches and records from U.S. consuls in various Asian territories; missionary correspondence and journals; periodicals on Asian culture and society.
  2. British Politics and Society
    British Cabinet Papers, 1880-1916; British Labour History Ephemera; British Trade Union History Collection; Civil Disturbance, Chartism and Riots in Nineteenth Century England; Colonial Defence Commission under Lord Carnarvon; Diaries of Sir Frederick Madden; Discontent and Authority, 1820-1840; Transactions of the Manchester Statistical Society I & II; Home Office papers, records, and correspondence; the Police Gazette, 1828-1845; Ordnance Survey Drawings, 1789-1840; Papers and Correspondence of Charles James Fox, 1749-1806; Papers of Sir Robert Peel; Working Class Autobiographies; papers relating to Radicalism, Anti-Radicalism and Reform, 1769-1861; ephemera relating to British social and working conditions, politics, and economics, 1770s-1850s; the papers of John Cam Hobhouse, 1809-1869; rare freethought militant 19th century books; rare radical and labour periodicals; letters relating to the Jack the Ripper killings; books, pamphlets, and periodicals relating to working-class politics.
  3. European Literature, 1790-1840: The Corvey Collection
    A collection of rare English (3,250 works), French (3,658) and German (2,653) Romantic-era writing.

A fourth collection, ‘British Theatre, Music, and Literature: High and Popular Culture’, will be released soon. It promises to contain a range of playbills, scripts, scores, and other pieces of theatrical ephemera. Presumably, if the product is successful, a steady stream of new archives will be announced over the coming years.

It’s hard to review such a disparate collection of items – historians of the period will each find different elements of the archives interesting. In general terms, the main thing to note is that these collections are more curated than many previous archives. Rather than digitise millions of pages of books and newspapers and then throw them together, the collections in NCCO are carefully compiled and well presented. There’s an impressive amount of background information provided for each archive, and brief summaries for most of the sub-collections too. Here’s what you’ll find if you access the Jack the Ripper letters from within the British Politics and Society collection:

There are pros and cons to using such a carefully curated archive. On the plus side, browsing through its contents is more user friendly – it’s much easier to casually meander through the archive when everything is clearly subdivided and signposted. The sub-collections should also make it easier for teachers to set more focused and manageable research tasks for undergraduate students. However, there are downsides to an archive in which all of the documents have been carefully picked out for their historical ‘significance’ and thematic relevance. Namely, the opportunity for new discoveries feels more limited. I’m sure that there are plenty of secrets still to be uncovered in NCCO’s collections, but browsing through its contents isn’t quite as exciting as exploring the ‘vast terra incognita of print’ that has been opened up in recent years by large-scale newspaper digitisation projects. Each visit to the British Library Newspaper Archive [BLNA] brings with it the promise of exploring virgin territory; it’s likely that many of the articles you’ll encounter haven’t been read since the day they were published. By comparison, NCCO’s collections feel like well-trodden ground. Of course, the ability to search these documents by keyword should lead to new discoveries, connections, and perspectives that weren’t available using conventional archives.

Interface

The methodological possibilities of any digital archive are determined in large part by its interface. I’ve always been a fan of Gale’s work in this area – compared to their competitors, their interfaces and search tools are usually faster and more user friendly. The British Library Newspaper Archive isn’t without its design faults, but its interface is quicker than similar databases by ProQuest and far more user-friendly than the disastrous efforts of UK Press Online. The BLNA, like the Times Digital Archive before it, was based on a relatively straightforward html interface which displayed its images as jpegs. This format allowed newspaper articles to load quickly and for users to save or copy them with a quick right-click of the mouse. It was simple, but it worked. In recent years, however, Gale has introduced a more high-tech, flash-based interface. Users of NewsVault and the Illustrated London News Digital Archive will already be familiar with the basic components of this new interface. Here’s how it looks:

 

It has some nifty new features – you can zoom in and out of an article more quickly (though not as smoothly as in the new British Newspaper Archive), alter brightness and contrast levels, rotate the image, view it in full-screen, and view separate sections of the source simultaneously by using the ‘split-screen’ feature. Newspaper articles are also displayed in their true context, with the rest of the page faded out slightly. The new interface lets you tag items (with both public and private keywords), create personal annotations and bookmarks, and export references to leading citation managers. The site is also compatible with Zotero – a welcome new feature that promises to make the organisation of primary research materials much easier. Unfortunately, the plugin just downloads the metadata for your chosen document and not the document itself.

Which leads us on to one of the problems with NCCO’s new interface. It’s no longer possible to right click an image and save it as a jpeg. Instead, you have to use the archive’s own ‘download’ button – a feature that only allows you to save the document in pdf format. If you want to copy it over to a PowerPoint presentation, you’ll have to convert this pdf into an image file yourself or, alternatively, capture it as a screenshot. It’s perfectly possible, but it’s a nuisance and represents a regrettable step backwards in terms of speed and efficiency. Fortunately, the quality of the downloads is good – far better than the near-unreadable articles provided via the download feature of the British Newspaper Archive. It’s also possible to download the raw OCR data as a txt file. Gale’s decision to reveal this information is very welcome, but in this instance the BNA’s solution is more elegant and its user-correction tool is more ambitious.

The other drawback of the flash interface is the space devoted to viewing documents. Put simply, the interface gets in the way. Here’s another screenshot. This time, I’ve shaded the interface red and left the area devoted to the document itself unshaded:

The first thing to note is the enormous amount of unused white space on either side of the archive’s main interface. I appreciate that not everybody has the luxury of using a 24″ widescreen monitor, but it’s a shame for this space to go unused when (as you can see) it’s not possible to see the entirety of the article in the small amount of space allotted to it. Contrast this interface with the old one used by the British Library Newspaper Archive:

Here, the whole screen is used and it’s possible (with a quick flick of my mouse’s scroll wheel) to view an entire newspaper page at once. The new interface certainly looks cleaner and more elegant, but this elegance comes at a cost. The most important thing about the database is the experience of browsing through its documents, but it currently feels like I’m looking at the world through a letterbox. This is particularly irritating when viewing newspapers. The full-screen feature provides a partial solution to this problem, but it’s a nuisance having to fire this up each time you want to view a document. For all of the powerful new search tools at our disposal, digital archives still require us to slog through hundreds (sometimes thousands) of potentially relevant sources before finding the ones that we need. In order to do this kind of research, it’s absolutely essential to be able to examine and rule out irrelevant documents quickly. If you’ve got to enter full-screen, tweak the zoom level, and scroll around a bit before making these decisions it eats up time – an extra five seconds fiddling with each document soon mounts up over the course of a day’s research. Fortunately, the search interface includes a ‘Keywords in Context’ feature that allows you to preview the appearance of your search terms before loading an item in full – again, however, the BNA’s solution of providing this contextual information by default (rather than after a mouse-click) is more elegant.

It’s hard to offer constructive solutions to these problems – flash interfaces provide us with some useful new tools, but I’ve yet to be convinced that the loss of speed and the cramped screen is worth it. A larger viewing area and a more fluid browsing experience would help to address some of the drawbacks.

 

Search Tools

NCCO’s search interface is typically powerful. As usual, it’s possible to select a number of different search types (Keyword, document title, entire document, etc) and limit searches by a range of additional properties. Gale’s peculiar decision to draw a distinction between ‘keyword’ and ‘entire document’ searches remains a problem – I’ve lost count of the number of experienced researchers who mistakenly thought that they were searching the entire British Library Newspaper Database only for me to point out that they’d only been searching for ‘keywords’ (the title of the article plus the first few sentences). Gale are alone in this idiosyncratic use of the term ‘keyword’ and their decision to persist with it presents a frustrating obstacle to new users. Aside from this, however, the number of options available through the advanced search interface is excellent.

For digital humanities enthusiasts like me, perhaps the most exciting thing about NCCO is its two new search tools. First up is the Graphing Tool. Put simply, this new tool allows you to enter a keyword, specify a date range, and then track how often it appears in the archive using a line graph. A search for the term ‘America’ is displayed below:

An image like this should be familiar to fans of Google’s ngram viewer – a freely accessible search tool that lets you track the changing frequency of word usage in the Google Books archive. Tracking this kind of information is an imprecise way to map cultural change, but a carefully constructed search can identify broad trends and help researchers to view topics from a new perspective – I make occasional use of them in my PhD thesis and discuss their methodological potential in a forthcoming article for the Journal of Victorian Culture. So, I was undeniably excited when I learned that Gale was introducing a similar tool. Unfortunately, the results are disappointing. The tool is undermined by a fundamental methodological flaw. Put simply, it doesn’t take account of the fluctuating number of documents in the archive. If there are 1 million pages available for one year, but 10 million pages available for the next, it doesn’t take a genius to recognise that most graphs will have an upwards trajectory. Google solves this problem by measuring results as a percentage of the total number of words – that way, it doesn’t matter whether the archive expands or contracts. Unfortunately, NCCO’s graphing tool just displays the raw number of articles and makes no attempt to normalise the data.
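To make the normalisation point concrete, here is a minimal sketch in Python. The yearly figures are invented purely for illustration, and the function name `relative_frequency` is my own; I have no knowledge of how NCCO or Google actually store or compute their counts.

```python
def relative_frequency(hits_by_year, totals_by_year):
    """Return each year's hits as a percentage of the material available that year."""
    normalised = {}
    for year, hits in hits_by_year.items():
        total = totals_by_year.get(year, 0)
        # Skip years with no material rather than divide by zero.
        if total > 0:
            normalised[year] = 100.0 * hits / total
    return normalised


# Invented figures: the archive grows tenfold between the two years.
hits = {1850: 500, 1851: 4000}
totals = {1850: 1_000_000, 1851: 10_000_000}

raw_change = hits[1851] - hits[1850]       # the raw count rises sharply
shares = relative_frequency(hits, totals)  # but the share of the archive falls

for year in sorted(shares):
    print(year, hits[year], f"{shares[year]:.3f}%")
```

With these made-up numbers the raw line climbs steeply while the term’s share of the archive actually drops, which is exactly the kind of misleading upward trajectory that an unnormalised graph produces.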

Fluctuations in genre are also a problem. If coverage for the 1850s is mostly made up of newspapers, but the 1860s is dominated by political pamphlets, it’s impossible to make valid comparisons. The obvious solution to this problem is to allow users to select their own documents to search. Unfortunately, the graphing tool has been detached from the advanced search interface and has far less flexibility when it comes to constructing a query. It’s possible to restrict searches to four broad content types (manuscripts, maps, monographs, and newspapers), but this isn’t subtle enough to create methodologically sound searches. In sum, the tool is an interesting way to visualise search results but isn’t particularly useful for serious quantitative research. It’s a missed opportunity but, if it could be fixed, NCCO would represent an interesting step forward for digital research methodologies.

The second new feature is the Term Clusters tool. This text-mining tool identifies linguistic patterns and connections between documents. The graph below shows a search for the term ‘humour’:

The inner ring shows the terms that frequently appear within the first 100 words of each item in the search results – so, articles about humour frequently feature the words ‘novel’, ‘good’, and the term ‘Yankee Humour’. The outer ring performs the search again (this time on an inner-ring term) and reveals a new set of connections – so, articles featuring the term ‘Yankee Humour’ are also likely to include the words ‘miss’, ‘doctor’, and ‘heir’. Excerpts from these articles are displayed to the right. I confess that I was a bit confused by this tool at first, but the more I play with it the more impressed I’ve become. It’s a great way to identify previously unseen patterns and connections between material. I’d love to apply this tool (and a modified version of the graphing tool) to the British Library Newspaper Archive – with any luck, we’ll see them integrated into NewsVault sooner rather than later.
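For anyone curious about the logic behind the two rings, here is a rough sketch in Python of how I imagine it working. This is guesswork based on the tool’s visible behaviour, not Gale’s actual method: the stopword list, the function names, and the crude substring matching are all my own assumptions.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; a real tool would use a far longer one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "was", "for", "on", "with"}


def top_terms(documents, query, n=3, window=100):
    """Return the n commonest words found in the first `window` words of each document."""
    counts = Counter()
    for text in documents:
        words = re.findall(r"[a-z']+", text.lower())[:window]
        # Count each word once per document so that long items don't dominate.
        for word in set(words):
            if word not in STOPWORDS and word != query:
                counts[word] += 1
    return [term for term, _ in counts.most_common(n)]


def cluster(documents, query, n=3):
    """Inner ring: terms for the query. Outer ring: the same search run on each inner term."""
    # Crude substring match; a real search engine would match whole words.
    matches = [d for d in documents if query in d.lower()]
    inner = top_terms(matches, query, n)
    outer = {}
    for term in inner:
        related = [d for d in documents if term in d.lower()]
        outer[term] = top_terms(related, term, n)
    return inner, outer


sample = [
    "Yankee humour in a good novel",
    "A good novel of humour",
    "Good humour",
    "The doctor wrote a novel",
]
inner, outer = cluster(sample, "humour", n=2)
print(inner)
print(outer)
```

Even this toy version shows why the outer ring can throw up surprises: the second search runs across the whole collection, so it pulls in documents that never mentioned the original term.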

 

Conclusions

So, all in all, there’s some good news and bad news here. The contents of the archive are interesting, eclectic and well curated. There’s plenty here for researchers to get stuck into and the sub-collections will provide some interesting teaching opportunities. The interface has a lot of useful new features, but the move from html to flash continues to result in a clunky and cramped browsing experience. The core search interface is excellent and the introduction of innovative new search tools is exciting. Term clusters are a particularly intriguing new addition to our armoury, but the graphing tool needs a bit more work before its full potential is fulfilled. It’s too early to tell whether NCCO will have the same impact as its eighteenth-century predecessor. It’s entering a far more crowded market place (the sheer volume of nineteenth-century material available in digital archives is already staggering) and doing so at a time when library budgets are contracting. However, there’s enough here to suggest that NCCO may well become the next leading digital platform for nineteenth-century research – if they iron out a few of the problems this wouldn’t be a bad thing.


Jan 10 2012

Unlocking the Potential of Digital Archives

Last night Jim Mussell posted an excellent review of the British Newspaper Archive on his blog. He makes a number of really important points that I skirted over in my own review. I recommend reading Jim’s post in its entirety. However, one of his arguments is particularly worth emphasizing:

 

This leads me to my second point: the way brightsolid have digitized this material also restricts possible uses. This is a resource for finding articles, not reading newspapers, and this is done by brightsolid’s search engine and database on the user’s behalf. There is no scope here for data mining, for analysis of textual transcripts, or for the interrogation of metadata. This actually runs counter to the dominant trend within both the digital humanities and commercial digital publishing, making BNA seem a little old fashioned. Gale Cengage’s NCCO, for instance, allows users to carry out rudimentary data mining. This is no mere moan about the way the project was executed. Taking advantage of the digital properties of digitized materials is the way in which we learn new things about them. Locking the data away means that users are stuck with old methodologies, treating the articles as if they were printed paper even though they clearly aren’t….

… There is no chance for any of this content to enter digital culture, becoming recontextualized as it interacts with other content; instead, it is trapped within the interface, pretending that it is paper, so users can read articles, one after the other. On these terms, it must be said, the BNA is excellent (and let me repeat, the page viewer is one of the best I have seen); but as a resource that contributes to the UK economy, scholarship, or even one that helps us learn more about nineteenth-century print culture, it is limited.

 

I can’t even begin to stress how important this is. The practical benefits of digitisation are well recognised. Improvements in speed, access, volume, and convenience are routinely celebrated. When asked to describe how digital archives have changed their lives, many historians highlight the fact that they no longer have to visit the British Library whenever they want to consult a newspaper. Others rejoice that their lives are no longer blighted by malfunctioning microfilm readers. Keyword search engines are widely recognised as a time-saving device; a handy tool which helps researchers to find material quicker than by hand. So far, in other words, digitisation has largely been treated as a practical revolution – it has made research faster, easier, more convenient, and more productive.

These practical improvements are welcome, but digitisation is capable of so much more. It has the potential not just to change the day-to-day practice of research, but to fundamentally alter the kind of research that we are able to do. Used creatively, it allows us to access and explore past cultures and societies in powerful new ways; to ask new questions, make new connections, construct new arguments, explore new topics, and re-examine old ones from new perspectives. It allows us to imagine new kinds of research.

In order to unlock these new methodological possibilities we need to be able to take full advantage of what Jim terms the “digital properties of digitized materials”. Researchers in the digital humanities have already started to do this with other archives of nineteenth-century print culture. Dan Cohen and Fred Gibbs have been text mining millions of titles of nineteenth-century books in order to explore changes in the Victorian frame of mind. A team of Harvard scientists have recently given this particular brand of the Digital Humanities the name of ‘culturomics’. In their study, they text-mined a corpus of 5 million digitised books and quantified the evolution of grammar, the speed at which society forgets its past, the adoption of new technologies, the effects of censorship, and the changing nature of fame. Best of all, this project inspired the creation of Google’s Ngram Viewer – a publicly accessible tool for plotting the frequency of words in the Google Books archive.

This research is still in its embryonic stage, but it hints at future possibilities. Unfortunately, we are currently unable to interrogate nineteenth-century newspaper archives with the same freedom and creativity. The raw materials are all in place – sources have been digitised and marked up with usable metadata – but the interfaces don’t allow us to ask the right questions. They’re designed for one, very basic form of digital research: keyword searches that lead to close reading.

If we want to do anything more ambitious, we need to design new interfaces. Recent projects like Connected Histories and Locating London’s Past are great examples of how this can work. Both websites allow researchers to explore existing archives in new ways. It is now possible, for example, to plot cases from the Old Bailey Online archive onto an 18th century map of London.

This is where the key problem with the BNA arises. By giving control of the archive to Brightsolid and allowing them to put it behind a paywall, the British Library have prevented researchers from developing similarly innovative new ways of exploring its data. Without the freedom to develop new interfaces, we lose the power to frame new questions. Without the power to frame new questions, we won’t be able to find new answers. The potential of digitisation to reveal new insights into the past will be squandered.

The good news is that it’s not too late to fix these problems. The data is there to be reused, if its ‘owners’ will allow us. As Jim argues:

One can only hope that the British Library does not now consider this material ‘done’. It is essential that they recognize that this is one possible implementation, one possible representation of this content amongst many others, and so should be open to other uses of the data – whether transcripts, page images, or metadata – that might come along in the future.

 

 


Dec 12 2011

British Newspaper Archive – changes to the ‘fair usage’ cap.

When the British Newspaper Archive was launched a few weeks back a lot of researchers were frustrated to discover that the ‘unlimited’ subscription package actually had a ‘fair use’ cap of 1000 page views per month. When I e-mailed the archive’s customer service team about it they informed me that the archive was intended for ‘personal use’ only and that the cap was non-negotiable. Fortunately, they seem to have had a slight change of heart. The ‘fair usage’ section of the archive’s terms & conditions has now been updated to read:

Why do we have a fair usage policy for subscribers? Well, it is certainly not a way to penalise or hold back our customers from conducting their personal research.

We have this in place purely for the (very rare) cases where people might abuse the service, and it is designed to keep the price of subscriptions as low as possible for our customers.

You are permitted to view an average of 1000 pages per month (calculated over a 3 month period). If you get close to the limit, we’ll send you an email to warn you. We always contact users to establish the reason for abnormally heavy use of the site and if they’re just doing their own personal research, we obviously don’t penalise them.

We constantly review the limit, based on average usage of the site by all users. We will continue to keep an eye on this and make adjustments as necessary.

Many services today (such as broadband packages) have similar fair usage policies and they work in the same way as ours i.e they are designed to catch those who use the service excessively (which would drive up the price or reduce the quality of service for the majority of users).

We hope this explains things – Please contact Customer Support if you have any further questions

This looks like good news. The three month average is definitely a welcome concession. It’s hard to interpret precisely what happens when you exceed the limit now – they seem to be suggesting that users will be contacted and exempted from the restrictions if they’re just using the archive for personal research. I’d still like to see how this works in practice before paying for an £80 subscription, but it looks like the problem has been resolved. Well done to all who complained about it and credit to the BNA for listening to our concerns.
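Here’s a quick sketch, in Python, of how I read the new three-month rule. It assumes a rolling window; the terms don’t say whether the BNA uses a rolling average or fixed quarters, and the function names and page-view figures are my own invention.

```python
def rolling_average(monthly_views, window=3):
    """Average page views over each run of `window` consecutive months."""
    return [
        sum(monthly_views[i:i + window]) / window
        for i in range(len(monthly_views) - window + 1)
    ]


def exceeds_cap(monthly_views, cap=1000, window=3):
    """True if any rolling average goes over the cap."""
    return any(avg > cap for avg in rolling_average(monthly_views, window))


# One heavy month followed by two quiet ones (invented figures).
views = [1800, 600, 500]
print(rolling_average(views))
print(exceeds_cap(views))            # new rule: averaged over three months
print(exceeds_cap(views, window=1))  # old rule: each month judged alone
```

On this reading, a single intensive month of research no longer trips the limit so long as the surrounding months are quieter, which is how most of us actually work.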

 


Dec 11 2011

BNA security problems – bad link to blame

If you clicked on any of the hotlinks in my review of the British Newspaper Archive you might have been taken to an address with “www1.” at the start. If you were also using IE or Firefox this might have resulted in your browser warning you about a security risk. It’s a false alarm; a minor glitch that stems from the addition of the “1” after “www”. The BNA have assured us that their website is completely secure and that the problem has now been resolved. I’ve fixed the links in my own review – if you’ve linked to the archive on your own blog it would be worth double checking to make sure that the address is correct.

Thanks to Charles Robinson for alerting me to the problem.


Dec 5 2011

Hit-term Highlighting: a half-baked solution

In my recent review of The British Newspaper Archive I moaned about the fact that ‘hit-term highlighting’ was mysteriously absent from its interface. Unlike every other archive on the market, the BNA doesn’t highlight your search term on the article image. Here’s how it works in other databases:

In this example, I performed a keyword search for the term ‘Victorian’. One of the articles it returned was this lengthy piece from the Liverpool Mercury. It’s 5616 words long. Fortunately, thanks to hit-term highlighting, I can just skip straight to the word shaded in green and read the part of the article that I’m interested in. A similar search on the BNA would require me to carefully read a column and a half of text in order to find the word I searched for. This really slows down the research process when you’ve got 500 articles to analyse.

With any luck, brightsolid will address this problem with an update to the BNA’s interface. This might take a while – in the meantime, there’s a temporary solution to the problem that should save us all a bit of time:

Step 1: perform a normal keyword search.

Step 2: open up an article.

Step 3: Click the ‘Show Article Text’ button at the top of the left hand menu. This reveals the raw OCR text sitting beneath your chosen article.

Step 4: Open your web browser’s ‘find’ tool. The quickest way to do this is to press ‘ctrl+f’.

Step 5: Type your keyword into the ‘find’ tool. This should highlight all instances of that word which appear on the page – including the place it appears in the raw OCR.

Step 6: Find and click your keyword in the raw OCR.

Step 7: This should place a thin black box around a line of the article image. Within this box, you’ll find your keyword.
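If you’d rather not hunt through the page by eye, the same trick can be sketched in a few lines of Python. This assumes you have copied the raw OCR text out of the ‘Show Article Text’ panel by hand and pasted it into a string; it doesn’t talk to the BNA in any way, and the function name and sample text are my own.

```python
def find_hits(ocr_text, keyword):
    """Return (line_number, line) pairs for every line that contains the keyword."""
    needle = keyword.lower()
    hits = []
    for number, line in enumerate(ocr_text.splitlines(), start=1):
        # Case-insensitive match, like the browser's 'find' tool.
        if needle in line.lower():
            hits.append((number, line.strip()))
    return hits


# Invented sample standing in for pasted OCR text.
pasted = """The train left the station at noon.
A SLEEPER had worked loose on the line.
Nobody was hurt in the delay."""

for number, line in find_hits(pasted, "sleeper"):
    print(number, line)
```

It only tells you which line of the OCR to click on, but that’s usually enough to jump straight to the right part of the article image.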

Here’s a video of me searching an article for the term ‘sleeper’:


Dec 4 2011

The Past Belongs to Brightsolid

On Friday night I had an illuminating Twitter conversation with Will Tattersdill (@faceometer) – a fellow researcher who shares some of my concerns about the new British Library Newspaper Archive. He pointed out an interesting passage in the archive’s terms and conditions:

What you can use the service for:

You can only use the website for your own personal non-commercial use e.g. to research newspaper archives and other archives featured on the website that you are interested in and to purchase goods that we may sell on the website. We are also happy for you to help out other people by telling them about the newspaper archives and other information available on the website and how and where they can be found. However, you must not provide them with copies of any of the newspapers (either an original image of the newspapers or the information on the results page), even if you provide them for free.

It’s easy to brush this off as a classic example of small-print gobbledegook – the kind of thing we all mindlessly agree to every time we’re forced to update iTunes. But, the more I think about it, the more astonishing this passage seems to be. Are they really suggesting that we can’t show copies of their digital newspapers to other people? Even worse, are they suggesting that we can’t even share the information contained within them? It’s one thing to prevent people from making a profit from these materials, but to try and prohibit us from sharing the fruits of our research with friends, colleagues, and students is truly remarkable. Perhaps I’m jumping the gun here, but does this mean I can’t describe the results of a search in an academic article? Am I prohibited from displaying a newspaper page via PowerPoint in an undergraduate lecture? By posting a screenshot of a (barely legible) article in my review, have I broken their terms and conditions?

I’m not sure. But it’s prompted me to ask an important question: who really owns this material? Almost all of the newspapers in the BNA are out of copyright and have been preserved by the British Library at the expense of the taxpayer. They belong to us, and we’re all free to copy and quote from them as much as we like. However, it seems that digitised newspapers are an entirely different story. When an out-of-copyright text is scanned, the resulting ‘digital object’ is subject to new copyright protection. More significantly, this copyright isn’t held by the original writers and publishers, but by the library or digitisation company that performs the scans. In legal terms, it seems that we’re not actually browsing the British Library’s newspaper archive but accessing brightsolid’s collection of digitised texts.

This might seem like a minor distinction, but it has important implications. The BNA is intended to replace Colindale as home of the nation’s historical newspaper collections. However, in order to fund this transition, the British Library has allowed a commercial publisher to assume ownership of the new archive’s contents. It’s up to this commercial company to determine how we access the archive and what we can do with its materials. The past has been privatised. This is brightsolid’s world now – we just live in it.

Edit: a few additional thoughts in the comments.


Dec 1 2011

Review: The British Newspaper Archive

 

Christmas arrived early for historians this week. On Tuesday morning, amid a blaze of publicity, the British Library unveiled the new home of its digitised newspaper collection – The British Newspaper Archive (BNA).

Developed in partnership with commercial publisher brightsolid, the BNA provides online access to hundreds of eighteenth, nineteenth and early-twentieth-century newspapers. It’s an ambitious, long-term project – more than 3 million pages have already been digitised and the library hopes to reach 40 million pages over the next decade. If the project is successful it’ll have important implications for both professional and amateur historians. In the next couple of years, the British Library intends to close its newspaper archive at Colindale and transfer its holdings to a remote storage facility in Boston Spa. Whilst hard copies of undigitised newspapers should still be accessible, it’s clear that the British Library wants more and more researchers to access its collections online. The BNA, in other words, is the shape of things to come – and it’s vitally important that the British Library gets it right.

 

Content

The archive currently provides access to 170 newspapers. Many of these papers were available in the 19th Century British Library Newspapers database and have been transferred directly into the new archive. Only the Penny Illustrated Paper (which always seemed slightly out of place in the previous database) has been omitted from the new collection. Unfortunately, this means that gaps in the original archive are still a problem – the Northern Echo, for example, still has content missing from the crucial period between 1871 and 1872 when W. T. Stead first took over as editor. On the plus side, glitches from the previous database have been solved. The Preston Chronicle, for example, is no longer incorrectly listed as the Preston Guardian.

The real strength of the archive lies in its new content. 100 new newspapers are now accessible for the first time – almost all of them provincial papers. A full list of these papers is available here. Highlights from the new collection include long runs of the Bath Chronicle (1760-1903), the Chelmsford Chronicle (1783-1882), the Leeds Times (1833-1901), the Manchester Evening News (1870-1903), the Northampton Mercury (1770-1903), the Worcestershire Chronicle (1838-1903), and the Yorkshire Gazette (1819-1899). The new database is far less London-centric than previous offerings – most areas of the country are now represented by at least one paper, and major cities like Manchester, Liverpool, Birmingham, and Sheffield have multiple titles.

Whilst the majority of this new content focuses on the nineteenth century, some papers stretch deep into the eighteenth and twentieth centuries. Twelve titles include at least a decade of issues from the eighteenth century, including the Birmingham Gazette which goes all the way back to 1741. Whilst numerous papers cover the first three years of the twentieth century, only four titles stretch beyond the first decade: The Cheltenham Looker-On (1913), The Motherwell Times (1924), the Nottingham Evening Post (1944), and the Western Times (1940). It’s extremely encouraging to see the British Library push beyond the boundary of 1900 – let’s hope that this is the first step towards bridging the ‘digital divide’ which has recently sprung up between 19th and 20th century history.

Perhaps the most exciting thing about the new database is the promise of more content. Unlike previous databases, which were updated in bulk every year or so, the holdings of the BNA are constantly being expanded. 8,000 new pages are supposedly being uploaded to the website every day. Unfortunately, it’s not possible to quickly see which papers have recently been added or updated – this makes it difficult to keep abreast of the archive’s changing contents. Nor, for that matter, does the British Library give any hints about which papers will be digitised in the future. There’s no way of knowing whether a publication that’s critical to your research will appear in the archive tomorrow morning or in 10 years’ time. There’s something exciting about this, I suppose. As the website itself points out, “who knows what you’ll find tomorrow, next week, next year, and beyond”. However, I suspect we’ll have to rethink and shore up our methodologies in order to build research projects on constantly shifting sands – more on this in a future blog post.


Search Engine

A good search engine is crucial to the success of a digital archive – the methodological possibilities of these resources are determined primarily by the questions they allow us to ask. The BNA has most of the tools we’ve come to expect from newspaper databases. Users can perform a basic ‘Search’ by inputting keywords into a single search box, or they can construct more complex queries using the ‘Advanced Search’ page. The ‘Advanced’ interface allows users to put keywords into four boxes:

 

 

  1. The first option searches for articles which include all inputted keywords. So, putting “America, Twain, New York” into the search box will find all articles which include these three terms somewhere in the text. Articles which only include the words ‘America’ and ‘Twain’ will not be found. For those of you familiar with Boolean searches, this is basically the equivalent of using ‘AND’.
  2. The second option searches for articles which include any (but not necessarily all) of the inputted keywords. So, this time a search for “America, Twain, New York” will find every article in which at least one of these terms appears. This will return articles containing the word ‘America’ which don’t feature ‘Twain’ or ‘New York’. In Boolean terms, this is a straightforward ‘OR’ search.
  3. The third option allows users to exclude articles containing certain keywords. So, we might search for articles featuring the word ‘Twain’ which do not contain a reference to ‘America’. In Boolean terms, this is the equivalent of ‘NOT’.
  4. Finally, the fourth box allows users to search for a complete phrase. This returns articles which feature keywords in a particular order and is broadly equivalent to enclosing an ordinary keyword search in quotation marks.
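For anyone who finds the logic easier to follow in code, here is a minimal Python sketch of how the four boxes behave. It is purely illustrative: I have no idea how the BNA's search engine is actually built, the function name `matches` and the sample articles are my own inventions, and I'm assuming that the four boxes are combined with 'AND' when more than one is filled in.

```python
def matches(text, all_of=(), any_of=(), none_of=(), phrase=None):
    """Return True if `text` satisfies the four 'Advanced Search' boxes.

    Hypothetical helper: simple case-insensitive substring matching,
    with the four boxes combined using AND (an assumption).
    """
    lowered = text.lower()

    def has(term):
        return term.lower() in lowered

    if not all(has(term) for term in all_of):          # box 1: AND
        return False
    if any_of and not any(has(term) for term in any_of):  # box 2: OR
        return False
    if any(has(term) for term in none_of):             # box 3: NOT
        return False
    if phrase is not None and not has(phrase):         # box 4: exact phrase
        return False
    return True


# Invented sample articles, for illustration only.
articles = [
    "Mark Twain arrived in New York from a tour of America.",
    "Twain lectured in London last night.",
    "Trade with America continues to grow.",
]

terms = ["America", "Twain", "New York"]
print([matches(a, all_of=terms) for a in articles])   # [True, False, False]
print([matches(a, any_of=terms) for a in articles])   # [True, True, True]
print([matches(a, all_of=["Twain"], none_of=["America"]) for a in articles])
# [False, True, False]
```

Only the first article contains all three terms, every article contains at least one of them, and only the second mentions 'Twain' without 'America'.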

These searches can then be filtered by:

  • Place of publication
  • Publication title
  • Date
  • Article type (Advertisement, Article, Family Notice, Illustrated, Miscellaneous)
  • Public tag – more on this later.

Search results can be ordered by either ‘relevance’ or ‘date’ – this makes a nice change from Gale databases which only display results in chronological order.

Once a search has been performed, results can be filtered again by date, title, region, country, place, article type, and ‘public tag’. Crucially, the public tag feature allows articles to be sorted by additional categories, including: classifieds, adverts, news, commerce, arts, sport, crime, etc. The accuracy of these tags (many of which seem to have been imported from the previous database) isn’t great, but they can be helpful when filtering out irrelevant articles.

 

All of this works fairly well – if anything, the search engine is faster and more user-friendly than in previous databases. Unfortunately, some functionality has been lost. Most importantly, ‘proximity operators’ are no longer available. In the previous database, a search for “Twain n10 America” returned all articles in which the words ‘Twain’ and ‘America’ appeared no more than 10 words apart. This was a tremendously useful way to filter out results in which keywords appeared too far apart – it saved a lot of time and opened up a range of interesting methodological possibilities. In my thesis, for example, I use proximity operators to track changes in the number of articles featuring the words ‘America’ and ‘Competition’ in close proximity. It would be tremendously useful if this essential tool was reinstated in the new archive.
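To show what is being lost, here is a rough Python sketch of a proximity test along the lines of “Twain n10 America”. The function name `within_n_words` is hypothetical, and I'm assuming that 'no more than 10 words apart' means the positions of the two words in the text differ by at most 10. The old database may well have counted slightly differently.

```python
import re


def within_n_words(text, first, second, n):
    """Return True if `first` and `second` occur within `n` words of each other.

    Hypothetical helper: distance is the difference between word positions,
    in either direction, after a crude lowercase tokenisation.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    first_positions = [i for i, tok in enumerate(tokens) if tok == first.lower()]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second.lower()]
    return any(abs(i - j) <= n for i in first_positions for j in second_positions)


near = "Mr Twain has returned to America"
far = "Twain " + "word " * 20 + "America"

print(within_n_words(near, "Twain", "America", 10))  # True
print(within_n_words(far, "Twain", "America", 10))   # False
```

Counting the articles that pass a test like this, year by year, is essentially what the proximity searches in my thesis amount to.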

As for more advanced search methodologies like datamining or ‘culturomics’ – the chances of seeing the necessary tools introduced into the new database are slim-to-none.

 

Interface

The BNA’s interface is a mixed bag. It includes some welcome new additions. Each search result is now accompanied by a snippet of scanned text which helps users to decide whether an article is relevant before opening it – this should save a lot of time when wading through thousands of hits. Similarly, articles are now displayed within the full newspaper page – this makes it possible to zoom out using your mouse’s scroll wheel and explore the rest of the page. This should please historians who have (quite rightly) been warning us about the danger of viewing articles in isolation.

Unfortunately, this is where the good news ends. The BNA’s interface suffers from at least two major problems:

  1. No hit-term highlighting.
    In previous databases, keywords would be highlighted in colour whenever you opened an article. This made it easy to quickly identify which parts of long articles you wanted to read. Every database since the Times Digital Archive has had this feature – it’s absolutely essential. Without hit-term highlighting, wading through a 2000-word article in search of a single keyword is a laborious chore. To do this 100 times in a day is infuriating and massively slows down the research process. I can’t even begin to fathom why the BNA doesn’t include it. Its absence is an inexcusable step backwards. If another element of the interface prevents the use of hit-term highlighting (such as the nice new zoomable images) then it needs to be unceremoniously scrapped. Right now.
  2. Saving articles.
    In the 19th Century British Library Newspapers database, downloading an article was as easy as right-clicking it and saving it to a relevant folder. It was quick, easy, flexible, and resulted in easily reusable jpg files. Now, articles can only be downloaded as full-page pdfs. If you want to paste an article into a Word document, slot it into a PowerPoint presentation, or upload it to Twitter, you’ll have to convert it back into a jpg. To make matters worse, the quality of these files is embarrassingly low – in fact, it’s virtually impossible to read them. Here’s a sample:

Fortunately, a solution is at hand: for the low, low price of £35.95 the good people at brightsolid will print out a high-quality version of the page and send it to you through the post. Alternatively, you might prefer to use the print-screen key or the ‘Snipping Tool’ included with recent versions of Windows and save a more readable version for free.

OCR

Ensuring the accuracy of optical character recognition software (OCR) has always been one of the biggest challenges facing newspaper digitisation projects. Even the best software produces patchy results – some articles are transcribed with 100% accuracy, whilst others end up a garbled mess.  As a result, software companies have typically preferred to hide raw OCR text from users; if we knew how inconsistent it was, they worry, we’d lose all faith in their product. So, it’s refreshing to see that the BNA openly displays raw, uncorrected OCR text alongside articles. It might put some users off, but we end up with a much better feel for how accurate our searches are.

More impressively, the BNA allows users to correct OCR errors and improve the database for other users. The interface for this process works fairly well. Lines of OCR text are displayed for correction on the left, and a black box highlights the specific area of the article which needs to be transcribed. A red box might have been slightly easier to see amidst the newsprint, but perhaps I’m being picky. In truth, the fact that this idea has been implemented so effectively makes the absence of hit-term highlighting doubly perplexing.

 

It remains to be seen how many users will bother to make corrections. I’d like to see the process incentivised a bit more – perhaps we could earn credits (more on them shortly) for each article we correct? It’s also unclear how the BNA intends to moderate corrections and prevent people from defacing the archive. However, I don’t want to be too critical of what is undoubtedly a step in the right direction. Whilst this form of ‘crowdsourcing’ won’t deliver 100% accurate OCR across the whole database – it would take thousands of users correcting around the clock to keep up with the 8,000 new pages added each day – it’s certainly better than nothing.

In addition to OCR corrections, users can also ‘tag’ articles with their own descriptive keywords. If enough users take advantage of this feature it promises to be another tremendous innovation. I suspect it’ll be particularly useful for finding images.

 

Subscriptions

Finally, we reach the dreaded question: how much does all of this cost? It would be nice if the British Library followed the example of their colleagues in Australia and New Zealand and allowed us to explore the archive for free. Sadly, in order to cover the cost of digitisation, the British Library has had to turn the content over to a commercial publisher. Unlike their previous partner Gale (which caters primarily to the academic market), brightsolid has a background in targeting amateur genealogists with websites like findmypast.co.uk and 1911census.co.uk. As a result, the BNA is presently only available to individual subscribers. This renders it immediately unusable for teaching. JISC claim to be in negotiations with the British Library and brightsolid to provide institutional access to the database – until this happens, the BNA won’t be of any use in the classroom.

Three packages are currently available to individual subscribers:

  • 2 days (500 credits) – £6.95
  • 30 days (3000 credits) – £29.95
  • 12 months (unlimited access*) – £79.95

The ‘credit’ system is a bit complicated. It costs 5 credits to view an article published over 107 years ago in black and white, 10 credits to view similar articles in colour, and 15 credits to view articles published within the last 107 years. It’s fair to say, having bought the 2 day package to test the database out, that these credits don’t go very far. Browsing through one 20th century issue of the Nottingham Evening Post wiped out a quarter of my credits in five minutes.
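A quick back-of-the-envelope calculation makes the point. The Python sketch below uses only the prices quoted above. The function names are my own, and I'm assuming that anything published exactly 107 years ago is charged at the higher rate, since the website isn't clear about the boundary.

```python
def view_cost(years_since_publication, colour=False):
    """Credits charged for one view, using the rates quoted in this review.

    Assumption: material exactly 107 years old counts as 'within the
    last 107 years' and is charged at the higher rate.
    """
    if years_since_publication <= 107:
        return 15
    return 10 if colour else 5


def views_affordable(credits, cost_per_view):
    """Number of whole views that a credit balance will cover."""
    return credits // cost_per_view


packages = {"2 days": 500, "30 days": 3000}
for name, credits in packages.items():
    print(name,
          views_affordable(credits, view_cost(150)),               # old, black and white
          views_affordable(credits, view_cost(150, colour=True)),  # old, colour
          views_affordable(credits, view_cost(80)))                # under 107 years old
# 2 days 100 50 33
# 30 days 600 300 200

# A quarter of the 2-day package, spent on twentieth-century material:
print(views_affordable(500 // 4, view_cost(80)))  # 8
```

So a quarter of the 2-day package covers just eight views of twentieth-century material, and the whole package stretches to only 33.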

For serious researchers, the 12 month unlimited subscription is the only real option. At first glance, £80 seems fairly reasonable – I’d spend way more than that on a two-day research trip to Colindale. However, buried in the small-print is a rather unpleasant surprise. If subscribers to the ‘unlimited’ package view more than 1000 pages in a calendar month, their account is frozen until the start of the next month. For some researchers, this cap will be perfectly tolerable. Unfortunately, as a press historian I’d expect to burn through at least 500 page views on a routine day of research. I’d be locked out of the database for 28 days of every month (save February, which has 28 days clear and 29 in a leap year). These quotas place an unacceptable restriction on research – I never want to be in a situation where my decision to read an article is determined not by its potential value to my research, but by the number of credits left in my account.
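For the sceptical, here is the arithmetic behind that lock-out figure as a small Python sketch. The function name is hypothetical. I'm assuming research happens every day of the month at a steady rate, and that the day on which the cap is reached still counts as a usable day.

```python
import calendar
import math


def locked_out_days(year, month, cap=1000, views_per_day=500):
    """Days in a month spent locked out after hitting the page-view cap.

    Assumptions: research every day at a constant rate; the day the
    cap is reached still counts as usable.
    """
    days_in_month = calendar.monthrange(year, month)[1]
    usable_days = min(days_in_month, math.ceil(cap / views_per_day))
    return days_in_month - usable_days


print(locked_out_days(2011, 11))  # 28 (a 30-day month)
print(locked_out_days(2011, 12))  # 29 (a 31-day month)
print(locked_out_days(2012, 2))   # 27 (February in a leap year)
```

At 500 page views a day the cap is reached after two days, whatever the month. A more leisurely 30 views a day would never trigger it.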

I e-mailed the archive’s customer service team and informed them that the cap would make many forms of academic research extremely difficult. They informed me that the BNA was intended for ‘personal use’ only. It’s nice to know where we stand.

[edit: good news – the 1000 cap seems to have been relaxed]

Conclusion

In sum, there’s a lot to like about The British Newspaper Archive. The open approach to OCR, the introduction of crowdsourcing, and, above all, the incredible range of new content make it a potentially fantastic new tool for researchers. I want to love it. Unfortunately, it currently suffers from at least four critical faults. The lack of hit-term highlighting, the inability to download a usable version of an article, the absence of institutional subscriptions, and the misjudged cap on the ‘unlimited’ package are all in need of urgent attention. Until these issues are fixed, its potential for academic research (not to mention its usefulness in the classroom) will remain frustratingly limited.