Review: The British Newspaper Archive
Christmas arrived early for historians this week. On Tuesday morning, amid a blaze of publicity, the British Library unveiled the new home of its digitised newspaper collection – The British Newspaper Archive (BNA).
Developed in partnership with commercial publisher brightsolid, the BNA provides online access to hundreds of eighteenth, nineteenth and early-twentieth-century newspapers. It’s an ambitious, long-term project – more than 3 million pages have already been digitised and the library hopes to reach 40 million pages over the next decade. If the project is successful it’ll have important implications for both professional and amateur historians. In the next couple of years, the British Library intends to close its newspaper archive at Colindale and transfer its holdings to a remote storage facility in Boston Spa. Whilst hard copies of undigitised newspapers should still be accessible, it’s clear that the British Library wants more and more researchers to access its collections online. The BNA, in other words, is the shape of things to come – and it’s vitally important that the British Library gets it right.
The archive currently provides access to 170 newspapers. Many of these papers were available in the 19th Century British Library Newspapers database and have been transferred directly into the new archive. Only the Penny Illustrated Paper (which always seemed slightly out of place in the previous database) has been omitted from the new collection. Unfortunately, this means that gaps in the original archive are still a problem – the Northern Echo, for example, still has content missing from the crucial period between 1871 and 1872 when W. T. Stead first took over as editor. On the plus side, glitches from the previous database have been solved. The Preston Chronicle, for example, is no longer incorrectly listed as the Preston Guardian.
The real strength of the archive lies in its new content. 100 new newspapers are now accessible for the first time – almost all of them provincial papers. A full list of these papers is available here. Highlights from the new collection include long runs of the Bath Chronicle (1760-1903), the Chelmsford Chronicle (1783-1882), the Leeds Times (1833-1901), the Manchester Evening News (1870-1903), the Northampton Mercury (1770-1903), the Worcestershire Chronicle (1838-1903), and the Yorkshire Gazette (1819-1899). The new database is far less London-centric than previous offerings – most areas of the country are now represented by at least one paper, and major cities like Manchester, Liverpool, Birmingham, and Sheffield have multiple titles.
Whilst the majority of this new content focuses on the nineteenth century, some papers stretch deep into the eighteenth and twentieth centuries. Twelve titles include at least a decade of issues from the eighteenth century, including the Birmingham Gazette which goes all the way back to 1741. Whilst numerous papers cover the first three years of the twentieth century, only four titles stretch beyond the first decade: The Cheltenham Looker-On (1913), The Motherwell Times (1924), the Nottingham Evening Post (1944), and the Western Times (1940). It’s extremely encouraging to see the British Library push beyond the boundary of 1900 – let’s hope that this is the first step towards bridging the ‘digital divide’ which has recently sprung up between 19th and 20th century history.
Perhaps the most exciting thing about the new database is the promise of more content. Unlike previous databases, which were updated in bulk every year or so, the holdings of the BNA are constantly being expanded. 8,000 new pages are supposedly being uploaded to the website every day. Unfortunately, it’s not possible to quickly see which papers have recently been added or updated – this makes it difficult to keep abreast of the archive’s changing contents. Nor, for that matter, does the British Library give any hints about which papers will be digitised in the future. There’s no way of knowing whether a publication that’s critical to your research will appear in the archive tomorrow morning or in 10 years’ time. There’s something exciting about this I suppose. As the website itself points out, “who knows what you’ll find tomorrow, next week, next year, and beyond”. However, I suspect we’ll have to rethink and shore up our methodologies in order to build research projects on constantly shifting sands – more on this in a future blog post.
A good search engine is crucial to the success of a digital archive – the methodological possibilities of these resources are determined primarily by the questions they allow us to ask. The BNA has most of the tools we’ve come to expect from newspaper databases. Users can perform a basic ‘Search’ by inputting keywords into a single search box, or they can construct more complex queries using the ‘Advanced Search’ page. The ‘Advanced’ interface allows users to put keywords into four boxes:
- The first option searches for articles which include all inputted keywords. So, putting “America, Twain, New York” into the search box will find all articles which include these three terms somewhere in the text. Articles which only include the words ‘America’ and ‘Twain’ will not be found. For those of you familiar with Boolean searches, this is basically the equivalent of using ‘AND’.
- The second option searches for articles which include any (but not necessarily all) of the inputted keywords. So, this time a search for “America, Twain, New York” will find every article in which at least one of these terms appears. This will return articles containing the word ‘America’ which don’t feature ‘Twain’ or ‘New York’. In Boolean terms, this is a straightforward ‘OR’ search.
- The third option allows users to exclude articles containing certain keywords. So, we might search for articles featuring the word ‘Twain’ which do not contain a reference to ‘America’. In Boolean terms, this is the equivalent of ‘NOT’.
- Finally, the fourth box allows users to search for a complete phrase. This returns articles which feature keywords in a particular order and is broadly equivalent to enclosing an ordinary keyword search in quotation marks.
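For readers who find Boolean logic easier to follow in code, the four boxes can be sketched roughly as below. This is a toy illustration and not how the BNA's search engine actually works: the function names (`contains`, `matches`, `search`), the three sample articles, and the assumption that all four boxes are combined with each other are mine.

```python
import re


def tokens(text):
    """Split text into lower-case word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def contains(text, term):
    """True if a (possibly multi-word) term appears as consecutive whole words."""
    words, term_words = tokens(text), tokens(term)
    n = len(term_words)
    return n > 0 and any(
        words[i:i + n] == term_words for i in range(len(words) - n + 1)
    )


def matches(text, all_of=(), any_of=(), none_of=(), phrase=None):
    """Apply the four 'Advanced Search' boxes to one article's text."""
    if not all(contains(text, t) for t in all_of):             # box 1: AND
        return False
    if any_of and not any(contains(text, t) for t in any_of):  # box 2: OR
        return False
    if any(contains(text, t) for t in none_of):                # box 3: NOT
        return False
    if phrase and not contains(text, phrase):                  # box 4: exact phrase
        return False
    return True


# Hypothetical sample articles, invented purely for illustration.
articles = {
    "a": "Mark Twain sailed from New York to tour America.",
    "b": "Twain lectured in London last night.",
    "c": "Trade with America continues to grow.",
}


def search(**boxes):
    """Return the ids of the sample articles that satisfy every filled-in box."""
    return sorted(k for k, v in articles.items() if matches(v, **boxes))
```

With these samples, `search(all_of=["America", "Twain", "New York"])` finds only article `a`, the same terms passed as `any_of` find all three, and `search(all_of=["Twain"], none_of=["America"])` finds only `b`.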
These searches can then be filtered by:
- Place of publication
- Publication title
- Article type (Advertisement, Article, Family Notice, Illustrated, Miscellaneous)
- Public tag – more on this later.
Search results can be ordered by either ‘relevance’ or ‘date’ – this makes a nice change from Gale databases which only display results in chronological order.
Once a search has been performed, results can be filtered again by date, title, region, country, place, article type, and ‘public tag’. Crucially, the public tag feature allows articles to be sorted by additional categories, including: classifieds, adverts, news, commerce, arts, sport, crime, etc. The accuracy of these tags (many of which seem to have been imported from the previous database) isn’t great, but they can be helpful when filtering out irrelevant articles.
All of this works fairly well – if anything, the search engine is faster and more user-friendly than in previous databases. Unfortunately, some functionality has been lost. Most importantly, ‘proximity operators’ are no longer available. In the previous database, a search for “Twain n10 America” returned all articles in which the words ‘Twain’ and ‘America’ appeared no more than 10 words apart. This was a tremendously useful way to filter out results in which keywords appeared too far apart – it saved a lot of time and opened up a range of interesting methodological possibilities. In my thesis, for example, I use proximity operators to track changes in the number of articles featuring the words ‘America’ and ‘Competition’ in close proximity. It would be tremendously useful if this essential tool was reinstated in the new archive.
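To make clear what has been lost, here is a minimal sketch of a proximity check and of the kind of year-by-year count it makes possible. It assumes that "no more than 10 words apart" means the two word positions differ by at most 10, which is my reading rather than a documented definition, and the names `within` and `hits_per_year` and the sample data are hypothetical.

```python
import re
from collections import Counter


def within(text, first, second, max_gap=10):
    """True if `first` and `second` occur at most `max_gap` word positions apart."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    first_pos = [i for i, w in enumerate(words) if w == first.lower()]
    second_pos = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= max_gap for i in first_pos for j in second_pos)


def hits_per_year(dated_articles, first, second, max_gap=10):
    """Count, per year, the articles in which the two words appear close together."""
    return Counter(
        year for year, text in dated_articles
        if within(text, first, second, max_gap)
    )


# Hypothetical (year, text) pairs, invented purely for illustration.
sample = [
    (1895, "Growing competition from America alarms British manufacturers."),
    (1895, "America was discussed at length. " + "Filler word. " * 20 + "Local competition."),
    (1901, "The competition which America offers in steel is severe."),
]
```

Here `hits_per_year(sample, "America", "competition")` counts one article for 1895 and one for 1901; the second 1895 article contains both words but too far apart to qualify.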
As for more advanced search methodologies like datamining or ‘culturomics’ – the chances of seeing the necessary tools introduced into the new database are slim-to-none.
The BNA’s interface is a mixed bag. It includes some welcome new additions. Each search result is now accompanied by a snippet of scanned text which helps users to decide whether an article is relevant before opening it – this should save a lot of time when wading through thousands of hits. Similarly, articles are now displayed within the full newspaper page – this makes it possible to zoom out using your mouse’s scroll wheel and explore the rest of the page. This should please historians who have (quite rightly) been warning us about the danger of viewing articles in isolation.
Unfortunately, this is where the good news ends. The BNA’s interface suffers from at least two major problems:
- 1. No hit-term highlighting.
In previous databases, keywords would be highlighted in colour whenever you opened an article. This made it easy to quickly identify which parts of long articles you wanted to read. Every database since the Times Digital Archive has had this feature – it’s absolutely essential. Without hit-term highlighting, wading through a 2000 word article in search of a single keyword is a laborious chore. To do this 100 times in a day is infuriating and massively slows down the research process. I can’t even begin to fathom why the BNA doesn’t include it. Its absence is an inexcusable step backwards. If another element of the interface prevents the use of hit-term highlighting (such as the nice new zoomable images) then it needs to be unceremoniously scrapped. Right now.
- 2. Saving articles.
In the 19th Century British Library Newspapers database, downloading an article was as easy as right clicking it and saving it to a relevant folder. It was quick, easy, flexible, and resulted in easily reusable jpg files. Now, articles can only be downloaded as full-page pdfs. If you want to paste an article into a Word document, slot it into a PowerPoint presentation, or upload it to Twitter, you’ll have to convert it back into a jpg. To make matters worse, the quality of these files is embarrassingly low – in fact, it’s virtually impossible to read them. Here’s a sample:
Fortunately, a solution is at hand: for the low, low price of £35.95 the good people at brightsolid will print out a high-quality version of the page and send it to you through the post. Alternatively, you might prefer to use the print-screen key or the ‘Snipping Tool’ included with recent versions of Windows and save a more readable version for free.
Ensuring the accuracy of optical character recognition software (OCR) has always been one of the biggest challenges facing newspaper digitisation projects. Even the best software produces patchy results – some articles are transcribed with 100% accuracy, whilst others end up a garbled mess. As a result, software companies have typically preferred to hide raw OCR text from users; if we knew how inconsistent it was, they worry, we’d lose all faith in their product. So, it’s refreshing to see that the BNA openly displays raw, uncorrected OCR text alongside articles. It might put some users off, but we end up with a much better feel for how accurate our searches are.
More impressively, the BNA allows users to correct OCR errors and improve the database for other users. The interface for this process works fairly well. Lines of OCR text are displayed for correction on the left, and a black box highlights the specific area of the article which needs to be transcribed. A red box might have been slightly easier to see amidst the newsprint, but perhaps I’m being picky. In truth, the fact that this idea has been implemented so effectively makes the absence of hit-term highlighting doubly perplexing.
It remains to be seen how many users will bother to make corrections. I’d like to see the process incentivised a bit more – perhaps we could earn credits (more on them shortly) for each article we correct? It’s also unclear how the BNA intends to moderate corrections and prevent people from defacing the archive. However, I don’t want to be too critical of what is undoubtedly a step in the right direction. Whilst this form of ‘crowdsourcing’ won’t deliver 100% accurate OCR across the whole database – it would take thousands of users correcting around the clock to keep up with the 8,000 new pages added each day – it’s certainly better than nothing.
In addition to OCR corrections, users can also ‘tag’ articles with their own descriptive keywords. If enough users take advantage of this feature it promises to be another tremendous innovation. I suspect it’ll be particularly useful for finding images.
Finally, we reach the dreaded question: how much does all of this cost? It would be nice if the British Library followed the example of their colleagues in Australia and New Zealand and allowed us to explore the archive for free. Sadly, in order to cover the cost of digitisation, the British Library has had to turn the content over to a commercial publisher. Unlike their previous partner Gale (which caters primarily to the academic market), brightsolid has a background in targeting amateur genealogists with websites like findmypast.co.uk and 1911census.co.uk. As a result, the BNA is presently only available to individual subscribers. This renders it immediately unusable for teaching. JISC claim to be in negotiations with the British Library and brightsolid to provide institutional access to the database – until this happens, the BNA won’t be of any use in the classroom.
Three packages are currently available to individual subscribers:
- 2 days (500 credits) – £6.95
- 30 days (3000 credits) – £29.95
- 12 months (unlimited access*) – £79.95
The ‘credit’ system is a bit complicated. It costs 5 credits to view an article published over 107 years ago in black and white, 10 credits to view similar articles in colour, and 15 credits to view articles published within the last 107 years. It’s fair to say, having bought the 2 day package to test the database out, that these credits don’t go very far. Browsing through one 20th century issue of the Nottingham Evening Post wiped out a quarter of my credits in five minutes.
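The pricing rule is easier to reason about once written down. The sketch below encodes the three rates quoted above; the function names are hypothetical, and treating an article that is exactly 107 years old as falling "within the last 107 years" is my assumption, since the small print quoted here doesn't settle the boundary.

```python
def credits_for(age_years, colour=False):
    """Credits charged to view one article, using the rates quoted above."""
    if age_years <= 107:          # published within the last 107 years (boundary assumed)
        return 15
    return 10 if colour else 5    # older material: colour costs double black and white


def views_per_package(credits, cost_per_view):
    """How many whole article views a given number of credits will buy."""
    return credits // cost_per_view
```

On these figures the 500-credit package buys 100 views of older black-and-white articles but only 33 views of twentieth-century ones, and a quarter of the package (125 credits) covers roughly eight views at the 15-credit rate.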
For serious researchers, the 12 month unlimited subscription is the only real option. At first glance, £80 seems fairly reasonable – I’d spend way more than that on a two-day research trip to Colindale. However, buried in the small-print is a rather unpleasant surprise. If subscribers to the ‘unlimited’ package view more than 1000 pages in a calendar month, their account is frozen until the start of the next month. For some researchers, this cap will be perfectly tolerable. Unfortunately, as a press historian I’d expect to burn through at least 500 page views on a routine day of research. I’d be locked out of the database for 28 days of every month (save February, which has 28 days clear and 29 in a leap year). These quotas place an unacceptable restriction on research – I never want to be in a situation where my decision to read an article is determined not by its potential value to my research, but by the number of credits left in my account.
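The lock-out arithmetic can be checked with a few lines of code. This is a back-of-the-envelope sketch under assumptions of my own: research happens every day from the first of the month, the account freezes at the end of the day on which the cap is reached, and the function name `locked_days` and the example dates are purely illustrative.

```python
import calendar
import math


def locked_days(year, month, cap=1000, views_per_day=500):
    """Days of the month left over once the monthly page-view cap is reached."""
    days_in_month = calendar.monthrange(year, month)[1]
    days_to_cap = math.ceil(cap / views_per_day)   # 1000 / 500 = 2 working days
    return max(days_in_month - days_to_cap, 0)
```

At 500 page views a day the cap is reached after two days, so a 30-day month leaves 28 days locked out, a 31-day month leaves 29, and February leaves 26 or 27.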
I e-mailed the archive’s customer service team and informed them that the cap would make many forms of academic research extremely difficult. They informed me that the BNA was intended for ‘personal use’ only. It’s nice to know where we stand.
[edit: good news - the 1000 cap seems to have been relaxed]
In sum, there’s a lot to like about The British Newspaper Archive. The open approach to OCR, the introduction of crowdsourcing, and, above all, the incredible range of new content make it a potentially fantastic new tool for researchers. I want to love it. Unfortunately, it currently suffers from at least four critical faults. The lack of hit-term highlighting, the inability to download a usable version of an article, the absence of institutional subscriptions, and the misjudged cap on the ‘unlimited’ package are all in need of urgent attention. Until these issues are fixed, its potential for academic research (not to mention its usefulness in the classroom) will remain frustratingly limited.