Digitising the British Library
When you hear the word eBooks, you probably think of the latest David Mitchell novel or cutting-edge devices such as the Sony Reader. But the British Library is currently embarking on a gargantuan project to create digital copies of 100,000 books published more than a century before Mitchell rattled a keyboard in anger.
PC Pro was recently given a behind-the-scenes tour of this impressive feat of computing power and old-fashioned manual labour. Deep in the bowels of the British Library, a small team of contractors are transforming the 19th-century scripts into digital pages that will be available for anyone to view from their web browser. Rare British books that have sat untouched on shelves for decades will now be fully searchable by anyone looking for something beyond Dickens for a view of 19th-century life, culture and history.
“Teachers have been wanting to expand the curriculum to engage with other types of text, but it’s been impossible because the texts weren’t available,” says Dr Kristian Jensen, head of British and early printed collections at the British Library. “Now we can digitise things as they appear on the shelves – we don’t need to take a book here, a book there.”
We’ll explore the technology and processes involved in digitising 25 million pages. We’ll also reveal exactly how books go from shelf to screen, and the difficulties posed by dealing with such delicate manuscripts. Finally, we’ll examine the copyright implications of making books that are 150 years old available on the web.
The British Library has taken steps towards digitisation in the past – go to www.bl.uk/onlinegallery/ttp/ttpbooks.html, for example, and you can flick through the pages of the original Alice’s Adventures in Wonderland or Mozart’s Musical Diary. However, this is the first time the Library has undertaken a process of mass digitisation, and the logistics of scanning shelves upon shelves of books are far more complicated than hand-picking a selection of rare texts.
As a result, it needed a helping hand with the technology. Microsoft and Google have both been digitising books from US libraries and adding them to their rival online services for some time, and the British Library opted to partner with the former. Choosing Microsoft as a partner for any archiving project brings not only a wealth of experience and financial clout, but also a degree of controversy, partially due to Microsoft’s long-held affection for closed standards. Is there not a danger that Britain’s literary heritage will end up wrapped in Microsoft’s closed formats?
It’s an accusation the Library is quick to reject. The books are being digitised and output in a variety of non-Microsoft formats, including JPEG 2000 and plain text. The Library is also using established standards, such as Metadata Encoding and Transmission Standard (METS) and Analyzed Layout and Text Object (ALTO), in the collection of metadata to ensure the venture is consistent with other digitisation projects and will remain readable in the long term. What’s more, the Library retains the rights to all the data being collected.
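To give a flavour of why an open format such as ALTO matters, here is a minimal Python sketch that pulls plain text out of an ALTO-style page description using nothing more than the standard library. The sample XML is invented for illustration and is not taken from the Library's files; the element and attribute names (TextLine, String, CONTENT) follow the published ALTO schema, while the function names are our own. How the Library's real pipeline produces its plain-text output isn't something we were shown, so treat this as an assumption-laden illustration rather than a description of it.

```python
import xml.etree.ElementTree as ET

# Invented sample page: real ALTO files carry far more layout detail.
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine>
            <String CONTENT="A" HPOS="100" VPOS="200" WIDTH="20" HEIGHT="40"/>
            <SP/>
            <String CONTENT="Tour" HPOS="130" VPOS="200" WIDTH="80" HEIGHT="40"/>
            <SP/>
            <String CONTENT="of" HPOS="220" VPOS="200" WIDTH="40" HEIGHT="40"/>
          </TextLine>
          <TextLine>
            <String CONTENT="the" HPOS="100" VPOS="250" WIDTH="60" HEIGHT="40"/>
            <SP/>
            <String CONTENT="Provinces" HPOS="170" VPOS="250" WIDTH="180" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Drop any '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(xml_source):
    """Return the page text, one output line per ALTO TextLine element."""
    root = ET.fromstring(xml_source)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Each recognised word is a String element; its text sits in CONTENT.
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)


if __name__ == "__main__":
    print(alto_to_text(SAMPLE_ALTO))
```

Because each word is stored alongside its position on the page, the same file can feed both a search index and an overlay that highlights matches on the scanned image, and any XML parser can read it without proprietary software.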
So what does Microsoft get out of the deal? In return for financing the project, Microsoft can host the collection on its Live Search Books site (http://books.live.com), and will probably have the collection live before the Library manages to update its own website. The Library wouldn’t reveal for how long Microsoft has the licensing rights, but for such an expensive project it’s safe to assume this isn’t a short-term deal.