A couple of years ago, you’d still hear arguments about whether or not electronic readers could ever take over from print-on-paper. That already feels like a long time ago. I found myself in a mid-market hotel just before Christmas, and when I came down for breakfast there was someone (or a whole family) reading news from an iPad at every table, except for the one that had a Kindle. I had to make do with my Android phone and felt a little out of place.

My Sony Reader hasn’t made the grade. Its page turning is just too slow and it’s too much of a drag to download content to it, but the final straw was the way it handles different eBook formats – unpredictably and far from gracefully. PDFs are the worst case: the Reader offers only three text sizes and, if you’re unlucky, none will look right.

Commercial books properly formatted in ePub look fine on the Sony – cover, contents and navigation – but I rarely read them. The problem is that I never buy this kind of book from Waterstones or publishers’ websites. What novels I do read, I still read on paper (possibly decades or centuries old), and the rest of the time I read non-fiction that’s rarely, if ever, available as an eBook. Perhaps a third of my reading is done from a webpage – including The Guardian website, Open Democracy, Arts & Letters, blogs and whitepapers from numerous tech sites – using a laptop or phone.

I have, however, collected an extensive library of classic texts and reference works that I use a lot, stored on my laptop so they’re always available offline, and it’s there the Sony really fell down. I get most of these books from the Internet Archive, where they’re typically available in several formats: PDF and PDF facsimile (scanned page images), ePub, Kindle, Daisy, plain text and DjVu (a compressed format for scanned documents).

The Internet Archive is a non-profit organisation that relies on voluntary labour to scan works, so inevitably, most documents are raw OCRed output that hasn’t been cleaned up manually. Really old books set in lovely letterpress typefaces such as Garamond and Bodoni are the saddest, because Optical Character Recognition (OCR) sees certain characters as numerals, so the texts are peppered with errors such as “ne7er” and “a8solute”. Many such books also contain a lot of page furniture – repeating book and chapter titles in headers or footers – that scanning leaves embedded throughout the text, which is extremely irritating if you consult them often.

Facsimile versions

One solution is to download a facsimile version, but that’s glacially slow to read on the Sony Reader, taking ten seconds to turn each page and looking crap in black and white. On a laptop or iPad, in colour, it’s a fine way to read (even preserving pencilled margin notes), but it isn’t searchable – so I always have to download a text-based PDF or plain-text version too.

I even started cleaning up certain books myself, downloading a plain-text version and using Microsoft Word (of all things), which actually has powerful regular expression and replacement-expression facilities, although these are well-hidden and have lamentably poor Help files.

I soon learned how to quickly bulk-remove page numbers and titles, auto-locate and reformat subheads, and even cull improbable digit-in-the-middle words such as “ne7er”. However, outputting the cleaned-up result as PDFs proved a lottery on the Sony when it came to text sizing, the contents page and preserving embedded bookmarks (you need one per chapter for navigation purposes). For many books, I found that an RTF version actually looks and works better than a PDF.
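
For anyone who would rather script this kind of clean-up than wrestle with Word, here is a minimal Python sketch of the same idea. It is not the Word procedure described above: the function names and the sample text are invented for illustration, and it assumes that page numbers and running titles each sit on a line of their own. Rather than deleting digit-in-the-middle words outright, it only lists them so that a human can decide what the OCR meant.

```python
import re

# A line holding nothing but a page number, e.g. "  42  " (assumed layout).
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")

# A word with a single digit sandwiched between letters, e.g. "ne7er".
DIGIT_IN_WORD = re.compile(r"\b[A-Za-z]+\d[A-Za-z]+\b")


def clean_ocr_text(text, running_titles):
    """Drop page-number lines and lines that merely repeat a running title."""
    titles = {title.strip().lower() for title in running_titles}
    kept = []
    for line in text.splitlines():
        if PAGE_NUMBER.match(line):
            continue  # page furniture: a bare page number
        if line.strip().lower() in titles:
            continue  # page furniture: a repeated book or chapter title
        kept.append(line)
    return "\n".join(kept)


def suspicious_words(text):
    """Return the distinct digit-in-the-middle words, sorted, for manual review."""
    return sorted(set(DIGIT_IN_WORD.findall(text)))


if __name__ == "__main__":
    sample = (
        "EGOTISM IN GERMAN PHILOSOPHY\n"
        "It was ne7er an a8solute rule.\n"
        "  12  \n"
        "The year 1911 came."
    )
    cleaned = clean_ocr_text(sample, ["Egotism in German Philosophy"])
    print(cleaned)
    print(suspicious_words(cleaned))
```

Genuine numbers such as years are left alone, because the pattern only matches a digit with letters on both sides. Anything the list throws up still needs checking by eye before it is replaced.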

Someone tipped me off to try Calibre, a free, open-source eBook library manager that converts between different eBook formats and, in particular, can output in Sony’s own LRF file format, which proved more reliable than PDF. It was already too late for me, though. Calibre works well, but is quite techie to use and, like Sony’s proprietary Reader software, it maintains its own book database, so there’s another file system to deal with.

Eventually, I just couldn’t be bothered. I checked Google Books’ offer of a million free-to-download public domain titles, only to discover that they’re the same copies I already have from the Internet Archive (once some book geek has scanned Santayana’s “Egotism in German Philosophy”, no-one else will).

My gut feeling is that, Kindle notwithstanding, none of the current eBook formats will be the eventual winner and that plain-old HTML, in its HTML5 incarnation, will become the way we all read stuff on our tablets in a few years. Perhaps PDF, too.

Whatever wins, we’ll need to recruit a second generation of volunteer labour to clean up all those documents scanned by the first generation once the Google book project hits its stride. Does anyone fancy signing up for the digital bookworm equivalent of toiling endlessly away in the cane fields?

