When Tim Berners-Lee was working on the ideas that were to become the World Wide Web, he called his first prototype browser Enquire, short for ‘Enquire Within upon Everything’ – a Victorian encyclopedia that had caught his imagination as a child. His original vision of the Web as a universal and freely accessible research tool was far-reaching, but the actuality quickly outgrew even his vision. The amount and range of information out there nowadays is almost unimaginable, which raises two problems: how can an end user find the information they’re looking for, and, from a publisher’s perspective, how can you make sure you’re reaching the widest possible audience?
Clearly, users needed a helping hand, and the first major navigational guide to the Web, Yahoo!, was developed in 1994 by two Stanford University students, David Filo and Jerry Yang. It began life as little more than a list of personal bookmarks but soon grew into an extensive categorised directory.
Directories like Yahoo! and the Open Directory Project at dmoz act like contents pages, pointing users to the right area, but what you really need is a full index that takes you directly to the page you’re looking for. Indexing the Web’s ever-growing and ever-changing content manually would be impossible, which is where search engines come in. Early pioneers such as the World Wide Web Wanderer and WebCrawler introduced the idea of data-collecting ‘spider’ programs that automatically follow every link they find, adding the text of the pages they traverse to a database. Users can then enter keywords to search for, and a Search Engine Results Page (SERP) of matching links is generated automatically. The boost to browsing efficiency was such that a new breed of commercialised search engines, including Lycos, Excite and AltaVista, took over as the natural ‘portals’ for the vast majority of web traffic. For publishers, a high ranking on a popular keyword search guaranteed serious traffic, but if your page didn’t appear on the first SERP returned, the number of visitors you could expect tailed off dramatically.
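The crawl-and-index idea those early spiders pioneered can be sketched in a few lines of Python. The ‘web’ below is a hypothetical in-memory dictionary of pages rather than real HTTP fetches, purely to keep the illustration self-contained; real spiders obviously download pages over the network.

```python
from collections import deque

# Toy 'web': URL -> (page text, outbound links). Entirely made-up data,
# standing in for pages a real spider would fetch over HTTP.
PAGES = {
    "/home": ("welcome to the web directory", ["/cats", "/dogs"]),
    "/cats": ("all about cats and kittens", ["/home"]),
    "/dogs": ("all about dogs and puppies", ["/cats"]),
}

def crawl(start):
    """Breadth-first spider: visit every reachable page once,
    adding its words to an inverted index (keyword -> set of URLs)."""
    index = {}
    seen, queue = {start}, deque([start])
    while queue:
        url = queue.popleft()
        text, links = PAGES[url]
        for word in text.split():
            index.setdefault(word, set()).add(url)
        for link in links:          # follow every link found on the page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

def search(index, keyword):
    """A toy SERP: every indexed page containing the keyword."""
    return sorted(index.get(keyword, set()))

index = crawl("/home")
print(search(index, "cats"))   # → ['/cats']
```

The inverted index is the key data structure: it is built once at crawl time, so answering a query needs only a single dictionary lookup rather than a re-scan of every page.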
Search engines quickly came to control the majority of web traffic, but just how effective can such automated indexing systems be? Defining relevance simply as the number of times a keyword appears on a given page is clearly inadequate, since longer pages are likely to produce more keyword matches. To improve the quality of searches, each engine developed its own algorithm for estimating relevance, based, for example, on keyword density, proximity or prominence rather than absolute number. However, even with a high keyword rating, a page might still be about a different subject entirely. To further improve the quality of results, these search-engine algorithms needed to look beyond straight textual analysis and take other relevancy factors into account.
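The density-and-prominence idea can be made concrete with a small sketch. The weights below are illustrative guesses, not any real engine’s algorithm; the point is only that a short, on-topic page can outscore a long page with the same raw number of keyword matches.

```python
def relevance(text, keyword):
    """Toy relevance score combining keyword density (share of all words)
    and prominence (how early the first match appears), rather than the
    raw occurrence count. The 50/50 weighting is an arbitrary assumption."""
    words = text.lower().split()
    positions = [i for i, w in enumerate(words) if w == keyword.lower()]
    if not positions:
        return 0.0
    density = len(positions) / len(words)       # penalises long, padded pages
    prominence = 1 - positions[0] / len(words)  # rewards an early first match
    return 0.5 * density + 0.5 * prominence

short_page = "python tips for python beginners"
long_page = "a very long rambling page " * 20 + "python"

# One match on a huge page scores below two matches on a focused one
print(relevance(short_page, "python") > relevance(long_page, "python"))  # True
```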
Here the nature of HTML itself comes into play, because Tim Berners-Lee specifically designed the language for marking up certain kinds of content. In particular, if a keyword appears within a <TITLE>, an <H1> to <H4> heading or a <STRONG> tag, or in the ALT text of an <IMG> tag, it’s reasonable to assume that it’s significant to the page. In addition, HTML enables descriptive information about each document to be included in the page’s <HEAD> section, and many early search engines placed great importance on the <META> keywords and descriptions that authors were encouraged to include there.
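This tag-weighting heuristic is easy to prototype with Python’s standard-library HTML parser. The weights below are invented for illustration – real engines kept theirs secret – but the structure shows the idea: the same keyword counts for more inside a <TITLE> or heading than in ordinary body text.

```python
from html.parser import HTMLParser

# Hypothetical weights: a keyword in a title or heading signals more
# than one buried in body text. The exact values are made up.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "strong": 1.5}
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}  # never closed

class TagScorer(HTMLParser):
    """Scores a keyword according to which tag it appears inside."""
    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.stack = []      # currently open tags
        self.score = 0.0

    def handle_starttag(self, tag, attrs):
        if tag == "img":     # ALT text counts too, per the heuristic above
            alt = dict(attrs).get("alt", "")
            self.score += alt.lower().split().count(self.keyword) * 1.5
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        enclosing = self.stack[-1] if self.stack else ""
        weight = TAG_WEIGHTS.get(enclosing, 1.0)  # plain text scores 1.0
        self.score += data.lower().split().count(self.keyword) * weight

page = ("<html><head><title>Cats guide</title></head>"
        "<body><h1>Cats</h1><p>All about cats</p></body></html>")
scorer = TagScorer("cats")
scorer.feed(page)
print(scorer.score)   # 5.0 (title) + 3.0 (h1) + 1.0 (body) = 9.0
```

Note that the scorer tracks a stack of open tags so a keyword is always weighted by its innermost enclosing element, and void elements like <IMG> are deliberately kept off the stack since they have no closing tag.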