A quote from 1958

Technology, so adept in solving problems of man and his environment, must be directed to solving a gargantuan problem of its own creation. A mass of technical information has been accumulated and at a rate that has far outstripped means for making it available to those working in science and engineering.

Charles P. Bourne & Douglas C. Engelbart, 
SRI Journal, Vol.2, No. 1, 1958


Search – it started earlier than you think.

A very brief history of search

In this post Martin White describes the history of search. It began earlier than you think…

Intranet Focus provides information management and intranet management consulting services. They also regularly publish a Research Note packed with great stuff.

In the November issue of their Research Note, there is an interesting piece on the history of Search. Martin White, the Managing Director, has granted me permission to publish it here (see below).

By the way – Martin has recently published a book –
Enterprise Search: Enhancing Business Performance.

It’s certainly on my Christmas list this year…

A very brief history of search

Search came into prominence with the advent of the web search services in the 1990s, notably Alta Vista, Google, Microsoft and Yahoo. However, the history of search technology goes back much further than this. Arguably the story starts with Douglas Engelbart, a remarkable electrical engineer whose main claim to fame is that he invented the mouse that is now a standard control device for personal computers. In 1959 Engelbart started up the Augmented Human Intellect program at the Stanford Research Institute in Menlo Park, California. One of his research students was Charles Bourne, who worked on whether it would be possible to transform the batch search retrieval technology developed in the 1950s into a service based on a large mainframe computer to which users could connect over a network.

By 1963 SRI was able to demonstrate the first ‘online’ information retrieval service using a cathode ray tube (CRT) device to interact with the computer. It is worth remembering that the computers being used for this service had 64K of core memory. Even at this early stage of development the facility to cope with spelling variants was implemented in the software. Other pioneers included System Development Corporation, Massachusetts Institute of Technology and Lockheed. The main focus of these online systems was to provide researchers with access to large files of abstracts of scientific literature to support research into space technology and other large-scale scientific and engineering projects.

These services were only able to search short text documents, such as abstracts of scientific papers. In the late 1960s two new areas of opportunity arose which prompted work into how to search the full text of documents. One was to support the work of lawyers who needed to search through case reports to find precedents. The second was also connected to the legal profession, and arose from the US Department of Justice deciding to break up what it regarded as monopolies in the computer industry (targeting IBM) and later the telecommunications industry, where AT&T was the target. These actions led IBM in particular to make a massive investment in full-text search, which by 1969 led to the development of STAIRS (Storage and Information Retrieval System), subsequently released in 1973 as a commercial IBM application. This was the first enterprise search application and remained in the IBM product catalogue until the mid-1990s.

One of the core approaches to information retrieval is the use of the vector space model for computing relevance, developed by Professor Gerard Salton of Cornell University over a period of two decades starting in 1963. The vector space model procedure uses a cosine vector coefficient to compare the similarity of the content of the document to the query terms. This is the basis for most of the enterprise search applications, with the notable exceptions of Recommind (which uses Probabilistic Latent Semantic Indexing) and Autonomy.
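To make the cosine idea concrete, here is a minimal Python sketch. It is an illustration only, not Salton's actual system or any vendor's implementation: it assumes plain whitespace tokenisation and raw term counts as weights, and the function names (term_vector, cosine_similarity, rank) are made up for this example.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a sparse bag-of-words vector of raw term counts (assumed weighting)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs, most similar to the query first."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


documents = [
    "full text search of legal case reports",
    "fingerprint recognition with neural networks",
    "search abstracts of scientific literature",
]
ranked = rank("full text search", documents)
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")
```

A document that shares more of the query's terms points in a direction closer to the query vector and so scores nearer to 1, while a document with no terms in common scores 0. Real systems typically replace the raw counts with more refined term weights, but the comparison step is the same.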

In 1984 Dr. Martin Porter, at the University of Cambridge, wrote Muscat for the Cambridge University MUSeum CATaloguing project. Over the ensuing decade this software was arguably the first to use probability theory in natural language querying, focusing on the relative value of a word – either in the search expression, or in the document being indexed. Identifying links and correlations between significant words that co-occur across the whole document collection creates a probabilistic model of concepts. Using a probabilistic approach to determining relevance dates back to research undertaken at the RAND Corporation in the late 1950s, and by the late 1980s there was a substantial amount of research into the use of Bayesian probability models for information retrieval.

The history of Autonomy dates back to the formation in 1991 of Cambridge Neurodynamics by Dr. Mike Lynch. Cambridge Neurodynamics used neural network and pattern recognition approaches to fingerprint recognition. In 1996 Dr. Lynch founded Autonomy together with Richard Gaunt with $15 million in funding from investors including Apax Venture Capital, Durlacher and the English National Investment Company (ENIC). The novel step was not just the use of Bayesian statistics but the combination of these statistical approaches with non-linear adaptive signal processing (used by Cambridge Neurodynamics for analysing fingerprint images), applied to text. For that time the level of investment in a company with no commercial track record was quite remarkable. In 1998 the company was floated on EASDAQ, which capitalised the company at around $150 million, and its shares rose quickly from $15 in October 1999 to $120 in March 2000. This valued the company at over $5 billion.

The company was floated on the London Stock Exchange in 2000, and became the only publicly-quoted search company in the world. This was important for procurements in both the corporate and public sector given that all other search companies remain privately held and do not disclose earnings and profits other than under a non-disclosure agreement with a prospective customer.

Latent Semantic Indexing dates from the late 1980s and Probabilistic Latent Semantic Indexing from the late 1990s. Among other features, both provide solutions to the issues raised by different words having the same meaning (synonymy) and the same word having different meanings (polysemy).

A big thanks to Martin for this information, and for bringing to my attention the names of Gerard Salton and Douglas Engelbart. I recommend that you click on the links below and read more about the fascinating work that these two have done.

I also highly recommend that you check out Intranet Focus’s site, and read some of the great stuff there.

Recommended Reading
  • Gerard Salton (Wikipedia)
  • Douglas Engelbart Institute (website)
  • Intranet Focus (website)
  • Martin White (Google hits)
  • Enterprise Search: Enhancing Business Performance.