31 January, 2013
The good news about the Internet and its most visible component, the World Wide Web, is that there are hundreds of millions of pages available, waiting to present information on an amazing variety of topics. The bad news about the Internet is that there are hundreds of millions of pages available, most of them titled according to the whim of their author, almost all of them sitting on servers with cryptic names. When you need to know about a particular subject, how do you know which pages to read? If you’re like most people, you visit an Internet search engine.
Internet search engines are special sites on the Web that are designed to help people find information stored on other sites. There are differences in the ways various search engines work, but they all perform three basic tasks: they search the Internet, or select pieces of it, based on important words; they keep an index of the words they find and where they find them; and they allow users to look for words or combinations of words found in that index.
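To make the second and third tasks concrete, here is a minimal sketch in Python of an index that records each word and where it was found, plus a lookup for combinations of words. The page URLs, the page text, and the function names `build_index` and `lookup` are invented for illustration; a real search engine's index is far larger and more elaborate.

```python
import re
from collections import defaultdict

# Hypothetical pages standing in for documents a search engine has found.
PAGES = {
    "http://example.com/a": "Search engines index the Web",
    "http://example.com/b": "Spiders crawl the Web to build an index",
}

def build_index(pages):
    """Map each word to the (url, position) pairs where it occurs."""
    index = defaultdict(list)
    for url, text in pages.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word].append((url, position))
    return index

def lookup(index, *words):
    """Return the URLs that contain every one of the given words."""
    url_sets = [{url for url, _ in index.get(w.lower(), [])} for w in words]
    return sorted(set.intersection(*url_sets)) if url_sets else []

index = build_index(PAGES)
print(lookup(index, "web", "index"))  # both pages contain both words
print(lookup(index, "spiders"))       # only the second page
```

Storing the position alongside the URL is what "where they find them" amounts to in this sketch; it is not used by `lookup` here, but it is the kind of detail that would let an engine match phrases rather than isolated words.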
Early search engines held an index of a few hundred thousand pages and documents, and received maybe one or two thousand inquiries each day. Today, a top search engine will index hundreds of millions of pages, and respond to tens of millions of queries per day. In this article, we’ll tell you how these major tasks are performed, and how Internet search engines put the pieces together in order to let you find the information you need on the Web.
A search engine is a searchable database of Internet files collected by a computer program, called a crawler, robot, worm, or spider. An index is built from the collected files, using fields such as title, full text, date last modified, and language. Results are ranked by relevance, and the ranking method varies among search engines.
In essence, a search engine consists of three components:
• Spider: Program that traverses the Web from link to link, identifying and reading pages
• Index: Database containing a copy of each Web page or other file gathered by the spider
• Search and retrieval mechanism: Technology that enables you to search the index and that returns results in a relevancy-ranked order
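The three components above can be sketched as a toy program. This is only an illustration under stated assumptions: the `WEB` dictionary is a made-up, in-memory stand-in for real sites, the names `crawl` and `search` are hypothetical, and ranking by counting term occurrences is a deliberately simple substitute for the relevance methods real engines use.

```python
from collections import deque

# Hypothetical miniature "Web": each URL maps to (page text, outgoing links).
WEB = {
    "site/home": ("welcome to the search engine guide",
                  ["site/spiders", "site/index"]),
    "site/spiders": ("a spider follows links and reads pages",
                     ["site/index"]),
    "site/index": ("the index stores pages so a search can find pages",
                   ["site/home"]),
}

def crawl(start):
    """Spider: follow links breadth-first, keeping a copy of each page read."""
    copies, queue = {}, deque([start])
    while queue:
        url = queue.popleft()
        if url in copies or url not in WEB:
            continue  # skip pages already visited or not reachable
        text, links = WEB[url]
        copies[url] = text   # the index keeps a copy of the page
        queue.extend(links)  # traverse the Web from link to link
    return copies

def search(copies, query):
    """Retrieval: rank pages by how often the query terms occur in them."""
    terms = query.lower().split()
    scores = {}
    for url, text in copies.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score:
            scores[url] = score
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))

index = crawl("site/home")
print(search(index, "pages"))  # "site/index" mentions "pages" twice, so it ranks first
```

The visited check in `crawl` matters because pages link back to one another; without it, the spider would loop forever between `site/home` and `site/index`.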