Searching for Facts in All the Wrong Places

DARREL RAYMOND
Consultant, The Gateway Group
Waterloo, Ontario
Canada

With the explosion of information on the World Wide Web and corporate intranets, the need to search for information has never been more important. What’s more, searching is increasingly an activity for technical professionals of every stripe, not just researchers and librarians.

Not surprisingly, each search technology has its pros and cons. Web search engines and product-data-management packages may label simple search techniques (such as keyword search) with more impressive monikers (such as concept searching). Familiarity with the differences among these methods helps, if nothing else, to avoid frustration during searches.

Some search technologies are useful for expanding the set of documents in the solution; some are useful for restricting it. No one search technology is appropriate for every need.

More complex technologies are not necessarily better. Natural language understanding and concept searching can potentially simplify a user’s life, but both promise more than they often deliver. In addition, users of these methods don’t have a simple, clear model of how the system is finding documents, and so can’t tell if the system is operating correctly or whether it is indeed retrieving all relevant documents.

Similarly, full-text and phrase-searching systems might appear to be best, because “they index everything, and it seems that you can’t do better than that.” But text systems index only the text of the document, ignoring additional information that might be provided by categorization. Full-text searching also places a burden on users, who must think of all possible word and phrase variants to locate all the relevant documents.

Keyword-based systems are simple and cheap, but to work well require a consistent indexing strategy. This may involve the use of humans to categorize documents, or may require authors to provide keywords, descriptive titles, and good abstracts.

Traditional databases are highly structured and focus on numbers, which have an exact meaning. Document-searching technologies, on the other hand, must focus on words. But words derive much of their power from being elusive, ambiguous, and open to reinterpretation. It should not be surprising, then, that document searching is inherently an incomplete and approximate enterprise.

Basics of Searching
All document-search techniques have the same basic structure. A user specifies a search query, a description of an ideal document that would satisfy the need for information. The database contains document descriptors for each indexed document; the actual search compares the query description against the document descriptors, collecting those that match.

Searching technologies differ in three main ways: the kind of information used to describe documents; the rules that decide when a query description matches a document descriptor; and the speed with which matching and updating of the database can take place.
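
The basic structure described above can be sketched in a few lines of Python. This is only an illustration: the descriptors, the document names, and the helper names (search, contains_all) are made up for the example, and the matching rule is passed in as a parameter because that rule is exactly where search technologies differ.

```python
# Hypothetical document descriptors: each document is described by a set of words.
DESCRIPTORS = {
    "hamlet": {"denmark", "uncle", "ghost", "revenge"},
    "macbeth": {"scotland", "witches", "murder"},
    "othello": {"venice", "jealousy", "murder"},
}

def search(query, descriptors, matches):
    """Compare the query against every descriptor and collect those that match."""
    return sorted(doc for doc, desc in descriptors.items() if matches(query, desc))

def contains_all(query, descriptor):
    """One possible matching rule: every query word appears in the descriptor."""
    return set(query) <= descriptor

print(search(["murder"], DESCRIPTORS, contains_all))  # ['macbeth', 'othello']
```

This sketch scans every descriptor on every query. The third difference noted above, speed, is largely a matter of replacing that scan with an index.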

Keyword Searching
In keyword searching, each document is described by a set of keywords, and the user enters a query that consists of keywords. The search engine records a match if a document descriptor contains these words.

Keyword systems work best when the user specifies words that are highly selective — they occur infrequently in the whole collection of documents, but occur frequently in the documents the user is interested in. Words of low selectivity are often called stop words. Examples include “is,” “to,” “and,” “the,” or other frequently used words. Even fairly specific terms can be stop words. For example, in the documents of a steel company, “steel” would be a stop word because it shows up frequently (and hence has low selectivity).
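
One rough way to see selectivity is to count how many documents contain each word. The sketch below does this for a tiny made-up collection from a steel company. The 80 percent cutoff is an arbitrary assumption, not a standard value, and the function names are invented for the example.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each word, how many documents contain it at least once."""
    counts = Counter()
    for text in docs:
        counts.update(set(text.lower().split()))
    return counts

def stop_words(docs, threshold=0.8):
    """Words found in at least `threshold` of the documents (an assumed cutoff)."""
    df = document_frequency(docs)
    return sorted(w for w, n in df.items() if n / len(docs) >= threshold)

docs = [
    "the steel mill rolls sheet steel",
    "the price of steel rose",
    "the furnace melts scrap steel",
    "the union and the steel company signed",
]
print(stop_words(docs))  # ['steel', 'the']
```

In this collection "steel" is as unselective as "the", which is the point of the steel company example.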

Some keyword-based systems employ a restricted or controlled vocabulary; they describe documents using words from that vocabulary, and users can consult the vocabulary to find words for querying. Other keyword systems approximate this by using only words found in document titles and abstracts.

Concept searching is an enhancement of keyword searching. Concept searching engines use a thesaurus to expand the set of search terms the user provides, trying to find more potentially relevant documents. Some concept searching engines also make use of morphological or grammatical knowledge to search for plurals and grammatical variants.

Concept searching is useful for situations in which your information needs are somewhat vague, or when you have run out of ideas for search terms. The basic problem with concept searching is that people generally have different interpretations of a given concept, and their interpretations change depending on their information need.
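
As a rough illustration of thesaurus expansion, the sketch below widens a list of search terms using a tiny hypothetical thesaurus and a deliberately crude plural rule. Neither stands in for what any real concept-searching product does; the names THESAURUS, naive_variants, and expand are invented for the example.

```python
# A tiny hypothetical thesaurus; a real one would be far larger.
THESAURUS = {
    "car": {"automobile", "vehicle"},
    "corruption": {"bribery", "graft"},
}

def naive_variants(word):
    """Very crude plural handling: add or strip a trailing 's'."""
    return {word, word[:-1]} if word.endswith("s") else {word, word + "s"}

def expand(terms):
    """Add thesaurus entries and plural variants to the user's search terms."""
    expanded = set()
    for term in terms:
        for related in {term} | THESAURUS.get(term, set()):
            expanded |= naive_variants(related)
    return expanded

print(sorted(expand(["car"])))
```

Whether "vehicle" belongs in a search for "car" depends on who is asking and why, which is the interpretation problem noted above.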

As a searching technology, keyword-based retrieval is relatively well understood and can be efficient. It is not hard to update a keyword-based index. Keyword-based searching is effective if each document descriptor has enough keywords.

Boolean Searching
Some keyword searching systems permit Boolean searching, which lets users require that several words all be present, offer some words as alternatives, or specify that some words must not appear in the document descriptor. The three Boolean operators are AND, OR, and NOT. As an example, “bodkin AND uncle AND Denmark” might select the play Hamlet, while “bodkin OR uncle” would also find other plays that have uncles in them.

Boolean searching seems simple, but people often misuse it. Statistical analysis shows that AND is too powerful at reducing matches. Human factors research shows that most people cannot properly pose a query involving NOT. OR is relatively safe to use. Another problem with Boolean querying is that it can be relatively complicated. This is particularly so for posing a Boolean query that searches for documents matching only a subset of keywords.
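
The three operators map naturally onto set operations, as the sketch below shows. The index contents are made up for illustration. The helper at_least is also an assumption of this example: it shows one way to ask for documents matching any k of n keywords, and how many AND clauses that takes.

```python
from itertools import combinations

# Hypothetical keyword index: word -> set of plays whose descriptor contains it.
INDEX = {
    "bodkin": {"Hamlet"},
    "uncle": {"Hamlet", "Richard III", "King John"},
    "denmark": {"Hamlet"},
}
ALL_DOCS = {"Hamlet", "Richard III", "King John", "Macbeth"}

def docs(word):
    return INDEX.get(word, set())

and_result = docs("bodkin") & docs("uncle") & docs("denmark")  # AND narrows
or_result = docs("bodkin") | docs("uncle")                     # OR widens
not_result = ALL_DOCS - docs("uncle")                          # NOT excludes

def at_least(k, words):
    """Documents matching any k of the words: an OR over every k-way AND."""
    result = set()
    for combo in combinations(words, k):
        result |= set.intersection(*(docs(w) for w in combo))
    return result

print(at_least(2, ["bodkin", "uncle", "denmark"]))  # {'Hamlet'}
```

Even with only three keywords, "any two of three" expands to three AND clauses joined by OR. The number of clauses grows quickly as keywords are added, which is one reason such queries are hard to pose by hand.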

Weighted Searching
Weighted searching is the main alternative to Boolean searching. Instead of specifying that a document contain “Hamlet AND uncle AND bodkin,” you assign weights to the search terms, as in: “Hamlet 0.95, uncle 0.5, bodkin 0.78.” The search engine uses the weights to determine the relative importance of the query words.

Some search systems also weight the words used as document representatives. Weights given may be based on a word’s selectivity, its frequency in the document, or other properties.

The basic problem with weighted searching is that it’s hard to understand what the weights really mean. We know what it means for a document to contain the word “Hamlet.” But what does it mean for it to contain “Hamlet” 0.75? Furthermore, it is possible that varying the weights only slightly may lead to a completely different solution.
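
A minimal sketch of one possible scoring rule follows: a document’s score is the sum of the weights of the query words it contains. Real systems use many other formulas; this one, the descriptors, and the function names are assumptions chosen to show how sensitive a ranking can be to the weights.

```python
def score(query_weights, descriptor):
    """Sum the weights of the query words that appear in the descriptor."""
    return sum(w for word, w in query_weights.items() if word in descriptor)

def ranked(query_weights, descriptors):
    """Return document names with a nonzero score, best first."""
    scored = [(score(query_weights, d), name) for name, d in descriptors.items()]
    return [name for s, name in sorted(scored, reverse=True) if s > 0]

# Hypothetical descriptors for three documents.
DESCRIPTORS = {
    "A": {"hamlet", "uncle"},
    "B": {"bodkin", "uncle"},
    "C": {"ghost"},
}

print(ranked({"hamlet": 0.80, "uncle": 0.5, "bodkin": 0.78}, DESCRIPTORS))  # ['A', 'B']
print(ranked({"hamlet": 0.75, "uncle": 0.5, "bodkin": 0.78}, DESCRIPTORS))  # ['B', 'A']
```

Lowering one weight from 0.80 to 0.75 is enough to swap the top two documents in this example.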

Similarity-based searching and fuzzy match retrieval are variants of weighted searching.

Full-text Searching
A keyword-based system indexes a few words or representatives of a document. A full-text retrieval system indexes the whole text. An important virtue of a full-text index is that you need not worry about what words to search on — the whole document is indexed. Another advantage is that a full-text system may index many fragments of text that a keyword-based system would not (such as numbers, dates, prices, and punctuation).

Full-text systems have disadvantages. Indexing is slower, because they must process significantly more text. There is generally a need for file format converters to extract text from different formats. The size of the index is large, maybe even larger than the documents themselves. From the standpoint of maintenance, the updating of the index is often a costly activity as well.
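
A full-text index is commonly built as an inverted index that maps every token to the documents containing it. The sketch below assumes plain-text input (so no format converters) and, unlike some full-text systems, drops punctuation; the sample documents and function names are invented.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase words and numbers; punctuation is dropped in this sketch."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map every token in every document to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

docs = {
    1: "Invoice 1997: steel beams, 40 units",
    2: "Steel prices rose in 1997",
}
index = build_index(docs)
print(sorted(index["1997"]))  # [1, 2]
print(len(index))             # one entry per distinct token in the collection
```

Because every distinct token gets an entry, the index grows with the text itself, which is where the size and update costs mentioned above come from.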

Some full-text systems support phrase searching, where you can look for phrases in addition to individual words. Besides searching for “Hamlet” or “bodkin,” for example, you could also search for the phrase “shuffled off this mortal coil.”

The basic advantage of phrase searching is its greater degree of selectivity. Many words that are not particularly selective by themselves become extremely selective when combined as a phrase. The basic problem with phrase searching is that it is even more restrictive than a Boolean AND query.

Phrase searching is complicated to implement because phrases overlap in a text, whereas words do not. Phrase-searching indexes are generally larger than full-text indexes. Phrase searching engines cannot discard stop words, because these words gain significance in a phrase. The phrase “to be or not to be,” for example, consists completely of stop words, but is highly significant.
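
One common way to support phrases is to record the position of every word, stop words included, and then check that the words of the phrase occur at consecutive positions. The following is a minimal sketch of that idea under simple assumptions (whitespace tokenization, made-up documents and function names), not a description of any particular product.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions]}; stop words are kept."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return ids of documents containing the words of `phrase` consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return set()
    hits = set()
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            if all(start + i in index[w].get(doc_id, ()) for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits

docs = {
    "hamlet": "to be or not to be that is the question",
    "other": "not to be or to be",
}
index = build_positional_index(docs)
print(phrase_search(index, "to be or not to be"))  # {'hamlet'}
```

Both documents contain exactly the same stop words, so a Boolean AND could not tell them apart; only the positions do. Storing those positions is also why the index is larger.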

Proximity searching is a kind of fuzzy phrase searching. A garden-variety phrase search matches exactly the words you specify, in exactly the order you give them. In contrast, a proximity search specifies one or more words that should be close to each other. An example of a proximity search is “shuffle NEAR coil.” Some proximity searching systems let you specify the width of the range in characters or words.

Proximity searching is a bit like a weighted search on word positions. Searching for “government corruption” (a phrase search) will retrieve some documents, but searching for “government NEAR corruption” will find more. Proximity searching works well when phrases tend to have many variants, or when words of moderate selectivity tend to sit close to one another. Unfortunately, proximity searching is expensive in computer time.
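
A NEAR operator can be sketched as a comparison of word positions. The function name near and the default width of five words are assumptions of this example; actual systems define the range in their own ways.

```python
def near(text, word_a, word_b, width=5):
    """True if word_a and word_b occur within `width` words of each other, in either order."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == word_a]
    pos_b = [i for i, w in enumerate(words) if w == word_b]
    return any(abs(a - b) <= width for a in pos_a for b in pos_b)

texts = [
    "government corruption was alleged",
    "corruption in the provincial government",
    "the government met monday and in unrelated news police are investigating corruption",
]
for text in texts:
    is_phrase = "government corruption" in text
    print(is_phrase, near(text, "government", "corruption"))
# True True
# False True
# False False
```

The second text is the kind of variant a phrase search misses and a proximity search finds. The nested comparison of positions also hints at why the technique costs more computer time.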

Range Searching
Range searching is possible when documents are represented by values that can be ordered. An example is “documents published between January and June of last year.” Any values chosen from an ordered domain — including time, money, revisions, and dimensions — can be the subject of a range search. Range searching is generally expensive to implement in document managers, though it is a staple of relational database systems. In addition, proximity searching can be implemented as a kind of range search.
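
When the values are kept in sorted order, a range query reduces to two binary searches. The sketch below uses Python’s standard bisect module on a small list of publication dates; the document ids and dates are made up.

```python
from bisect import bisect_left, bisect_right
from datetime import date

# Hypothetical (publication date, document id) pairs, kept sorted by date.
DOCS = sorted([
    (date(1997, 2, 14), "spec-12"),
    (date(1997, 5, 30), "memo-7"),
    (date(1997, 9, 1), "report-3"),
    (date(1996, 11, 20), "spec-11"),
])
DATES = [d for d, _ in DOCS]

def range_search(low, high):
    """Return ids of documents with low <= date <= high, using binary search."""
    start = bisect_left(DATES, low)
    end = bisect_right(DATES, high)
    return [doc_id for _, doc_id in DOCS[start:end]]

print(range_search(date(1997, 1, 1), date(1997, 6, 30)))  # ['spec-12', 'memo-7']
```

The search itself is cheap; the expense lies in keeping such an ordered structure for every attribute a user might want to range over.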

Bibliographic Coupling
Two documents are said to be bibliographically coupled if both reference the same third document. The earliest use of bibliographic coupling was for indexing academic papers through their bibliographies, hence the name. In scientific fields, the fact that two publications jointly reference a third is evidence that both are related by academic “pedigree.”

Bibliographic coupling is uncommon in document management systems, but is beginning to appear in systems for searching the Web.

Some research tools for searching the Web also offer cocitation, a companion to bibliographic coupling. Two documents are considered related by cocitation if a third document references both of them. As with bibliographic coupling, cocitation is an information retrieval technique originating with academic papers and bibliographies. The basic idea is that if two papers are referenced from a third, there is some evidence that the two are related (otherwise, the author would not have referenced them both).
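
Both link-based relations come down to set operations on a table of which document references which. In the sketch below the table is made up, and the two function names are deliberately neutral: one finds the third documents that a pair both reference, the other finds the third documents that reference both members of the pair.

```python
# Hypothetical link table: document -> set of documents it references.
REFERENCES = {
    "A": {"X", "Y"},
    "B": {"Y", "Z"},
    "C": {"A", "B"},
    "D": {"A"},
}

def shared_references(a, b):
    """Third documents that both a and b reference."""
    return REFERENCES.get(a, set()) & REFERENCES.get(b, set())

def shared_citers(a, b):
    """Third documents that reference both a and b."""
    return {d for d, refs in REFERENCES.items() if a in refs and b in refs}

print(shared_references("A", "B"))  # {'Y'}
print(shared_citers("A", "B"))      # {'C'}
```

A nonempty result from either function is treated as evidence that the pair is related; the size of the set could serve as a rough measure of how strongly.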

Relevance Feedback
One recurring problem with search systems is in getting users to specify the right words or other document representatives. Relevance feedback tries to address this by using documents themselves as queries.

Relevance feedback is sometimes called “query by example.” Users browse a database until they find one or more documents that seem appropriate. Then they pose a query to the system that says, in effect, “find more documents like this.” The search engine then extracts from the document its keywords, title, full text, or other representatives, and treats these as the input for more searching. Systems based on relevance feedback may use Boolean, weighted, full-text, proximity, or range searching.

The main advantage of relevance feedback is simplicity for users. The main disadvantage is that the search mechanism is a black box; you really have no idea how the system picks documents similar to the one you gave as an example. Consequently, relevance feedback techniques are useful mainly when you are out of ideas for locating more relevant information.
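
To make the black box a little less opaque, here is one very simple way a “find more like this” query could work: take the most frequent non-stop words of the example document as its representatives, then rank the other documents by how many of those words they contain. The stop-word list, the choice of three keywords, and the function names are all assumptions of this sketch.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is", "from"}  # assumed list

def keywords(text, n=3):
    """Take the n most frequent non-stop words as the document's representatives."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return {w for w, _ in Counter(words).most_common(n)}

def more_like_this(example_id, docs, n=3):
    """Rank the other documents by how many of the example's keywords they contain."""
    query = keywords(docs[example_id], n)
    scores = {
        doc_id: len(query & set(text.lower().split()))
        for doc_id, text in docs.items()
        if doc_id != example_id
    }
    return sorted((d for d, s in scores.items() if s > 0), key=lambda d: -scores[d])

docs = {
    "d1": "steel furnace steel slag furnace steel slag",
    "d2": "slag from the furnace",
    "d3": "steel prices",
    "d4": "quarterly sales report",
}
print(more_like_this("d1", docs))  # ['d2', 'd3']
```

The user sees only the ranked list, not the extracted keywords, which is why the results can be hard to second-guess.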

Natural Language
The goal of natural language retrieval is the ability to pose questions in natural language to a computer, just as one would pose them to a human (such as a researcher or librarian). The obvious virtue of the scheme is a completely natural user interface. The difficulty is that the riddle of fully understanding natural language remains largely unsolved. Simple ambiguities confuse computers, and relatively little progress has taken place on understanding language.

Most systems claiming to do natural language searches are essentially extracting keywords out of a natural language query, then using these for a weighted search.
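
The keyword-extraction approach just described can be sketched as follows. The stop-word list and the equal weighting are assumptions made for the example; an actual system would likely weight the surviving words by selectivity or some other property.

```python
import re

# Assumed list of question words and other low-selectivity words.
STOP_WORDS = {"what", "which", "who", "is", "are", "the", "a", "an", "of",
              "in", "about", "do", "does", "have", "we", "on", "to", "any"}

def query_to_weights(question):
    """Strip stop words from a question and give the remaining words equal weight."""
    words = [w for w in re.findall(r"[a-z]+", question.lower()) if w not in STOP_WORDS]
    unique = list(dict.fromkeys(words))  # drop repeats, keep first-seen order
    return {w: 1.0 / len(unique) for w in unique} if unique else {}

print(sorted(query_to_weights("Which plays have an uncle in Denmark?")))
# ['denmark', 'plays', 'uncle']
```

Nothing in this sketch understands the question. Grammar and word order are discarded, and what remains is an ordinary weighted keyword query.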