Precision Search

Which would you prefer in a search result:

For people who value their time, obviously the second option is preferable. When searching for exact phrases or single words, most search engines deliver the precision that we crave. If only that were so for the more common multi-word searches!

Precision search describes a design in which most false hits are filtered out by the computer, not by the person searching.

Precision in all searches would benefit everyone. Here are four keys to precision:

Design search engines to handle smaller units of search.

The larger the number of words in a text selection, the higher the probability that a given word will occur.

Here is an experiment. On a major Internet search engine, search for three words such as Shakespeare, light, and shade. Examine the list of the first 10 hits, and note the sizes of the records. Here are the sizes that I found (measured in kilobytes): 494, 105, 10, 30, 38, 22, 28, 49, 13, and 100 ... average 89 kilobytes per record. The first (and presumably best) hit contained 11,274 words. It was a list of music artists. Whether it included "kitchen sink", I don't know. But it had just about everything else. Of course, the first hit had nothing whatsoever to do with Shakespeare's theme about light and dark. Other hits included a list of events at a park, a dictionary entry (without the word Shakespeare anywhere in view), a concordance index, and a book catalogue. Lots of words. All but one in the first ten were false hits.
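To put a rough number on this effect, here is a minimal sketch. It assumes, purely for illustration, that every word position in a record has the same small, independent chance p of being a given word. Real text is not that simple, and the value of p below is hypothetical. Under that assumption, the chance that the word turns up at least once in a record of n words is 1 - (1 - p)^n.

```python
def chance_word_occurs(n_words: int, p: float) -> float:
    """Chance that a given word appears at least once in a record.

    Simplifying assumption (not from the article): each of the n_words
    positions independently matches the word with probability p.
    """
    return 1.0 - (1.0 - p) ** n_words


# Hypothetical frequency: the word makes up 1 in 10,000 words of running text.
p = 1e-4

# Compare a paragraph-sized record, a short page, and the 11,274-word page
# from the experiment.
for n in (100, 1_000, 11_274):
    print(f"{n:>6} words: chance of an accidental match = {chance_word_occurs(n, p):.3f}")
```

Under this simplified model, the chance of an accidental match climbs steadily with record size, which is consistent with the long pages that dominated the experiment's results.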

Designating entire web pages as records (units of search, potential hits) is a sure guarantee of false hits: records in which the selected words do appear (with some search engines, make that "may" appear), but far removed from each other and totally unrelated to each other. False hits are the bane of the major search engines; their results lists are bloated with all manner of irrelevant content. It is left to you, the searcher, to filter out all the garbage hits.

A web page is too large a target for search.

The solution to the problem of false hits is to move to smaller, more meaningful units. In normal writing, the paragraph is a unit of meaning. In Words Close Together technology we use either a paragraph or a series of short paragraphs as the basic unit of search. The effect is that the program automatically weeds out many records that would otherwise show up as false hits.
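As an illustration of the idea (a sketch only, not the Words Close Together implementation), the following Python treats each paragraph as a record and reports only the paragraphs that contain every requested word. The function names, the tokenizing rule, and the use of blank lines to find paragraph boundaries are all assumptions made for this example.

```python
import re


def split_into_paragraphs(text: str) -> list[str]:
    """Treat each blank-line-separated chunk as one unit of search."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def words_of(unit: str) -> set[str]:
    """Lowercased set of the words in a unit (hypothetical tokenizing rule)."""
    return set(re.findall(r"[a-z0-9']+", unit.lower()))


def matching_paragraphs(text: str, wanted: list[str]) -> list[str]:
    """Return only the paragraphs that contain every wanted word."""
    wanted_set = {w.lower() for w in wanted}
    return [p for p in split_into_paragraphs(text) if wanted_set <= words_of(p)]


page = (
    "Shakespeare wrote many plays.\n\n"
    "The light in the theatre was dim.\n\n"
    "Shakespeare often contrasts light and shade."
)

# The page as a whole contains all three words, but only the last paragraph
# contains them together.
for hit in matching_paragraphs(page, ["Shakespeare", "light", "shade"]):
    print(hit)
```

Searching the page as one record would also count the first two paragraphs as part of the hit; searching by paragraph leaves them out.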

Let the person who searches tighten the targets further.

The Words Close Together search engine discards all hits in which the desired words are too far apart. If over 99 words intervene between the words that you want, the words are too far apart, and it would be meaningless to offer that section of text as if it were a good search result. The ceiling is 99 intervening words. Even within a paragraph, words need to stand in some relation to each other to have meaning.

The user can raise or lower the count of intervening words. Want a very tight search, and fewer false hits? Lower the ceiling to 12 or 5 intervening words (words that you don't want that occur between the words you do want). The minimum is zero intervening words. Want more hits? Raise the ceiling anywhere up to 99 words to be allowed in between the words you want. In either the Internet or the desktop version, go to Preferences and set a value.
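The ceiling can be pictured with a short sketch. This is one possible reading of the rule, not the Marpex algorithm: find the stretch of text that holds all the wanted words with the fewest unwanted words in between, then compare that count with the ceiling. The function names and the tokenizing rule are hypothetical.

```python
import re
from typing import Optional


def min_intervening(tokens: list[str], wanted: list[str]) -> Optional[int]:
    """Fewest unwanted words inside any stretch containing every wanted word.

    Returns None when the wanted words do not all occur in the tokens.
    """
    wanted_set = set(wanted)
    best = None
    for start, first in enumerate(tokens):
        if first not in wanted_set:
            continue  # a stretch must begin on a wanted word
        seen, unwanted = set(), 0
        for tok in tokens[start:]:
            if tok in wanted_set:
                seen.add(tok)
                if seen == wanted_set:  # stretch is complete
                    if best is None or unwanted < best:
                        best = unwanted
                    break
            else:
                unwanted += 1
    return best


def is_hit(text: str, wanted: list[str], ceiling: int = 99) -> bool:
    """Keep a record only if the wanted words are close enough together."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    gap = min_intervening(tokens, [w.lower() for w in wanted])
    return gap is not None and gap <= ceiling


record = "Light and shade in Shakespeare"
terms = ["Shakespeare", "light", "shade"]
print(is_hit(record, terms, ceiling=5))  # two intervening words: "and", "in"
print(is_hit(record, terms, ceiling=0))  # zero intervening words allowed
```

Lowering the ceiling argument tightens the search; raising it toward 99 admits looser matches.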

Design search engines to stick rigorously to the rules of logic.

In an attempt to show more hits, early search engines would report as successful hits records that contained only some of the words requested. If you are searching for a combination of five words -- teen, summer, music, camp, ohio -- do you want to be shown content related to teen music in Lesotho? The five-word request is for a Boolean AND. By showing records that contain only some of the desired terms, a search engine is loading you with false hits, and disregarding the rules of search. The extra hits do not help the person searching.
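The difference between a strict Boolean AND and the looser "some of the words" behavior fits in a few lines. This is a minimal sketch with hypothetical function names and made-up records, not code from any actual search engine.

```python
def matches_all(record: str, wanted: list[str]) -> bool:
    """Strict Boolean AND: every wanted word must be present."""
    return set(wanted) <= set(record.lower().split())


def matches_some(record: str, wanted: list[str]) -> bool:
    """Loose matching: any one wanted word is enough to count as a hit."""
    return bool(set(wanted) & set(record.lower().split()))


wanted = ["teen", "summer", "music", "camp", "ohio"]
records = [
    "teen summer music camp in ohio",   # contains all five words
    "teen music festival in lesotho",   # contains only two of them
]

for record in records:
    print(record, "| AND:", matches_all(record, wanted), "| some:", matches_some(record, wanted))
```

Only the first record satisfies the AND; the loose rule reports both, and the second is exactly the kind of false hit described above.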

A more serious failing: If a word occurs nowhere on a page, but only on a page linked to that page, should the first page be reported as a successful result? Anybody who thinks that is okay needs to go back to Sesame Street and Grover's endearing lesson on "near" versus "far". A word that is found on a totally separate page is "far". Our view on search logic: For the sake of meaningful results, let's focus on words that are "near" or, to coin a phrase, "words close together".

Bloat is bad, small is good, miniature is even better.

Much software written today would be better termed bloatware. The thinking of the designers appears to be that, if using some resources is good, then pouring on many more resources is even better. That might have an element of truth for designing word processors. For designing search, the bloat mentality is a disaster.

Words Close Together technology is built upon a method of compression indexing. A WCT index is very compact -- typically 30 to 45 percent of the size of the text which it indexes. That index contains every instance of every word. The power of miniaturization is the key factor in the search engine's speed. It is why Marpex stands alone in offering high quality proximity search in very large sets of data.


Meaningful precision search for text data is available now.

words close together.com The "Research Quality" Search Engine by Marpex, Inc.