NOTES ON FREE TEXT INFORMATION RETRIEVAL

by Mark Zimmermann, March 1990

I've been thinking recently about some fundamental issues in big free-text databases ... specifically, whether it will be feasible to handle hundreds of MB/day of unstructured text (from wire services, USENET, database downloads, scanned-in magazines, etc.) and do useful research on it, the way that my FREETEXT indexer/browser programs let you work with single files of tens of megabytes. I think that the answer is "yes" --- 100 MB/day is manageable (for how long? --- a month, at least?). But it will require some (straightforward, I hope) extensions to my current indexer/browser systems.

This informal memorandum is a summary of my thoughts. It includes comments on required features of the software, the technical design decisions behind my current information retrieval (IR) codes, key problems that must be solved to handle bigger free-text databases, and some suggestions for how to implement solutions to those problems. My goal is to provide a record of where my general free text IR thoughts stand today, to help people who want to work with me on enhanced text handling tools for researchers. I'd appreciate any comments or suggestions or corrections that anybody wants to make!

REQUIREMENTS

I want real-time high-bandwidth responses to simple, common user queries of the free-text database. By "real-time" I mean that single-term retrieval requests should be answered in less than a second. (Complex Boolean proximity searches can take a bit longer to set up and execute, but even there, speed is important.) By "high-bandwidth" I mean that the human has to get information back from the computer in a useful form which can be evaluated quickly --- it's not good enough to return a list of documents or files, each of which has to be paged through in order to find the few relevant items. By "free-text" I mean that the input stream can't be assumed to include any regular structure beyond words separated by delimiters (spaces, punctuation, etc.).

I want tools for the earliest stages of research, when a person needs to be able to browse and free-associate without even knowing what are the right questions to ask. Users have to be able to work with the database without intimate knowledge of the details of what's in it (since nobody will have time to read much of a 100 MB/day flood of information). A big database in a particular area has to be a "corporate memory" for groups of scholars who are working on related topics. A new person has to be able to come in and do research without extensive training, and an experienced user has to find the system transparent, so that it becomes an extension of her memory, not a barrier to getting at the data and thinking with it. I see free-text IR tools as coexisting with other types of tools, such as structured databases which are appropriate for answering well-formulated queries at later phases of the research effort.

I know of four fundamental end-user operations that a good real-time high-bandwidth free-text IR system must support:

viewing lists of words that occur in a database;
selecting subsets of a big database to work within;
browsing through candidate items efficiently; and
reading and taking notes on retrieved information.

I'll discuss each of these four tasks below.

WORD LISTS

Viewing lists of words that occur in a database is exceedingly useful, for many reasons. An alphabetized word list makes it easy to spot many misspellings and variant suffixes for search terms. Word lists are surprisingly useful in working with mixed-language databases, particularly in spotting cognates and words that begin similarly to related native-language terms. The ability to view a word list, along with numerical information about occurrence rates in the whole database and in subsets, is valuable in formulating better queries that maximize good data retrieval. Perhaps most importantly, just being able to scroll around in a list of terms is a superb aide-memoire in thinking of words worth looking for in a database. Visible word lists are much better than blind searching, even when wild cards are permitted. It's precisely analogous to the "recognize vs. remember" dichotomy between menu-driven and command-line user interfaces.

Word lists are easy to generate, provided that one has built a complete inverted index to the database being browsed. For example:

       1    8805251042
       1    9511
    6774    A
       1    AA01693
       1    AA03935
       2    AA13961
       1    AA29900
      34    ABLE
     291    ABOUT
      29    ABOVE

Each word is preceded by the number of times it occurs in the database as a whole. Note the presence of numbers and document identifiers as well as ordinary words.

A word list doesn't need to take up much storage space as files get big, since the number of distinct words grows much slower than the file size. There are various rules of thumb and theoretical formulae, such as Zipf's Law, for estimating how big the word lists will get. In my experience, unique dates, other numbers, garbles, typos, etc. tend to keep the word lists growing more than expected, but not at an unmanageable rate.

My free text IR programs use an extremely simple file structure, which makes generating word lists nearly trivial. In the "keys" index file, I store as an array the words found in the database along with count information. Thus, a single disk access can fill a small window with an alphabetized word list along with each word's frequency of occurrence in the database as a whole. (See the discussion under BROWSING below for information on how counts are displayed when working in a subset of the database.)

In my programs, I currently truncate words rather arbitrarily after 28 characters; I keep 4 bytes of cumulative count as a long integer; I default to a standard character mapping from lower-case to upper-case; I default to break words apart at punctuation marks; and I default to alphabetizing according to the ASCII character set order. These choices work well for English-language texts in my experience, and they're easily changed if necessary by recompiling the indexing and browsing programs. (In a future version of my programs, I intend to allow at least the character mapping and alphabetization order to be defined separately for each database; see the EXTENSIONS AND ENHANCEMENTS section below.)

These design choices mean that each distinct word in a database occupies 32 bytes in the "keys" index file. It is possible to save a little space by using a more complex data structure, but it would be at the cost of speed and simplicity in the indexing and browsing programs. In fact, as databases get larger the storage used by the "keys" file becomes increasingly insignificant compared to the database and the other part of the index (the "pointers" file, discussed under SUBSETS and elsewhere below). The four-byte cumulative count element of the data structure will limit the indexed databases to 2^32 total words. That's about 4 billion words, or roughly 25 GB total file size, given a typical word length of 5 letters plus 1 delimiter. Modifications to this part of the index data structure will thus be necessary for very large free-text databases (see EXTENSIONS AND ENHANCEMENTS).

I index every word in the database, even common words such as "a" and "the". Many free-text IR systems call such terms "stop words" and omit them. The result is a saving of perhaps 20% - 40% in final index file size, depending on how long the stop word list is. The cost, however, is too great --- it becomes impossible to search for common terms in proximity to each other or as components of names, part numbers, dates, and so forth. A user can never find "To be, or not to be..." in Shakespeare, since every word in that phrase is a stop word! It also becomes impossible to look for "A. I. Smith" in a long name list, or to find many short Chinese or Japanese words at all.

I am inalterably against the use of stop words; I also oppose "stemming" or "truncation" in free-text indexing. Many conventional systems attempt to remove suffixes so that they can index various forms of a word under a single entry; for instance, "compute", "computer", "computers", "computational", and "computing" might all be merged into a single "comput" entry. This saves a small amount of space, but at the expense of simplicity and control for the user. (At times, I may need to retrieve "computers" but not "computer".) Many stemming systems also get confused easily and garble words (particularly proper nouns) --- occasionally with disastrous results when a truncated term lands on the stop word list and is left out of the index entirely! Foreign terms and oriental names are particularly vulnerable to butchery by such systems.

SUBSETS

Researchers need to be able to select subsets of a big database to work within. As information collections get very large, there will still be unique or nearly-unique terms that people will want to retrieve quickly --- the last occurrence of a particular name, a bizarre word or document identifier that only occurs once, etc. But increasingly often, there will be too many instances of an important word for convenient browsing (but see the BROWSING section below for methods to delay that point). It thus becomes necessary to filter the input stream in a fast and flexible way, so that a human- usable volume of information is left to study.

Proximity search is a common free-text IR approach to data filtering. The usual proximity criteria, however, tend to be narrow, inflexible, and (in my experience) extremely hazardous to retrieval when applied to real-world databases. A conventional form of proximity search is to enter a boolean query such as "FREE(w2)TEXT AND (INFORMATION OR RETRIEV*)" which translates into "give me any document with FREE occurring within two words before TEXT, which also contains INFORMATION or any word beginning with RETRIEV". Of course, this example can't work at all if the database isn't already divided up into "document" units. Even if the query appears to work, the user will fail to see an unknowable number of near-miss items which contain some but not all of the specified words, or which use alternative spellings (including typographical errors).

The user interface for conventional boolean proximity search is simply terrible. It doesn't provide enough feedback to let the user avoid subtle (or obvious) mistakes, it requires too much human memory and creative energy in query formulation, and it gets in the way of seeing the data. Often, an apparently-innocuous term in a parenthesized boolean search statement actually distorts the entire meaning of the query and eliminates all the useful information that was to be retrieved. Expert searchers can overcome some of the barriers that a command-line proximity search interface imposes, but it's never easy.

In my browser programs, I take a different approach. I let a user define a neighborhood of interest in a free-text collection based on loose proximity to words chosen from the full database word list (see WORD LISTS above). For reasons of speed, simplicity, and convenience, the subset proximity criterion is in bytes (characters of text), with a default resolution of 32 bytes. Thus, a searcher could ask a fuzzy equivalent of the above sample boolean query by starting out with an empty subset and then adding the neighborhood of the word "FREE". This is actually done by clicking on the word "FREE" as it is displayed in an Index View window -- a perfect opportunity to see nearby alternative search terms! The default neighborhood is "within-a-few-words", roughly plus or minus 50 bytes. At this point, an excerpt from the database subset's word list might be displayed as:

 
     5/34     ABLE
     0/291    ABOUT
     0/29     ABOVE
     0/5      ABSTRACTS
     9/41     ACCESS

(Here, the term "ABLE" occurs 34 times in the database as a whole, of
which 5 instances are in the proximity neighborhood of "FREE".)

Next, after making a similar selection of the neighborhood around the word "TEXT", the user can narrow down the subset of interest to the intersection of the "FREE" and "TEXT" subsets. (This equivalent of the boolean AND operation is again done with a mouse click.) Then, broadening the neighborhood setting to 500 bytes or so, the user can with a few more clicks define another subset of words very loosely near INFORMATION or RETRIEVAL or RETRIEVALS or RETRIEVING, etc. Intersecting that subset with the FREE TEXT subset gives the final subset of interest for interactive browsing.

My experience, and that of many users, is that my kind of fuzzy proximity search is far better than classical boolean search for finding the information that one really wants. It's faster, safer, and far more aesthetic to use as well. The mental model that the user has is simple and powerful: neighborhoods of influence that radiate out from interesting words, and the possible combinations of those neighborhoods. The proximity search interface, after a few minutes of use, becomes transparent and doesn't get in the way of thinking with the data.

My current indexer/browser programs assume that the free-text database is a single file. That assumption is straightforward to lift, and I plan to do so soon (see EXTENSIONS AND ENHANCEMENTS). But it is very convenient, for many purposes, to treat the entire database as a linear entity, and refer to byte offsets from the beginning of the database. Databases which have additional structure (sections, chapters, pages, etc.) can easily be handled within such a framework.

My current implementation of fuzzy proximity search uses vectors, linear arrays, to represent the database subsets. Each bit of a subset vector stands for a 32-byte (default value) section of free text. A bit is cleared when its section is not included in a given neighborhood, and a bit is set when its section is part of the subset. This approach of representing 32 bytes with 1 bit results in a compression factor of 256, so a gigabyte database can have a proximity neighborhood defined using a 4 MB vector. Subset operations are trivially simple with this data structure. (See the EXTENSIONS AND ENHANCEMENTS section for a discussion of what changes may be needed to handle multi-gigabyte databases.)

In order to set or clear bits in defining subsets, it is obviously necessary to have pointers to the locations of every word in the database. (It is necessary to have such pointers in any event to retrieve information quickly, of course.) The second index file that my programs use, the "pointers" file, is precisely that. In the current implementation, it simply consists of 32-bit integers giving the offset from the beginning of the database to each indexed word. Entries in the pointer file map directly to proximity subset vectors, making it easy to set and clear bits as needed to define regions of interest in the database. (See the EXTENSIONS AND ENHANCEMENTS section for more on how this model can grow to handle larger data collections.)

Each instance of a word thus requires 4 bytes of space in the "pointers" file. Since an average word is 5 letters long, plus one or more terminal spaces or other delimiters, the average "pointers" file is less than 80% of the size of the original database text file. It would certainly be possible to save some disk space by compressing the "pointers" file, but I doubt that more than half of the original file size could be recovered. There would also be a price to pay, in terms of slower retrieval, more severe database size limitations, and increasingly complex (and less reliable) software. I thus chose to keep the "pointers" file structure as simple as possible.

Many conventional free-text IR systems, I suspect, generate in-memory lists in order to do proximity search operations. They tend to fail when looking for common words in conjunction with each other, probably when their data structures overflow. (I don't know for sure, as their algorithms are proprietary and unpublished; I deduce their design by observing their common fail cases.) I much prefer my subset vector approach. It is simpler, it has guaranteed good run time behavior, and it never breaks down when working with dense database subsets.

Conventional free-text IR systems also (I suspect) keep lists of sentence and paragraph and document delimiters, in order to handle some types of proximity search requests. I have never found it necessary (or even desirable) to pay much attention to rigidly-defined boundaries such as sentences or paragraphs in actually performing proximity searches. The fuzzier proximity search that I have implemented is, in actual experience, much safer and more likely to find the key data items that are being sought. A user of my system will never lose a good piece of information due to an unfortunate sentence division or paragraphing choice by an author. As a secondary benefit, there is no need to pre-process items flowing into the database in order to insert paragraph, section, or document boundary markers before indexing.

BROWSING

The key to effective use of a free-text database is the ability to browse rapidly through many items in order to find the few that are of greatest interest. A good free-text IR system has to provide lists of indexed words (see WORD LISTS above) and it has to fetch the full text of documents rapidly upon demand (see READING below). But a great free-text IR system must also provide facilities to bridge the gap between word lists and full text retrieval. I do this in my programs by applying an old concept, the "key word in context" display, in a new way which I call a "context view".

A "key word in context" listing, sometimes called a KWIC, is a set of lines extracted from a database which show the instances of a word's occurrence, each with some context on either side. The "key word" is lined up in a column down the middle of the page. Static, printed KWIC indices have been around for many years, and are better than no index at all, but they don't provide fast, transparent access to data. A user still has to track down the actual reference in order to get back to the full text of the item, and she still has no way to look at a subset of the original database.

My context views are key-word-in-context displays which are generated on the fly, at run time, in response to user requests. They look like a common KWIC -- for example, a context view of six occurrences of "Macintosh" in a small file might reveal:

  rogram, without the Macintosh user interface features.  Ho
  her features of the Macintosh Browser:   - easy access to
  megabytes/hour on a Macintosh Plus, and over 60 megabytes/
   [75066,2044].  The Macintosh Browser program is available
  O-MAC> and the MAUG Macintosh Users' Forum on CompuServe.
  send me a formatted Macintosh disk and a self-addressed st

Tabs, carriage returns, line feeds, etc. are all suppressed in a context
view, to allow the user to see the maximum amount of information on each
line.

A person can scroll around in a dynamic context view in a window on the computer's screen, and quickly eliminate many of the uninteresting usages of a word. When an item looks promising, a mouse click immediately calls up the actual text from the original document for reading and possible further use. (See READING below for further comments.) When working in a proximity subset of the whole database, the context view only shows the lines surrounding word instances in the subset of interest.

My experience indicates that a human can comfortably browse through hundreds of lines in a context view in order to locate a few good ones to retrieve. It's at least an order of magnitude easier (and more pleasant!) than wading through full text documents as they are conventionally retrieved by IR systems.

The current implementation of context views in my FREE TEXT indexer/browser programs is straightforward. For each line of context to be displayed, a single disk access is needed to fetch the characters from the appropriate point in the database file. (Since the entries in the "pointers" file are already alphabetized and sorted together, one access to that file provides many windows worth of context view pointers; the cost of that access is thus negligible compared to the fetching that has to be done from the text database file.) With some operating-system caching of disk sectors as they are used, scrolling of a context view window is reasonably smooth.

In order to guarantee a fast response to user requests when working in a subset of the entire database, I currently apply a couple of simple tricks. In generating the word list, for example, I give exact counts of the number of words in a subset only if that word occurs less than 100 times (default value). Thus, for example, an entry in a word list might be:

 
    ~11%      A

I take a sample of the occurrences of "A" in the database and report
back a percentage estimate of the fraction in the subset.  (To be
precise, I default to counting the last 100 instances in the database.)

In generating a context view, when more than 100 (default value) instances of a word are skipped over because they fall outside the subset of interest, I display an indication of that as a line of dots in the context window; for example:

  ..........................................................
  ..........................................................
  megabytes/hour on a Macintosh Plus, and over 60 megabytes/
  ..........................................................
 
If I didn't use this tactic, a user might try to retrieve in an empty
subset and have to wait for the entire database to be scanned before a
"no valid instances" message could come back. I find the quicker
response of my approach superior, and I also find it useful to see an
indication (by lines of dots) when I'm working in a "desert", an
almost-empty subset.

READING

The bottom line for a free-text IR system is to get a desired document (or excerpt) in front of the user, so that it can be read and worked with. It goes without saying that a good IR system must make it easy to scroll around in a database, copy out selections, and paste them into working papers, other programs, or anyplace else that the user desires. Handling this interprocess communication is a job for the operating system, but the IR program must be at least open enough to make it very easy to export data to other programs. If the original document in the database has some structure (and thus isn't purely "free text") it may be valuable to reflect that structure in the retrieved data. Chapter or section titles can be displayed, for instance, or if there are graphics or font style changes they can be shown properly. The ultimate goal is to have the original document (or a high-resolution image of it) appear in front of the user, open to the right page, with a highlight to draw attention to the key part. That's unfortunately not easy to do today!

My current systems go only partway toward meeting this goal. I display text and highlight the index term that was the original basis for the retrieval. But I only extract a small chunk (a few thousand characters, by default) of the text document and let the user scroll around in that. It's hard to do more and still maintain the speed and simplicity of the software. I don't handle non-textual elements of the database at all, and I only display a single font.

Beyond mere retrieval, an ideal free-text IR system would make it easy to annotate retrieved information, add valuable comments and leave a trail behind so that other explorers (or the same explorer, at a later date) could profit from previous experience. Hypertextual links between documents should be easy to create (as one type of annotation, perhaps) and should be easy to see and follow. They would then provide another set of paths through the data space. All of these topics are under active research today, and perhaps within a few years robust, transportable systems to do such jobs will be available. In the meantime, however, my top priority is to do a better job at the fundamental task -- fast, easy, open, good real-time high-bandwidth free-text information retrieval on big databases.

EXTENSIONS AND ENHANCEMENTS

As mentioned repeatedly above, some significant yet straightforward extensions to my current free text IR systems are necessary in order to properly handle 100 MB/day of data. Here, I will briefly sketch out how I plan to attack the key problems during the coming months. I assume that that ample physical storage is available to hold the influx of information online, in a form which allows access to any item in a fraction of a second.

My systems have to be modified to handle multiple text files as a single database. I propose to do this by adding a third index file to the "keys" and "pointers" files -- a "filelist" index file which will simply contain a list of the database document files along with their lengths. The structure of the "keys" and "pointers" files will remain unchanged (which should maximize compatibility with earlier index programs and minimize the number of new bugs introduced in this step). The index building programs will treat each of the documents in the "filelist" file as part of a single big document for indexing purposes, and the index browsing programs will consult the "filelist" in order to know where to go to retrieve lines of context or chunks of full text for display.

A drawback of this multifile "virtual merge" approach is that it will at times be necessary to open and close files during browsing operations. I have not yet run tests, and do not know what penalties the operating system will impose (one hopes only a few milliseconds?) every time a file is opened and closed. With the use of modern operating system RAM caches, I hope that speed will not be a problem. During typical browsing operations, I believe that most database references will be predictable and localized, so caching should help average performance a lot.

Another extension which I plan to implement is to add facilities to rapidly merge separate index files upon user demand. Merging already-sorted index files is a very fast operation which should be limited only by disk I/O rates. It will then be possible to keep indices for each day's (or hour's, or whatever) collection of data separate until the time at which a user wants to browse a chosen set of files. The delay to merge the selected separate indices will be about equal to the time required for a single sequential scan through the chosen database. After that start-up delay, searches will progress at the normal full speed (sub-second response time for simple queries). Many data collections which are commonly referred to as a unit can have their merged index files kept online to avoid any search delays.

My current browser programs can already run on one computer while searching files resident elsewhere on a network. The generic UNIX version of my browser can also already be used as a "server" process and run on one host, sending back only the (relatively) small amounts of retrieved information that the user wants to see on a local machine. I plan to rewrite some parts of my browser programs to make their use as servers simpler and more efficient; this rewrite will take place along with the other revisions to introduce new features.

Index building itself should not be a computationally infeasible operation at a 100 MB/day data rate. My indexer programs already run at 15-20 MB/hour on a 16 MHz 68030 (Mac IIcx), and I have had reports of 60 MB/hour or better performance on faster machines with more memory and higher performance disk drives. I also believe that there is room for a 20% - 50% speed improvement in my indexing algorithms, by applying some of the standard quicksort enhancements discussed in many textbooks. For storing the index files, a simple and obvious modification that I plan to make is to give the user complete freedom to put databases and index files in any directory, on any volume (online storage device) that is desired. This will allow archival databases to reside on high-density optical read-only media, while indices can be built on faster magnetic media and can be moved to the archive when appropriate.

To handle databases larger than about 4 GB (2^32) will require modifications to my programs, but (assuming that disk space is available) not major trauma. If the pointers and counters in the index data structures are redeclared to be 6 bytes instead of 4 bytes, for example, it should be possible to handle up to 256 TB of text in theory. Index file overhead will go up to about 120% instead of the current 80% of the database text size. At this point, some simple compression routines might be worth exploring to increase storage efficiency, if it can be done without slowing down the retrieval process.

Proximity subset searching should still be straightforward to do using my simple vector representation of the database, but as files get bigger it may be necessary to hold the large subset vectors on disk and page them into memory only as needed. Modern operating systems with large virtual memory spaces should be able to handle that for the program automatically. If I keep the current default 32-byte quantization, then the subset vectors will still be only 0.4% as big as the total database text size, and so even in the multigigabyte zone each subset will only require a few megabytes of space.

CONCLUSIONS

My bottom-line evaluation is that a free-text IR system such as I have built, plus anticipated extensions, will surely break down -- but not until reaching file sizes in the 10 GB or larger range, or with information arriving at a rate greater than 1 GB of text per day. Then, problems with data transfer rates and with indexing speed may force one to find alternative solutions. Probably a multiprocessor approach using a partitioned database is the best tactic to take at that point.

Meanwhile, I see a lot of value still to be derived from my real-time high-bandwidth free-text information retrieval tools, particularly as the costs of data storage continue to decline.

For further information, and to submit comments and suggestions:

^z = Mark Zimmermann
"z (at) his.com"
Mark Zimmermann, P.O.Box 598, Kensington, MD 20895-0598, USA

This page has been accessed 22389 times since March 1999 ... plus an unknowable number of hits before that on previous hosts (1996-1999). It was last modified Thursday, 27-Sep-2001 18:56:25 EDT.