FreeText Archive

Welcome! This is ^z's directory of the programs he has written for real-time high-bandwidth large-scale free-text information retrieval --- the (in)famous "FreeText Project". Here you will find all the source code and documentation that I currently have available for FreeText IR.

But first, a disclaimer: This directory contains free software under the GNU General Public License. These programs are (virtually) unsupported, and come with no warranty, express or implied, etc., etc. Don't use any of them to control nuclear reactors, aircraft, critical life support systems, or to do anything else dangerous. (Except for those most dangerous of activities, thinking and learning!)

On the positive side, if you send a friendly note to "z (at) his.com" then I'll try to reply to you with what semi-helpful comments I can come up with. I don't usually respond to mail until I've thought about it for several days, so be patient with me. And please don't expect too much --- I haven't worked on these programs for several years now. Worse news, I have no Macintosh hardware, and so can probably not be of much aid on the FreeText/Tex/Texas HyperCard stack front. Sorry!

This page has been accessed 40640 times. It was last modified Sunday, 01-May-2005 20:50:59 EDT.

Quick-Start Instructions

If you want to experiment with a 32-bit DOS version of FreeText, the following may help you begin the journey:

Create a directory and put one or more text files in it which you wish to browse. (The works of Shakespeare, or the Bible, or your class notes, or whatever you prefer.)
Prepare a text file with a name extension ".F" containing a list of the file(s) you wish to browse, one per line. Full paths are ok, or just the file names (with extensions) if you plan to browse from within this directory. Call the file list "FLIST.F" for purposes of these instructions.
Copy the files ZINDEX.EXE, ZMERGE.EXE, and ZBROWSE.EXE into the directory with FLIST.F and the database file(s).
Execute the command: ZINDEX FLIST.F
Execute the command: ZMERGE FLIST.F
The result should be a rewritten FLIST.F file (you may look at it, but do not edit it) and two binary index files named FLIST.K (keywords) and FLIST.P (pointers).
Now execute the command: ZBROWSE word FLIST
where "word" is any word. You should find yourself browsing a list of all the words in the database file(s), in alphabetical order, with a display showing the number of occurrences of each.
Use the keyboard up- and down-arrow keys (or ^P and ^N) to move up and down in the word list, or type in any word and hit the return key to jump to that word.
With the cursor on a word that interests you, hit the right-arrow key to drop into a key-word-in-context display of all occurrences of your chosen word with half a line of context on each side. You can scroll around in this KWIC display with the up- and down-arrow keys, as in the word list; you can return to the word list with the left-arrow key.
With the cursor on a line of key-word-in-context that interests you, hit the right-arrow key to drop down into the full text of your database at that point. Scroll around in the text with the up- and down-arrow keys, and return to the KWIC or the index word list with the left-arrow.
Type "/?" to get a help screen summarizing available commands. You can use the "subset" browsing feature to do fuzzy boolean proximity searching within a subset of the database, based on chosen words.
Type "/Q" to quit.
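
The key-word-in-context display described in the steps above can be sketched in C roughly as follows. This is an illustration only, not code from the archive: it assumes the text is already in memory and finds words by brute-force scanning, whereas the FreeText programs look words up in the prebuilt *.K and *.P index files. The names find_word, format_kwic_line, and NOT_FOUND are invented for this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define NOT_FOUND ((size_t)-1)

/* Return the offset of the next whole-word, case-insensitive match of
   word in text at or after offset from, or NOT_FOUND if there is none. */
size_t find_word(const char *text, size_t text_len,
                 const char *word, size_t from)
{
    size_t word_len = strlen(word);
    size_t i, j;

    assert(word_len > 0);
    for (i = from; i + word_len <= text_len; i++) {
        /* a match must not begin in the middle of a longer word */
        if (i > 0 && isalnum((unsigned char)text[i - 1]))
            continue;
        for (j = 0; j < word_len; j++)
            if (tolower((unsigned char)text[i + j]) !=
                tolower((unsigned char)word[j]))
                break;
        if (j < word_len)
            continue;
        /* ...and must not end in the middle of one either */
        if (i + word_len < text_len &&
            isalnum((unsigned char)text[i + word_len]))
            continue;
        return i;
    }
    return NOT_FOUND;
}

/* Fill out[0..width-1] with the text surrounding offset hit, so that the
   hit begins at column width/2; out must have room for width+1 chars.
   Positions before the start or past the end of the text become spaces,
   as do newlines and other control characters. */
void format_kwic_line(const char *text, size_t text_len,
                      size_t hit, size_t width, char *out)
{
    size_t half = width / 2;
    size_t col;

    for (col = 0; col < width; col++) {
        char c = ' ';
        if (hit + col >= half && hit + col - half < text_len) {
            c = text[hit + col - half];
            if (iscntrl((unsigned char)c))
                c = ' ';
        }
        out[col] = c;
    }
    out[width] = '\0';
}
```

To list every occurrence, a caller would loop: call find_word starting from offset 0, print the line produced by format_kwic_line for each hit, and resume the search one character past the previous hit until NOT_FOUND comes back. Starting each hit at a fixed column is what lets the eye run down the keyword in a KWIC listing.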

You can set the environment variable BRWSR_DBASE to a database name, for example via "SET BRWSR_DBASE=FLIST", and then that database will become the default which will be opened when you run ZBROWSE.
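
For readers writing their own browser, a default-database lookup of this kind might be sketched in C as below. This is not taken from the ZBROWSE sources. The precedence rule (a database named on the command line wins over the environment variable) and the function names pick_database and database_for are assumptions made for illustration; only the variable name BRWSR_DBASE and the "ZBROWSE word FLIST" argument order come from the text above.

```c
#include <assert.h>
#include <stdlib.h>

/* Environment variable named in the instructions above. */
#define DBASE_ENV_VAR "BRWSR_DBASE"

/* Assumed selection rule (not confirmed ZBROWSE behavior): with a command
   line of the form "ZBROWSE word FLIST", the database is argv[2]; if it is
   absent, fall back to env_value; if that is absent too, return NULL.
   Empty strings are treated as "not given". */
const char *pick_database(int argc, char *argv[], const char *env_value)
{
    assert(argc >= 1 && argv != NULL);  /* argv[0] is the program name */

    if (argc >= 3 && argv[2][0] != '\0')
        return argv[2];
    if (env_value != NULL && env_value[0] != '\0')
        return env_value;
    return NULL;
}

/* Same rule, reading the default from the process environment. */
const char *database_for(int argc, char *argv[])
{
    return pick_database(argc, argv, getenv(DBASE_ENV_VAR));
}
```

Keeping the rule in pick_database, separate from the getenv call, makes it easy to check without touching the real environment. A main function would simply call database_for(argc, argv) and report an error if the result is NULL.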

The PC executables were compiled and provided by an anonymous friend. I think that the *.EXE files were derived from the zndxr.c, zmrgr.c, and zbrwsr.c sources here, with the addition of code to provide a simple windowing interface that responds to cursor movement keys and other commands --- but I have no way to verify that hypothesis. So it is likely that I will be unable to help much in answering questions about these DOS programs. Nevertheless, they seem to work well in my experiments. I found these files on a Numerical Recipes CD-ROM, for which I thank Professor William Press, gentleman and physicist.

FreeText Archive Annotated Directory

In this archive, the files you will find are:

ft103.zip (271kB) --- the "final" version of the Macintosh HyperCard stack "FreeText version 1.03", available thanks to the kind help of Nick Thieberger (see http://www.linguistics.unimelb.edu.au/thieberger/)
texas.zip (665kB) --- an early version of the Mac free-text IR HyperCard stack, also available thanks to Nick Thieberger
ZBROWSE.EXE (101kB) --- the compiled 32-bit DOS executable used to browse indexed free-text databases. See the "Quick-Start Instructions" above for details on how to use it; see zbrwsr58msg.c for heavily annotated source code. (Note that zbrwsr.c does not implement the cursor-controlled DOS windowing interface of ZBROWSE.EXE.)
ZINDEX.EXE (49kB) --- the compiled DOS program to build individual index files for free-text search and retrieval. See zndxr41msg.c for source code and commentary, and see "Quick-Start Instructions" above for usage.
ZMERGE.EXE (52kB) --- the compiled DOS program to merge together multifile database indices and to generate the correct *.F, *.K, and *.P files for free-text browsing. See zmrgr46msg.c for source code with comments, and see "Quick-Start Instructions" above for hints on how to make it work for you.
browser.c (48kB) --- the source code for a generic program to use indices built by indexer.c and do simple command-line browsing on single-file databases. This software generates index word lists with frequency counts and key-word-in-context (KWIC) displays, and retrieves individual chunks from the database. The "browser.c" code was originally written ca. 1986 and is a fine starting point to become acquainted with FreeText indexer/browser software by reading the heavily commented sources. It has been compiled and run under AmigaOS, DOS, MacOS, various flavors of UNIX, and VMS, on systems up to and including Cray supercomputers.
countchar.c (1kB) --- the source code for a tiny and obvious program to count how many characters of each type appear in an input stream or file (written many years ago for a friend who didn't know how trivial it was)
indexer.c (50kB) --- the source code for a generic program to build binary (machine-readable, not human-readable) indices to single text files. This software was originally written ca. 1986 and, like "browser.c", is an excellent starting point for reading into the sources and learning about FreeText information retrieval data structures and algorithms. The C is simple and heavily annotated, and has run on numerous systems without significant changes.
relrank36.c (32kB) --- the source code for an experiment in relevance-ranked information retrieval from FreeText. This code is rather buggy and should be read with considerable skepticism!
zbrwsr58msg.c (40kB) --- a message containing the C source code plus commentary for "zbrwsr.c" version 58, a fairly well-tested and debugged command-line user interface to multifile (*.f, *.k, *.p) FreeText databases. It is likely a basis for ZBROWSE.EXE, which adds DOS arrow-key commands and other interface features to the zbrwsr.c command-line system. Read this code for ideas and concepts associated with multifile index browsing, but note that the earlier and more straightforward browser.c may be a better starting point for study.
zcmpndx04msg.c (11kB) --- a message containing and describing the experimental program zcmpndx.c to compare two indexed database files and identify their differences in a statistically-interesting way. Specifically, zcmpndx takes as input two *.k "key" index files and analyzes them to determine which words appear abnormally often or seldom in one relative to the other. The code is simple and may be of interest to those working on linguistic analyses, or to people looking for ways to characterize a small file relative to a large "average" word frequency distribution.
zftir015.c (26kB) --- this code is a fragmentary experiment in writing a CGI interface to a FreeText indexed database. It is extremely buggy and should be read with extraordinary skepticism!
zmrgr46msg.c (37kB) --- a message documenting and including source code for version 46 of the zmrgr.c program to merge multifile database index files created by zndxr.c; it is likely the basis of the compiled ZMERGE.EXE DOS executable. The zmrgr.c code is worth reading for those who wish to learn about the internals of multifile FreeText databases.
zndxr41msg.c (35kB) --- a message documenting and including source code for version 41 of a program, zndxr, to index multiple files, using a "filelist" (*.f) table. This code is likely the basis of the DOS executable ZINDEX.EXE, and may be read to learn about FreeText index building, though probably the older and simpler indexer.c is a better starting point.
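
As a taste of how small a program like countchar.c can be, here is a minimal C sketch of the same idea: tally how often each byte value occurs in a stream. It is not the archived countchar.c, whose details may differ. The names tally_bytes, tally_stream, print_tallies, and NUM_BYTE_VALUES are invented here, and "character" is assumed to mean a single byte.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

#define NUM_BYTE_VALUES (UCHAR_MAX + 1)

/* Add the bytes in buf[0..len-1] to the running tallies in counts[]. */
void tally_bytes(const unsigned char *buf, size_t len,
                 unsigned long counts[NUM_BYTE_VALUES])
{
    size_t i;

    assert(counts != NULL);
    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++)
        counts[buf[i]]++;
}

/* Read a stream to end-of-file, tallying every byte.
   Returns the total number of bytes read. */
unsigned long tally_stream(FILE *in, unsigned long counts[NUM_BYTE_VALUES])
{
    unsigned char buf[4096];
    unsigned long total = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        tally_bytes(buf, n, counts);
        total += n;
    }
    return total;
}

/* Print one line per byte value that occurred at least once. */
void print_tallies(FILE *out, const unsigned long counts[NUM_BYTE_VALUES])
{
    int c;

    for (c = 0; c < NUM_BYTE_VALUES; c++) {
        if (counts[c] == 0)
            continue;
        if (c >= 32 && c < 127)          /* printable ASCII */
            fprintf(out, "'%c'  %lu\n", c, counts[c]);
        else                             /* show everything else in hex */
            fprintf(out, "0x%02X %lu\n", (unsigned)c, counts[c]);
    }
}
```

A main function that zeroes an array of NUM_BYTE_VALUES counters, calls tally_stream(stdin, counts), and then print_tallies(stdout, counts) turns this into a filter that can be fed any file. A file meant to hold non-text bytes should be opened in binary mode on systems such as DOS that distinguish the two.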

Final Remarks

Since the mid-1980s I have enjoyed struggling with and learning from the FreeText project, and would like to acknowledge the aid of numerous people who provided splendid questions, suggestions, contributions, and encouragement in this activity. Thank you all! And thank you to the thousands of people who have used FreeText for linguistic research, indexing books and CD-ROMs, searching literary archives, and other applications.

There are many areas in which FreeText needs further work. Probably the programs in this archive are best used as starting points, inspirations but not roadmaps for new programs in Java, Perl, Scheme, or other appropriate languages. Among the key development issues to watch, I would recommend focusing early and often on:

user interface features --- see, for starters, the essay "FreeText Information Retrieval Philosophy" for thoughts on vital attributes of free-text IR interfaces which the FreeText programs implement. (This is harder than it looks!)
non-Latin alphabets --- there are many challenges associated with multibyte characters, mapping equivalent characters together, properly handling alphabetization order variations, diacritical marks, etc. (This is harder than it looks!)
efficiency --- in index-building, in index-updating, in handling large numbers of database files, in browsing, in performing proximity-subset search and retrieval, and so forth. It is important to save both time and space, particularly in responding to the commonest user requirements. (This is harder than it looks!)

Please let me know if you use FreeText software and if it helps you in your work. You may write to me at "z (at) his.com". My paper mail address is:

Mark Zimmermann
P. O. Box 598
Kensington, MD 20895-0598
USA

Best,

^z = Mark Zimmermann

Silver Spring, Maryland, USA
updated November 1999 & February 2001 & May 2005