FreeText Archive
Welcome! This is ^z's directory of the programs he has written for
real-time high-bandwidth large-scale free-text information
retrieval --- the (in)famous "FreeText Project".
Here you will find all the source code and documentation
that I currently have available for FreeText IR.
But first, a disclaimer: This directory contains
free software under the GNU General Public License.
These programs are (virtually) unsupported, and come with
no warranty, express or implied, etc., etc. Don't use any of
them to control nuclear reactors, aircraft, critical life
support systems, or to do anything else dangerous.
(Except for those most dangerous of activities,
thinking and learning!)
On the positive side, if you send a friendly note to
"z (at) his.com" then I'll
try to reply to you with what semi-helpful comments
I can come up with. I don't usually respond to mail until
I've thought about it for several days, so be patient
with me. And please don't expect too much --- I haven't
worked on these programs for several years now. Worse news,
I have no Macintosh hardware, and so can probably not be of
much aid on the FreeText/Tex/Texas HyperCard stack front.
Sorry!
It was last modified Sunday, 01-May-2005 20:50:59 EDT.
Quick-Start Instructions
If you want to experiment with a 32-bit DOS version of
FreeText, the following may help you begin the journey:
- Create a directory and put one or more text files in it which you
wish to browse. (The works of Shakespeare, or the Bible, or
your class notes, or whatever you prefer.)
- Prepare a text file with a name extension ".F" containing
a list of the file(s) you wish to browse, one per line. Full
paths are ok, or just the file names (with extensions) if you
plan to browse from within this directory. Call the file list
"FLIST.F" for purposes of these instructions.
- Copy the files ZINDEX.EXE, ZMERGE.EXE, and ZBROWSE.EXE
into the directory with FLIST.F and the database file(s).
- Execute the command: ZINDEX FLIST.F
- Execute the command: ZMERGE FLIST.F
- The result should be a rewritten FLIST.F file (you may
look at it, but do not edit it) and two binary index files
named FLIST.K (keywords) and FLIST.P (pointers).
- Now execute the command: ZBROWSE word FLIST
where "word" is any word. You should find yourself browsing
a list of all the words in the database file(s), in alphabetical
order, with a display showing the number of occurrences of each.
- Use the keyboard up- and down-arrow keys (or ^P and ^N)
to move up and down in the word list, or type in any word and
hit the return key to jump to that word.
- With the cursor on a word that interests you, hit the
right-arrow key to drop into a key-word-in-context display
of all occurrences of your chosen word with half a line
of context on each side. You can scroll around in this KWIC
display with the up- and down-arrow keys, as in the word list;
you can return to the word list with the left-arrow key.
- With the cursor on a line of key-word-in-context that
interests you, hit the right-arrow key to drop down into
the full text of your database at that point. Scroll around
in the text with the up- and down-arrow keys, and return
to the KWIC or the index word list with the left-arrow.
- Type "/?" to get a help screen summarizing available
commands. You can use the "subset" browsing feature to
do fuzzy boolean proximity searching within a subset
of the database, based on chosen words.
- Type "/Q" to quit.
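The key-word-in-context display described in the steps above can be sketched in C roughly as follows. This is only a minimal illustration and is not taken from zbrwsr.c or ZBROWSE.EXE: the function name kwic_line, its arguments, and the choice to show newlines and tabs as spaces are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the FreeText sources).
 * Writes one key-word-in-context line into out. The character at
 * text[pos] (the start of the keyword) lands at column `half`, with
 * `half` characters of context on each side. Columns that fall outside
 * the text are padded with spaces, and newlines, carriage returns, and
 * tabs are shown as spaces so that each hit stays on one display line.
 * out must have room for at least 2*half + 1 bytes. */
void kwic_line(const char *text, size_t pos, size_t half, char *out)
{
    size_t len, col;

    assert(text != NULL && out != NULL);
    len = strlen(text);
    assert(pos <= len);

    for (col = 0; col < 2 * half; col++) {
        char c = ' ';
        /* source index is pos - half + col; test first to avoid underflow */
        if (col + pos >= half && col + pos - half < len) {
            c = text[col + pos - half];
            if (c == '\n' || c == '\r' || c == '\t')
                c = ' ';
        }
        out[col] = c;
    }
    out[2 * half] = '\0';
}
```

For example, with the text "to be or not to be", a keyword starting at offset 9 ("not"), and a half-width of 5, kwic_line fills out with "e or not t". A real browser would take the offsets from its pointer index rather than from a hard-coded number.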
You can set the environment variable BRWSR_DBASE to
a database name, for example via "SET BRWSR_DBASE=FLIST",
and then that database will become the default which will
be opened when you run ZBROWSE.
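A minimal sketch of how a program might honor such a default is shown here. It is not the ZBROWSE implementation: the function names are hypothetical, and both the rule that an explicitly given name wins over BRWSR_DBASE and the way the file names are built from the database name are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: prefer an explicitly given database name,
 * otherwise use the fallback (normally the value of BRWSR_DBASE).
 * Returns NULL when neither is usable. */
const char *pick_database(const char *explicit_name, const char *fallback)
{
    if (explicit_name != NULL && explicit_name[0] != '\0')
        return explicit_name;
    if (fallback != NULL && fallback[0] != '\0')
        return fallback;
    return NULL;
}

/* Same choice, reading the fallback from the environment. */
const char *default_database(const char *explicit_name)
{
    return pick_database(explicit_name, getenv("BRWSR_DBASE"));
}

/* Build "<base>.<ext>", for instance "FLIST" and 'K' give "FLIST.K".
 * Returns 0 on success, -1 if the result does not fit in out. */
int index_file_name(const char *base, char ext, char *out, size_t out_size)
{
    int n;

    assert(base != NULL && out != NULL);
    n = snprintf(out, out_size, "%s.%c", base, ext);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

A browser written along these lines could call default_database() once at startup and then use index_file_name() to find the corresponding .F, .K, and .P files.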
The PC executables were compiled and provided by an
anonymous friend. I think that the *.EXE files were derived
from the zndxr.c, zmrgr.c, and zbrwsr.c sources here, with the
addition of code to provide a simple windowing interface that
responds to cursor movement keys and other commands --- but
I have no way to verify that hypothesis. So it is likely that I
will be unable to help much in answering questions about these
DOS programs. Nevertheless, they seem to work well in my
experiments. I found these files on a Numerical Recipes
CD-ROM, for which I thank Professor William Press, gentleman
and physicist.
FreeText Archive Annotated Directory
In this archive, the files you will find are:
- ft103.zip (271kB) --- the "final" version of the Macintosh HyperCard stack "FreeText version 1.03", available thanks to the kind help of Nick Thieberger (Cf. http://www.linguistics.unimelb.edu.au/thieberger/)
- texas.zip (665kB) --- an early version of the Mac free-text IR HyperCard stack, also available thanks to Nick Thieberger
- ZBROWSE.EXE (101kB) --- the
compiled 32-bit DOS executable used to browse indexed free-text
databases. See the "Quick-Start Instructions" above for details
on how to use it; see zbrwsr58msg.c
for heavily annotated source code. (Note that zbrwsr.c
does not implement the cursor-controlled DOS windowing
interface of ZBROWSE.EXE.)
- ZINDEX.EXE (49kB) --- the
compiled DOS program to build individual index files for free-text
search and retrieval. See zndxr41msg.c
for source code and commentary, and see "Quick-Start Instructions"
above for usage.
- ZMERGE.EXE (52kB) --- the
compiled DOS program to merge together multifile database
indices and to generate the correct *.F, *.K, and *.P files for
free-text browsing. See zmrgr46msg.c
for source code with comments, and see "Quick-Start Instructions"
above for hints on how to make it work for you.
- browser.c (48kB) --- the source code
for a generic program to use indices built by indexer.c and
do simple command-line browsing on single-file databases.
This software generates
index word lists with frequency counts and key-word-in-context
(KWIC) displays, and retrieves individual chunks from the
database. The "browser.c" code was originally written ca. 1986
and is a fine starting point to become acquainted with FreeText
indexer/browser software by reading the heavily commented sources.
It has been compiled and run under AmigaOS, DOS, MacOS, various
flavors of UNIX, and VMS, on systems up to and including Cray supercomputers.
- countchar.c (1kB) --- the source
code for a tiny and obvious program to count how many characters
of each type appear in an input stream or file (written many years
ago for a friend who didn't know how trivial it was)
- indexer.c (50kB) --- the source
code for a generic program to build binary (machine-readable,
not human-readable) indices to single text files. This software
was originally written ca. 1986 and, like "browser.c", is an
excellent starting point for reading into the sources and learning
about FreeText information retrieval data structures and
algorithms. The C is simple and heavily annotated, and has run on numerous systems without
significant changes.
- relrank36.c (32kB) --- the source
code for an experiment in relevance-ranked information retrieval
from FreeText. This code is rather buggy and should be read with
considerable skepticism!
- zbrwsr58msg.c (40kB) --- a
message containing the C source code plus commentary
for "zbrwsr.c" version 58, a fairly well-tested and debugged
command-line user interface to multifile (*.f, *.k, *.p) FreeText
databases. It is likely a basis for ZBROWSE.EXE, which adds DOS
arrow-key commands and
other interface features to the zbrwsr.c command-line system.
Read this code for ideas and concepts associated with multifile
index browsing, but note that the earlier and more straightforward
browser.c may be a better starting
point for study.
- zcmpndx04msg.c (11kB) --- a
message containing and describing the experimental program
zcmpndx.c to compare two indexed database files and identify
their differences in a statistically-interesting way. Specifically,
zcmpndx takes as input two *.k "key" index files and analyzes
them to determine which words appear abnormally often or
seldom in one relative to the other. The code is simple and may
be of interest to those working on linguistic analyses, or to
people looking for ways to characterize a small file relative
to a large "average" word frequency distribution.
- zftir015.c (26kB) --- this code
is a fragmentary experiment in writing a CGI interface to a
FreeText indexed database. It is extremely buggy and
should be read with extraordinary skepticism!
- zmrgr46msg.c (37kB) --- a
message documenting and including source code for version 46
of the zmrgr.c program to merge multifile database index
files created by zndxr.c; it is likely the basis of the compiled
ZMERGE.EXE DOS executable.
The zmrgr.c code is worth reading for those who wish to learn
about the internals of multifile FreeText databases.
- zndxr41msg.c (35kB) --- a
message documenting and including source code for version 41 of
a program, zndxr, to index multiple files, using a "filelist" (*.f)
table. This code is likely the basis of the DOS executable
ZINDEX.EXE, and may be read
to learn about FreeText index building, though probably the
older and simpler indexer.c is a
better starting point.
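To make the idea behind zcmpndx concrete, here is a small sketch of one way to score whether a word appears abnormally often or seldom in one file relative to another. The statistic that zcmpndx.c actually uses is not described on this page, so the measure below (a ratio of observed to expected counts, with 0.5 added to each side) is purely an assumption chosen for simplicity, and the function name is hypothetical.

```c
#include <assert.h>

/* Hypothetical scoring function (not from zcmpndx.c).
 * Returns the ratio of observed to expected occurrences of a word in a
 * small file, where the expectation comes from the word's share of a
 * large reference file. Adding 0.5 to both sides keeps the ratio finite
 * when the word is missing from one of the files. Values well above 1
 * suggest the word is unusually common in the small file; values well
 * below 1 suggest it is unusually rare. */
double overuse_ratio(unsigned long count_small, unsigned long total_small,
                     unsigned long count_ref, unsigned long total_ref)
{
    double expected;

    assert(total_small > 0 && total_ref > 0);
    expected = (double)count_ref * (double)total_small / (double)total_ref;
    return ((double)count_small + 0.5) / (expected + 0.5);
}
```

For instance, a word seen 10 times in a 1000-word file but only 100 times in a 100000-word reference has an expected count of 1, giving a ratio of 10.5 / 1.5 = 7. The per-word counts would come from the two *.k index files; reading that binary format is outside the scope of this sketch.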
Final Remarks
Since the mid-1980's I have enjoyed struggling with and
learning from the FreeText project, and would like to acknowledge
the aid of numerous people who provided splendid questions,
suggestions, contributions, and encouragement in this activity.
Thank you all! And thank you to the thousands of people
who have used FreeText for linguistic research, indexing books
and CD-ROMs, searching literary archives, and other applications.
There are many areas in which FreeText needs further work.
Probably the programs in this archive are best used as starting
points, inspirations but not roadmaps for new programs in Java, Perl,
Scheme, or other appropriate languages. Among the key development
issues to watch, I would recommend focusing early and often on:
- user interface features --- see, for starters, the essay
"FreeText Information Retrieval Philosophy" for thoughts on vital
attributes of free-text IR interfaces which the
FreeText programs implement.
(This is harder than it looks!)
- non-Latin alphabets --- there are many challenges
associated with multibyte characters, mapping equivalent
characters together, properly handling alphabetization order
variations, diacritical marks, etc.
(This is harder than it looks!)
- efficiency --- in index-building, in index-updating,
in handling large numbers of database files, in browsing, in
performing proximity-subset search and retrieval, and so forth.
It is important to save both time and space, particularly in
responding to the commonest user requirements.
(This is harder than it looks!)
Please let me know if you use FreeText software and if it
helps you in your work. You may write to me at
"z (at) his.com". My
paper mail address is:
Mark Zimmermann
P. O. Box 598
Kensington, MD 20895-0598
USA
Best,
Silver Spring, Maryland, USA
updated November 1999 & February 2001 & May 2005