/* relevance-ranker --- generic tty-style version --- 1993 July --- ^z
 * FREE SOFTWARE under GNU GPL!  (c) 1993 Mark Zimmermann
 *
 * CONCEPT:  open an indexed database file (document file, *.k & *.p index files)
 * and give "queries" which might look like:
 *		ZIMMERMANN		(areas with lots of "ZIMMERMANN" in them)
 *		FREE | TEXT		(areas with lots of either "FREE" or "TEXT")
 *		FREE & TEXT		(areas with both "FREE" and "TEXT", the more the better)
 *		((FREE & TEXT) | ((INFORMATION | DATA) & RETRIEVAL)) ! ZIMMERMANN
 * (where ! means "AND NOT" perhaps)... etc....
 *
 * (We also must allow run-time alteration of defaults --- key parameters are the
 * peak "value" of a term and its "influence" range.  Higher value terms count for
 * more (linearly); higher influence terms have a longer distance over which
 * they are important.  See below for details...)
 *
 * RESULT of entering a query is a set of relevance-ranked "hits" --- local
 * maxima of the estimated-relevance function (see below for details).  Show
 * the user information about them on the screen, and allow paging (the way
 * "more" does) up & down through a hit --- maybe with emphasis for key terms
 * in the query (highlighting, flip upper/lower case, whatever)....
 *
 * METHOD:  set up a relevance-vector with an element for each N bytes of the
 * database; typically use a vector of unsigned chars to store values 0-255 for
 * relevance values (representing relevances from 0 to 1? --- or some other
 * scoring method??); set N to some convenient value, probably a power of 2
 * perhaps 64.  (For N=64, figure that 1 MB or so of free memory could handle
 * a 32 MB or so database; probably will need to have a couple of relevance
 * vectors in memory at once, to do the "&" and "|" and "!" operations?)
 *
 * For each query term, look up the number of occurrences (in the *.k key file)
 * and their location (byte offset from start of the doc file, in the *.p ptr file);
 * start out with an empty (all-zero) relevance vector rel_vec; go through the
 * rel_vec and add (or multiply, or whatever) in the influence of each instance of
 * a word.
 *
 * For each term, let the default be a "value" of 10 and an "influence" range
 * of 640 bytes ... default the shape of the relevance-function for each term
 * to a triangle, so that for quantum N=64 bytes, a single term in isolation
 * produces a relevance-vector that locally looks like:
 * ... 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 0, ...
 *
 * Thus, for a single term, go down the relevance-vector and add in the above
 * pattern (or whatever modification is chosen to the default) for each instance
 * of the term.  (May have to watch out for overflows of the 0-255 range value
 * that each rel_vec element?!  Renormalization??)  For other terms which are
 * OR'd (|) into the same relevance set, add their occurrences in as well....
 *
 * For terms which are AND'd (&) in, accumulate their relevance values in a
 * separate rel_vector, and then multiply the rel_vectors together.  (Again,
 * better watch out for overflows --- renormalize at need?!)
 *
 * For terms which are NOT'd (!) in, compute a rel_vector in the usual way,
 * but then invert it by finding the peak value and subtracting every element
 * from that peak, before AND'ng (multiplying rel_vectors).
 *
 * When a quantized relevance vector is computed for a given query, scan down
 * the rel_vec and create a list of pointers to the local maxima.  Then sort
 * those pointers into order based on the relevance value associated with each
 * peak, and present the data in order of estimated relevance....
 *
 * FIRST EFFORT:  single-term relevance ranking, no frills!....
 * SECOND:  OR'd term set (single set)....
 * THIRD:   AND'd groups of OR'd term sets....
 * FOURTH:  NOT'd terms in addition to above....
 * FIFTH:  wild cards (esp. terminal wild cards)...
 * etc. --- more features to add.... better "more"-style paging up/down, to
 *    target strings & regexps maybe (like "isearch" in Emacs)....
 *
 * User interaction for this could be pretty simply done by asking questions;
 * a sample session might look like:
 *
 *  PLEASE ENTER SETS OF EQUIVALENT QUERY TERMS:
 *  SET 1> frog frogs toad toads
 *  SET 2> predator prey eat consume diet
 *  SET 3>
 *
 *	QUERY:  (FROG|FROGS|TOAD|TOADS)&(PREDATOR|PREY|EAT|CONSUME|DIET)
 *
 *     98 FROG
 *     17 FROGS
 *     49 TOAD
 *      8 TOADS
 *    976 PREDATOR
 *    127 PREY
 *  13682 EAT
 *    241 CONSUME
 *    496 DIET
 *   
 *    107 MAXIMUM RELEVANCE VALUE
 *     57 RELEVANCE PEAKS
 *
 *  <<followed by a screen full, default 24 lines of 80 chars, in the region of the
 *     highest peak relevance value, centered on the peak bin in the rel_vec; display
 *     any words in the query in highlighted fashion somehow (reverse-video or reverse
 *     upper/lower case?); the last line, "more"-fashion, can show the location in
 *     the database (percentage or absolute value?) along with the local relevance
 *     value; type <space> to move down a screenfull, "b" to go back a screenfull,
 *     "n" to go to next relevant location, "p" to go to previous relevant location,
 *     "q" to quit back to query-entry screen, maybe....>>
 *
 * THOUGHT:  use SIOD, or some other tiny Scheme dialect, for an extension language,
 * to let the user define other relevance function shapes at runtime, define macros,
 * etc.?!
 *
 * THOUGHT:  allow for loading of a configuration file, to set defaults up differently
 * if desired?  Examples of parameters that need to be set at run-time for some
 * users:  screen height & width, filtering of control characters, etc.?  Someday,
 * worry about alternative character-mappings (*.a file) and multifile db's (*.f
 * file) for extended indexer/browser work....
 *
 * THOUGHT:  allow thesaurus-like lists of synonyms?  Stemming?  More complex wild
 * card options than terminal wildcards?  Regexps?  Some of this might be slow, if
 * it requires scanning through the entire *.k keyword list to do a regexp match....
 *
 * DESIGN COMMENTS:  I would like this to have maximal compatibility with various
 * machines ... Think C has 2-byte ints on the Mac, in default mode, so some things
 * that may have more than 32767 items (such as the relevance vectors) have to be
 * subscripted with longs .... 
 *
 * NICE TO HAVE:  highlighting of terms that went into the query --- but how?
 * Could keep track of the values looked-up in *.p file and highlight all of them
 * as they come out, but that might be expensive, tricky, and costly of memory....
 * Could watch the output stream and highlight matches as they are being printed,
 * but that might be very tricky to pull off....
 *
 * THOUGHT:  in LISP/Scheme, could easily (?) keep a list of the *.p values and
 * could easily (?) merge in new ones (until we run out of memory?) in sorted
 * order ... and then could build the relevance vector by running down that list,
 * maybe even recognizing peaks as they occur without keeping the big vector in
 * actual memory and sorting it as in the C version?  All that is needed is to
 * have fseek() and fread(), which we seem to have in SIOD?!
 *
 * SECOND THOUGHTS:  maybe I should just do all of the previous paragraph in C?
 * There seem to be significant advantages to the fixed-quantization relevance
 * vector approach for big databases, as it is guaranteed not to grow huge when
 * fairly common words enter into a relevance judgment (though the rel_peaks[]
 * does have a ceiling...)
 *
 */
 
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
#ifdef THINK_C
#include <console.h>    /* to get command-line args  for Mac users */
#endif

/* header definitions
 */
#define default_quantum     64
#define default_value       10
#define default_influence  640
#define MAXSTRLEN         1024
#define MAXRELSHAPELEN    1024
#define MAXRELPEAKS       1024
#define MAXLINES            24
#define MAXLINELEN          80
#define KEY_LENGTH          28
#define PTR_REC           long
#define TOP                 -1
#define CENTER               0
#define BOTTOM               1

/* structure of the records in the index *.k "key" file:
 *      a fixed-length character string "kkey", filled out with blanks and
 *              containing the unique alphanumeric 'words' in the document file,
 *              changed to all-capital letters (default character mapping);
 *      a cumulative count "ccount" of how many total occurrences of words, including
 *              the current one, have happened up to this point in the alphabetized
 *              index.
 *
 *      Obviously, the number of occurrences of the nth word is just the
 *              difference between the nth cumulative count and the (n-1)th
 *              cumulative count.  Also, the (n-1)th cumulative count is a
 *              pointer to the first occurrence of the nth word in the *.p "ptr" file
 *              (which holds all the index offset pointers for the document file).
 */
typedef struct
  {
        char kkey[KEY_LENGTH];
        long ccount;
  }  KEY_REC;


 
/* global variables:
 */
FILE *doc_file, *key_file, *ptr_file;  /* file pointers to document, *.k, *.p files */
long max_key_recnum, max_ptr_recnum, doc_length, rel_vec_length;
unsigned char *rel_vec, *rel_vec2;
unsigned char rel_shape[MAXRELSHAPELEN];  /* relevance function shape */
int rel_shape_peak, rel_shape_end, quantum, rel_peak_count;
long rel_peaks[MAXRELPEAKS];   /* locations of peak relevances in rel_vec */



/* prototypes:
 */
void main (int argc, char *argv[]);
FILE *open_file (char *fn, char *ft);
char *getquerykey (char *qp, KEY_REC *query_keyrecp);
void lookupquerykey (KEY_REC *this_keyrecp, KEY_REC *prev_keyrecp);
void get_key_rec (KEY_REC *keyrecp, unsigned long n);
PTR_REC get_ptr_rec (long n);
long find_key (char *targetkey);
int zstrcmp (char *s1, char *s2);
void add_to_rel_vec (PTR_REC p, unsigned char *rel_vec);
void fill_rel_shape (int value, int influence);
int find_rel_peaks (unsigned char *rel_vec, long *rel_peaks);
void sort_rel_peaks (unsigned char *rel_vec, long *rel_peaks, int rel_peak_count);
void zpqsort (int first, int last, unsigned char *rel_vec, long *rel_peaks);
void show_text (long target, int how, long *top_valuep, long *bottom_valuep);

/* actual program:
 */

void main (argc, argv)
int argc;
char *argv[]; {
  	int value, influence, done, r;
  	long n, peaknum, target, top_value, bottom_value;
	FILE *open_file();
    int ccommand();
    char querystring[MAXSTRLEN], tmpstring[MAXSTRLEN], *qp;
    KEY_REC query_keyrec, this_keyrec, prev_keyrec;
    PTR_REC p, get_ptr_rec ();
	void exit(), fill_rel_shape(), getfirstquerykey(), lookupquerykey();

/* get UNIX-like argc & argv command-line arguments for the Mac user
 */
#ifdef THINK_C
    argc = ccommand(&argv);
#endif

	if (argc != 2) {
	  	printf ("Sorry, wrong number of command line arguments!\n");
	  	printf ("  (need the name of an indexed database docfile)\n");
	  	exit(0);
	}

	quantum = default_quantum;
	value = default_value;
	influence = default_influence;
	fill_rel_shape (value, influence);

/* for starters, demand that index *.k and *.p files are named in the default
 * way, in the same directory as doc file --- allow overrides & graceful failure
 * later!
 */
	doc_file = open_file (argv[1], "");
	key_file = open_file (argv[1], ".k");
	ptr_file = open_file (argv[1], ".p");
	
	fseek (doc_file, 0L, 2);
	doc_length = ftell (doc_file);
	fseek (key_file, 0L, 2);
    max_key_recnum = ftell (key_file) / sizeof(KEY_REC);
	fseek (ptr_file, 0L, 2);
	max_ptr_recnum = ftell (ptr_file) / sizeof(long);
	printf ("Welcome to RelRank v. 0.0002, July 1993!\n");
	printf ("This is free software under the GNU GPL.\n");
	printf ("        (c)1993 Mark Zimmermann\n\n\n");
	printf ("Database \"%s\" contains:\n", argv[1]);
    printf ("%9ld characters\n%9ld total words\n%9ld distinct words\n\n",
                doc_length, max_ptr_recnum, max_key_recnum);

/* put the guts here for now ... clean up & move out to functions later...
 * ...but it is ugly for now --- sincere apologies!
 */
	rel_vec_length = 1 + doc_length / quantum;
	if ((rel_vec = (unsigned char *)calloc (rel_vec_length, 1)) == NULL) {
		printf ("Sorry, unable to allocate relevance vector length %ld!\n",
						rel_vec_length);
		exit (1);
	}
	printf ("Query terms (to OR together):  ");
	fgets (querystring, MAXSTRLEN, stdin);
	while (strlen (querystring) != 1)  {
	/* an empty querystring as entered is "\n", of length 1, and triggers 'quit' */
		qp = querystring;
		while (*qp != '\0') {
			/* move down query string with query pointer qp, which points to
				the next word to be looked up in the index */
			qp = getquerykey (qp, &query_keyrec);
			this_keyrec = query_keyrec;
			lookupquerykey (&this_keyrec, &prev_keyrec);  /* look up in *.k file */
			if (this_keyrec.kkey[0] == ' ')
				printf ("%.*s not found!\n", KEY_LENGTH, query_keyrec.kkey);
			else {
					printf ("%.*s occurs %ld times\n", KEY_LENGTH, this_keyrec.kkey,
						this_keyrec.ccount - prev_keyrec.ccount);
				printf ("Adding into relevance vector...\n");
				for (n = prev_keyrec.ccount; n < this_keyrec.ccount; ++n) {
					p = get_ptr_rec (n);
					if (p == doc_length)
						printf ("\007Sorry, illegal ptr_rec fetch error! n=%ld\007\n",
									n);
					else
						add_to_rel_vec (p, rel_vec);
				}
			}
		}
		printf ("Finding peak relevances...\n");
		rel_peak_count = find_rel_peaks (rel_vec, rel_peaks);
		printf ("...%d peaks found; sorting peaks...\n", rel_peak_count);
		sort_rel_peaks (rel_vec, rel_peaks, rel_peak_count);
/*  leave this out for now... maybe good for testing?  For optional user display??
		printf ("Most relevant areas of the database are:\n");
		printf ("  peak  value\n");
		for (n = 0; n < rel_peak_count; ++n)
			printf ("%6ld %3d\n", quantum * rel_peaks[n], rel_vec[rel_peaks[n]]);
 */
		/* main user interaction loop for this query */
		printf (" *** press <return> to display query results ***\n");
		fgets (tmpstring, MAXSTRLEN, stdin);
		peaknum = 0;
		target = quantum * rel_peaks[peaknum];
		show_text (target, CENTER, &top_value, &bottom_value);
		printf (" *** %ld%% *** relevance %d-%d-%d *** peak %ld/%d ***\n",
					top_value * 100 / doc_length, rel_vec[top_value/quantum],
					rel_vec [rel_peaks[peaknum]], rel_vec[bottom_value/quantum],
					peaknum + 1, rel_peak_count);
		for (done = 0;;) {
			fgets (tmpstring, MAXSTRLEN, stdin);
			switch (tmpstring[0]) {
				case 'x':
				case 'X':	/* exit entire program? */
					printf ("Really exit RelRank now?  ");
					fgets (tmpstring,MAXSTRLEN, stdin);
					if (tmpstring[0] == 'y' || tmpstring[0] == 'Y')
						exit (0);
					break;
				case 'q':
				case 'Q':	/* finished with this query */
					done = 1;
					break;
				case ' ':
				case 'n':
				case 'N':	/* next page for this item */
					target = bottom_value;
					show_text (target, TOP, &top_value, &bottom_value);
					printf (" *** %ld%% *** relevance %d-%d ***\n",
							top_value * 100 / doc_length, rel_vec[top_value/quantum],
							rel_vec[bottom_value/quantum]);
					break;
				case 'p':
				case 'P':	/* previous page for this item */
					target = top_value;
					show_text (target, BOTTOM, &top_value, &bottom_value);
					printf (" *** %ld%% *** relevance %d-%d ***\n",
							top_value * 100 / doc_length, rel_vec[top_value/quantum],
							rel_vec[bottom_value/quantum]);
					break;
				case '\n':
				case '+':	/* subsequent peak */
					++peaknum;
					if (peaknum >= rel_peak_count) {
						printf ("\007 ***no more peaks! ***\007\n");
						--peaknum;
					}
					else {
						target = quantum * rel_peaks[peaknum];
						show_text (target, CENTER, &top_value, &bottom_value);
						printf (" *** %ld%% *** relevance %d-%d-%d *** peak %ld/%d ***\n",
							top_value * 100 / doc_length, rel_vec[top_value/quantum],
							rel_vec [rel_peaks[peaknum]], rel_vec[bottom_value/quantum],
							peaknum + 1, rel_peak_count);
					}
					break;
				case '-':	/* prior peak */
					--peaknum;
					if (peaknum < 0) {
						printf ("\007 *** already on first peak! ***\007\n");
						++peaknum;
					}
					else {
						target = quantum * rel_peaks[peaknum];
						show_text (target, CENTER, &top_value, &bottom_value);
						printf (" *** %ld%% *** relevance %d-%d-%d *** peak %ld/%d ***\n",
							top_value * 100 / doc_length, rel_vec[top_value/quantum],
							rel_vec [rel_peaks[peaknum]], rel_vec[bottom_value/quantum],
							peaknum + 1, rel_peak_count);
					}
					break;
				case 'h':
				case 'H':
				case '?':	/* help! */
					printf ("\nQUERY RESULT DISPLAY COMMANDS:\n");
					printf (" x,X          exit RelRank program\n");
					printf (" q,Q          quit this query\n");
					printf (" n,N,<space>  page down\n");
					printf (" p,P          page up\n");
					printf (" +,<return>   next relevance peak\n");
					printf (" -            previous relevance peak\n");
					printf (" h,H          display this help\n");
					break;
				default:	
					printf ("\007 *** unrecognized command! (%c=%d) ***\007\n",
								tmpstring[0], tmpstring[0]);
					break;
			}
			if (done)
				break;
		}
		printf ("New query (n)?  OR more terms (o)?  AND new terms with old (a)?\n");
		fgets (tmpstring, MAXSTRLEN, stdin);
		switch (tmpstring[0]) {
			case 'o':
			case 'O':
			case '|':	/* OR some more terms in without clearing rel_vec */
				printf ("Terms to OR into current query:  ");
				fgets (querystring, MAXSTRLEN, stdin);
				break;
			case 'a':
			case 'A':
			case '&':	/* AND a new query set into the current rel_vec */
				if ((rel_vec2 = (unsigned char *)calloc (rel_vec_length, 1)) == NULL) {
					printf ("\007Sorry, insufficient memory for query ANDing!\n");
					printf ("     (needed %ld bytes)\007\n", rel_vec_length);
					printf ("We now return you to your previous query\n");
					querystring[0] = '\0';
					break;
				}
				printf ("Terms to AND into current query:  ");
				fgets (querystring, MAXSTRLEN, stdin);
				qp = querystring;
				while (*qp != '\0') {
					qp = getquerykey (qp, &query_keyrec);
					this_keyrec = query_keyrec;
					lookupquerykey (&this_keyrec, &prev_keyrec);
					if (this_keyrec.kkey[0] == ' ')
						printf ("%.*s not found!\n", KEY_LENGTH, query_keyrec.kkey);
					else {
						printf ("%.*s occurs %ld times\n", KEY_LENGTH, this_keyrec.kkey,
									this_keyrec.ccount - prev_keyrec.ccount);
						printf ("Adding into new relevance vector...\n");
						for (n = prev_keyrec.ccount; n < this_keyrec.ccount; ++n) {
							p = get_ptr_rec (n);
							if (p == doc_length)
								printf ("\007Sorry, illegal ptr_rec fetch error! n=%ld\007\n",
											n);
							else
								add_to_rel_vec (p, rel_vec2);
						}
					}
				}
				printf ("ANDing old and new relevance vectors together...\n");
				for (n = 0; n < rel_vec_length; ++n) {
					/* normalize by dividing by 'value' (max single-term peak) */
					r = rel_vec[n] * rel_vec2[n] / value;
					if (r > 255) {
						printf ("WARNING!  Overflow in relevance vector calculation!\n");
						printf ("  at element %ld with rel_vec=%d and rel_vec2=%d\n",
									n, rel_vec[n], rel_vec2[n]);
						r = 255;
					}
					rel_vec[n] = r;
				}
				free (rel_vec2);
				/* again, do ugly kludge here for now ... set querystring to "  " */
				querystring[0] = '\0';
				break;
			case 'n':
			case 'N':	/* do a completely new query */
			default:
				free (rel_vec);
				if ((rel_vec = (unsigned char *)calloc (rel_vec_length, 1)) == NULL) {
					printf ("Sorry, unable to allocate relevance vector length %ld!\n",
						rel_vec_length);
					exit (1);
				}
				printf ("\nTerms for new query:  ");
				fgets (querystring, MAXSTRLEN, stdin);
				break;
		}
	}
  	exit (0);
}


/* utility to open a file by appending fn and ft together for convenience...
 * used to get *.k and *.p files associated with a docfile....
 */

FILE *open_file (fn, ft)
char *fn, *ft;  {
	FILE *fopen(), *fp;
	char fname[MAXSTRLEN];
	void exit();
	
	sprintf (fname, "%s%s", fn, ft);
	if ((fp = fopen (fname, "rb")) == NULL)
	  {
		printf ("\007Sorry, fatal error in attempt to open file \"%s\"!\007\n",
					fname);
		exit (1);
	  }
	
	return (fp);
}

/* take the word pointed to by qp and turn it to upper case,
 * strip trailing whitespace, pad with blanks, and put it into
 * query_keyrec.kkey .... advance qp to first char of next word, or
 * leave qp pointing to the '\0' at the end of querystring....
 */
char *getquerykey (qp, query_keyrecp)
char *qp;
KEY_REC *query_keyrecp;
{
	int i;
	char c;
	
	/* skip any leading whitespace */
	while (isspace(*qp))
		++qp;
	for (i = 0; i < KEY_LENGTH; ++i)
		query_keyrecp->kkey[i] = ' ';
	for (i = 0; i < KEY_LENGTH; ++i) {
		c = *qp++;
		if (!isalnum(c))
			break;
		if (islower(c))
			c = toupper(c);
		query_keyrecp->kkey[i] = c;
	}
	/* skip any trailing whitespace */
	while (isspace (*qp))
		++qp;
	return (qp);
}

/* look up the word in this_keyrec.kkey in the *.k key file and put it,
 * and previous, keyrecs in this_keyrec and prev_keyrec; if the key is not
 * found, null keyrecs will be returned....
 */
void lookupquerykey (this_keyrecp, prev_keyrecp)
KEY_REC *this_keyrecp, *prev_keyrecp;
{
	unsigned long keyrecnum;
	
	keyrecnum = find_key (this_keyrecp->kkey);
	get_key_rec (this_keyrecp, keyrecnum);
	if (keyrecnum == 0)
		get_key_rec (prev_keyrecp, max_key_recnum);
	else
		get_key_rec (prev_keyrecp, keyrecnum - 1);
	return;
}

  
/* function to get the 'n'th KEY_REC from key_file and store it in the
 * KEY_REC space pointed to by recp ... note that if an illegal value
 * of 'n' is requested, a 'null' KEY_REC is produced....
 */
 
void get_key_rec (keyrecp, n)
KEY_REC *keyrecp;
unsigned long n; {
	int i;
    
    if (n >= max_key_recnum) {
        for (i = 0; i < KEY_LENGTH; ++i)
        	keyrecp->kkey[i] = ' ';
        keyrecp->ccount = 0;
    }
    else if (fseek (key_file, sizeof (KEY_REC) * n, 0) != 0) {
        printf ("\007Sorry, fatal error in fseek() getting key record #%ld!\007\n",
        			n);
        exit(1);
    }
    else if (fread ((char *)keyrecp, sizeof(KEY_REC), 1, key_file) == 0) {
        printf ("\007Sorry, fatal error in fread() getting key record #%ld!\007\n",
        			n);
        exit(1);
    }
    return;
}


/* routine to get the nth pointer record (an unsigned long integer) from ptr_file;
 * the pointer record gives the actual offset in the database (document) file of
 * the nth word in the expanded index.... Note that if an illegal value of n is
 * requested, the value returned is equal to doc_length, as a sign that something
 * funny has happened....
 */
 
PTR_REC get_ptr_rec (n)
long n;
{
    PTR_REC p;
    
    if (n >= max_ptr_recnum) {
    	p = doc_length;
    }
    if (fseek (ptr_file, sizeof (PTR_REC) * n, 0) != 0) {
        printf ("\007Sorry, fatal error in fseek() getting pointer record #%ld!\007\n",
        				n);
        exit(1);
    }
    else if (fread ((char *)&p, sizeof (PTR_REC), 1, ptr_file) == 0) {
        printf ("\007Sorry, fatal error in fread() getting pointer record #%ld!\007\n",
        				n);
        exit(1);
    }
    return (p);
}
  

/* function to find a target key ... does a simple binary search
 * (see K&R p.54).... returns keyrecnum, or on failure max_key_recnum....
 */

long find_key (targetkey)
  char targetkey[];
  {
    int c, diff;
    long low, high, mid;
    KEY_REC this_item;
    
    low = 0;
    high = max_key_recnum - 1;
    while (low <= high)
      {
        mid = (low + high) / 2;
        get_key_rec (&this_item, mid);
        diff = zstrcmp (targetkey, this_item.kkey);
        if (diff < 0)
            high = mid - 1;
        else if (diff > 0)
            low = mid + 1;
        else
          break;
      }
	if (diff != 0)
		mid = max_key_recnum;
    return (mid);
}


/* my function to compare two strings and give a result as to which is
 * alphabetically earlier.  Note that this is almost the same as strncmp()
 * with the fixed value of KEY_LENGTH as the maximum comparison distance,
 * except that I must be sure to mask the characters to make them positive
 * (since we want to be able to handle the non-ASCII funny letters in
 * the Apple character set properly/consistently).  If the masking isn't
 * done, then inconsistent results can occur with those non-ASCII chars!
 */

int zstrcmp (s1, s2)
char *s1, *s2;
{
	int n = KEY_LENGTH;
    
    for (; --n && ((*s1 & 0xFF) == (*s2 & 0xFF)); s1++, s2++)
        if (!*s1) break;
    return ((*s1 & 0xFF) - (*s2 & 0xFF));
}

/* add into a relevance vector the relevance-function for a single word located at
 * "p" in the database ... logical "OR" operation....
 *
 * This relevance-function is stored in the global array rel_shape[].  The array
 * starts with index 0, has its peak at index rel_shape_peak, and ends before the
 * element rel_shape_end....
 *
 * METHOD:  calculate the range of locations in the relevance
 * vector to be added into, and the offset (index) to start at in rel_shape[], then
 * just do the fetching & adding.
 *
 * SAFETY HINT:  check to avoid overflows, and warn
 * thereof (or maybe offer the option to re-scale the data in real time? someday??)
 */


void add_to_rel_vec (p, rel_vec)
PTR_REC p;
unsigned char rel_vec[];
{
	long int i, peakindex, i0, i1, j;
	int r;
	
	peakindex = p / quantum;
	i0 = peakindex - rel_shape_peak;
	i1 = i0 + rel_shape_end;
	j = 0;
	if (i0 < 0) {
		j = -i0;
		i0 = 0;
	}
	if (i1 > rel_vec_length)
		i1 = rel_vec_length;
	for (i = i0; i < i1; ++i, ++j) {
		r = rel_vec[i] + rel_shape[j];
		if (r > 255) {
			printf ("WARNING!  Overflow in relevance vector calculation!\n");
			printf ("  in add_to_rel_vec() with p=%ld, i=%ld, rel_vec[i]=%d-->%d\n",
						p, i, rel_vec[i], r);
			r = 255;
		}
		rel_vec[i] = r;
	}
	return;
}


/* default way to initialize the relevance shape function rel_shape[], a global
 * array of unsigned char values ... use a triangular peak of height value and
 * half-width influence (bytes); quantize everything with the chosen quantum
 * of resolution....  Also set rel_shape_peak and rel_shape_end values here,
 * globals where rel_shape_peak is the center (and highest point) of the shape,
 * and rel_shape_end is one greater than the index of the last element to use....
 */

void fill_rel_shape (value, influence)
int value, influence;
{
	int i;
	
	if (quantum <= 0 || value <= 0 || value > 255 ||
			2*influence/quantum+1 >= MAXRELSHAPELEN) {
		printf ("\007Sorry, fatal error in fill_rel_shape()!\007\n");
		printf ("  (quantum=%d, value=%d, influence=%d)\n", quantum, value, influence);
		exit (1);
	}
	
	rel_shape_peak = influence / quantum - 1;
	rel_shape_end = 2 * rel_shape_peak + 1;
	for (i = 0; i <= rel_shape_peak; ++i)
		rel_shape[i] = ((1+i) * value * quantum) / influence;
	for (i = rel_shape_peak+1; i < rel_shape_end; ++i)
		rel_shape[i] = rel_shape[2*rel_shape_peak-i];
	return;
}


/* find the peaks of the relevance vector rel_vec and count them; store peaks
 * (local maxima) in rel_peaks[], but don't do more than MAXRELPEAKCOUNT of them...
 * return the number of peaks found and stored in rel_peaks[] ... take special
 * care to handle first and last elements nicely!
 */

int find_rel_peaks (rel_vec, rel_peaks)
unsigned char *rel_vec;
long *rel_peaks;
{
	int c;
	long n;
	
	c = 0;
	if (rel_vec[0] >= rel_vec[1] && rel_vec[0] > 0)
		rel_peaks[c++] = 0;
	for (n = 1; n < rel_vec_length-1; ++n)
		if (rel_vec[n] > rel_vec[n-1] && rel_vec[n] >= rel_vec[n+1]) {
			rel_peaks[c++] = n;
			if (c == MAXRELPEAKS) {
				printf ("Warning! --- rel_peaks array full at %d items!\n", c);
				printf (" Peaks beyond %ld bytes in db not used!\n", n * quantum);
				break;
			}
		}
	if (rel_vec[rel_vec_length-1] > rel_vec[rel_vec_length-2] && c < MAXRELPEAKS)
		rel_peaks[c++] = rel_vec_length-1;
	return (c);
}

/* sort the peaks of the relevance vector into descending order based on
 * their height, so that most relevant items come first.... 
 */

void sort_rel_peaks (rel_vec, rel_peaks, rel_peak_count)
unsigned char *rel_vec;
long *rel_peaks;
int rel_peak_count;
{
	zpqsort (0, rel_peak_count, rel_vec, rel_peaks);
	return;
}


/* my quicksort to sort the rel_peaks ptr array ... 
 *  sort elements "first" through "last"
 */

void zpqsort (first, last, rel_vec, rel_peaks)
int first, last;
unsigned char *rel_vec;
long *rel_peaks;
{
	int i, j;
	long tmp;
 
    while (last - first > 1) {
    	i = first;
    	j = last;
        for (;;) {
            while (++i < last && rel_vec[rel_peaks[i]] > rel_vec[rel_peaks[first]])
                ;
            while (--j > first && rel_vec[rel_peaks[j]] < rel_vec[rel_peaks[first]])
                ;
            if (i >= j)
                 break;
        	tmp = rel_peaks[i];
            rel_peaks[i] = rel_peaks[j];
            rel_peaks[j] = tmp;
		}
        tmp = rel_peaks[first];
        rel_peaks[first] = rel_peaks[j];
        rel_peaks[j] = tmp;
        if (j - first < last - (j + 1)) {
            zpqsort (first, j, rel_vec, rel_peaks);
            first = j + 1;
     	}
        else {
            zpqsort (j + 1, last, rel_vec, rel_peaks);
            last = j;
		}
	}
}


/* display a screen full (MAXLINES) of text --- with target byte at TOP, CENTER,
 * or BOTTOM as 'how' parameter requests; for CENTERED, try to back up in the
 * text a distance MAXLINES/2 lines, counting either MAXLINELEN chars or a '\n'
 * as a line-equivalent ... and set the values of top_value and bottom_value
 * for the screen, so that paging up/down can be supported....
 *
 * note that a little silliness here is needed to handle the use of '\r' for Macintosh
 * users in place of '\n', and the presence of '\r' before '\n' for PC users ... ugh!
 * Try to do this more or less right most of the time by outputting literally what
 * the file contains and by counting '\r' or '\r\n' as a single '\n'....ugh!
 *
 * this needs to be tested on diverse machines and perhaps rewritten someday?!
 *
 * there may be a nonfatal bug for small files with lots of newlines?!
 *
 * do this with pointers instead of array indices?  can't imagine that efficiency
 * is a question here; clarity & correctness is more critical, I contend.... :-)
 *
 * filter out control characters from output too, for compatibility with binary
 * files?!  make this optional someday??
 */

void show_text (target, how, top_valuep, bottom_valuep)
long target;
int how;
long *top_valuep, *bottom_valuep;
{
	char txtbuf[MAXLINES*MAXLINELEN];
	int c, i;
	long n, tp, tp0, b0;
	
	if (how == TOP)
		b0 = target;
	else if (how == CENTER)
		b0 = target - MAXLINES * MAXLINELEN / 2;
	else if (how == BOTTOM)
		b0 = target - MAXLINES * MAXLINELEN;
	
	if (b0 < 0)
		b0 = 0;
	if (b0 >= doc_length) {
		printf ("\007 *** already at bottom of file! ***\007\n");
		return;
	}
	if (fseek (doc_file, b0, 0) != 0) {
        printf ("\007Sorry, fatal error in fseek() in show_text(), b0=%ld!\007\n",
        			b0);
        exit(1);
    }
    if ((n = fread (txtbuf, sizeof (char), MAXLINES*MAXLINELEN, doc_file)) == 0) {
        printf ("\007Sorry, fatal error in fread() in show_text()!\007\n");
        exit(1);
    }
	/* we have read n chars into txtbuf beginning at docfile byte b0;
	 * now we must set the text buffer ptr tp to the target word's index in the buffer,
	 * and, if CENTER or BOTTOM is requested, back up ... i counts chars
	 * on a given line, and c counts down newlines (or groups of MAXLINELEN chars in i)
	 * with checks for '\r' or '\r\n' as equivalents... even if TOP is the
	 * request, try to back up 1 line, for overlap and context....
	 */
	if (how == TOP)
		c = 1;
	else if (how == CENTER)
		c = MAXLINES / 2;
	else if (how == BOTTOM)
		c = MAXLINES;
	for (tp0 = target - b0, i = 0; tp0 > 0; --tp0) {
		if (txtbuf[tp0] == '\n' || txtbuf[tp0] == '\r' || ++i == MAXLINELEN) {
			--c;
			i = 0;
			if (tp0 > 0 && txtbuf[tp0] == '\n' && txtbuf[tp0-1] == '\r')
				--tp0;		/* for PC users who have '\r\n' pairs */
		}
		if (c == 0)
			break;
	}
	/* now print out MAXLINES lines, starting with txtbuf[tp0], counting up in c
	 */
	for (tp = tp0, i = c = 0; tp < n; ++tp) {
		if (iscntrl(txtbuf[tp]))
			putchar (' ');
		else
			putchar (txtbuf[tp]);
		if (txtbuf[tp] == '\n' || txtbuf[tp] == '\r' || ++i == MAXLINELEN) {
			++c;
			i = 0;
			printf ("\n");	/* this should work on any system, UNIX or Mac or PC?! */
			if (txtbuf[tp] == '\r' && txtbuf[tp+1] == '\n')
				++tp;		/* for PC users... */
			if (c == MAXLINES)
				break;
		}
	}
	/* now set top_value and bottom_value properly...
	 */
	*top_valuep = tp0 + b0;
	*bottom_valuep = tp + b0;
}

