/* relevance-ranker --- generic tty-style version --- 1993 July --- ^z * FREE SOFTWARE under GNU GPL! (c) 1993 Mark Zimmermann * * CONCEPT: open an indexed database file (document file, *.k & *.p index files) * and give "queries" which might look like: * ZIMMERMANN (areas with lots of "ZIMMERMANN" in them) * FREE | TEXT (areas with lots of either "FREE" or "TEXT") * FREE & TEXT (areas with both "FREE" and "TEXT", the more the better) * ((FREE & TEXT) | ((INFORMATION | DATA) & RETRIEVAL)) ! ZIMMERMANN * (where ! means "AND NOT" perhaps)... etc.... * * (We also must allow run-time alteration of defaults --- key parameters are the * peak "value" of a term and its "influence" range. Higher value terms count for * more (linearly); higher influence terms have a longer distance over which * they are important. See below for details...) * * RESULT of entering a query is a set of relevance-ranked "hits" --- local * maxima of the estimated-relevance function (see below for details). Show * the user information about them on the screen, and allow paging (the way * "more" does) up & down through a hit --- maybe with emphasis for key terms * in the query (highlighting, flip upper/lower case, whatever).... * * METHOD: set up a relevance-vector with an element for each N bytes of the * database; typically use a vector of unsigned chars to store values 0-255 for * relevance values (representing relevances from 0 to 1? --- or some other * scoring method??); set N to some convenient value, probably a power of 2 * perhaps 64. (For N=64, figure that 1 MB or so of free memory could handle * a 32 MB or so database; probably will need to have a couple of relevance * vectors in memory at once, to do the "&" and "|" and "!" operations?) * * For each query term, look up the number of occurrences (in the *.k key file) * and their location (byte offset from start of the doc file, in the *.p ptr file); * start out with an empty (all-zero) relevance vector rel_vec; go through the * rel_vec and add (or multiply, or whatever) in the influence of each instance of * a word. * * For each term, let the default be a "value" of 10 and an "influence" range * of 640 bytes ... default the shape of the relevance-function for each term * to a triangle, so that for quantum N=64 bytes, a single term in isolation * produces a relevance-vector that locally looks like: * ... 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 0, ... * * Thus, for a single term, go down the relevance-vector and add in the above * pattern (or whatever modification is chosen to the default) for each instance * of the term. (May have to watch out for overflows of the 0-255 range value * that each rel_vec element?! Renormalization??) For other terms which are * OR'd (|) into the same relevance set, add their occurrences in as well.... * * For terms which are AND'd (&) in, accumulate their relevance values in a * separate rel_vector, and then multiply the rel_vectors together. (Again, * better watch out for overflows --- renormalize at need?!) * * For terms which are NOT'd (!) in, compute a rel_vector in the usual way, * but then invert it by finding the peak value and subtracting every element * from that peak, before AND'ng (multiplying rel_vectors). * * When a quantized relevance vector is computed for a given query, scan down * the rel_vec and create a list of pointers to the local maxima. Then sort * those pointers into order based on the relevance value associated with each * peak, and present the data in order of estimated relevance.... * * FIRST EFFORT: single-term relevance ranking, no frills!.... * SECOND: OR'd term set (single set).... * THIRD: AND'd groups of OR'd term sets.... * FOURTH: NOT'd terms in addition to above.... * FIFTH: wild cards (esp. terminal wild cards)... * etc. --- more features to add.... better "more"-style paging up/down, to * target strings & regexps maybe (like "isearch" in Emacs).... * * User interaction for this could be pretty simply done by asking questions; * a sample session might look like: * * PLEASE ENTER SETS OF EQUIVALENT QUERY TERMS: * SET 1> frog frogs toad toads * SET 2> predator prey eat consume diet * SET 3> * * QUERY: (FROG|FROGS|TOAD|TOADS)&(PREDATOR|PREY|EAT|CONSUME|DIET) * * 98 FROG * 17 FROGS * 49 TOAD * 8 TOADS * 976 PREDATOR * 127 PREY * 13682 EAT * 241 CONSUME * 496 DIET * * 107 MAXIMUM RELEVANCE VALUE * 57 RELEVANCE PEAKS * * < to move down a screenfull, "b" to go back a screenfull, * "n" to go to next relevant location, "p" to go to previous relevant location, * "q" to quit back to query-entry screen, maybe....>> * * THOUGHT: use SIOD, or some other tiny Scheme dialect, for an extension language, * to let the user define other relevance function shapes at runtime, define macros, * etc.?! * * THOUGHT: allow for loading of a configuration file, to set defaults up differently * if desired? Examples of parameters that need to be set at run-time for some * users: screen height & width, filtering of control characters, etc.? Someday, * worry about alternative character-mappings (*.a file) and multifile db's (*.f * file) for extended indexer/browser work.... * * THOUGHT: allow thesaurus-like lists of synonyms? Stemming? More complex wild * card options than terminal wildcards? Regexps? Some of this might be slow, if * it requires scanning through the entire *.k keyword list to do a regexp match.... * * DESIGN COMMENTS: I would like this to have maximal compatibility with various * machines ... Think C has 2-byte ints on the Mac, in default mode, so some things * that may have more than 32767 items (such as the relevance vectors) have to be * subscripted with longs .... * * NICE TO HAVE: highlighting of terms that went into the query --- but how? * Could keep track of the values looked-up in *.p file and highlight all of them * as they come out, but that might be expensive, tricky, and costly of memory.... * Could watch the output stream and highlight matches as they are being printed, * but that might be very tricky to pull off.... * * THOUGHT: in LISP/Scheme, could easily (?) keep a list of the *.p values and * could easily (?) merge in new ones (until we run out of memory?) in sorted * order ... and then could build the relevance vector by running down that list, * maybe even recognizing peaks as they occur without keeping the big vector in * actual memory and sorting it as in the C version? All that is needed is to * have fseek() and fread(), which we seem to have in SIOD?! * * SECOND THOUGHTS: maybe I should just do all of the previous paragraph in C? * There seem to be significant advantages to the fixed-quantization relevance * vector approach for big databases, as it is guaranteed not to grow huge when * fairly common words enter into a relevance judgment (though the rel_peaks[] * does have a ceiling...) * */ #include #include #include #include #ifdef THINK_C #include /* to get command-line args for Mac users */ #endif /* header definitions */ #define default_quantum 64 #define default_value 10 #define default_influence 640 #define MAXSTRLEN 1024 #define MAXRELSHAPELEN 1024 #define MAXRELPEAKS 1024 #define MAXLINES 24 #define MAXLINELEN 80 #define KEY_LENGTH 28 #define PTR_REC long #define TOP -1 #define CENTER 0 #define BOTTOM 1 /* structure of the records in the index *.k "key" file: * a fixed-length character string "kkey", filled out with blanks and * containing the unique alphanumeric 'words' in the document file, * changed to all-capital letters (default character mapping); * a cumulative count "ccount" of how many total occurrences of words, including * the current one, have happened up to this point in the alphabetized * index. * * Obviously, the number of occurrences of the nth word is just the * difference between the nth cumulative count and the (n-1)th * cumulative count. Also, the (n-1)th cumulative count is a * pointer to the first occurrence of the nth word in the *.p "ptr" file * (which holds all the index offset pointers for the document file). */ typedef struct { char kkey[KEY_LENGTH]; long ccount; } KEY_REC; /* global variables: */ FILE *doc_file, *key_file, *ptr_file; /* file pointers to document, *.k, *.p files */ long max_key_recnum, max_ptr_recnum, doc_length, rel_vec_length; unsigned char *rel_vec, *rel_vec2; unsigned char rel_shape[MAXRELSHAPELEN]; /* relevance function shape */ int rel_shape_peak, rel_shape_end, quantum, rel_peak_count; long rel_peaks[MAXRELPEAKS]; /* locations of peak relevances in rel_vec */ /* prototypes: */ void main (int argc, char *argv[]); FILE *open_file (char *fn, char *ft); char *getquerykey (char *qp, KEY_REC *query_keyrecp); void lookupquerykey (KEY_REC *this_keyrecp, KEY_REC *prev_keyrecp); void get_key_rec (KEY_REC *keyrecp, unsigned long n); PTR_REC get_ptr_rec (long n); long find_key (char *targetkey); int zstrcmp (char *s1, char *s2); void add_to_rel_vec (PTR_REC p, unsigned char *rel_vec); void fill_rel_shape (int value, int influence); int find_rel_peaks (unsigned char *rel_vec, long *rel_peaks); void sort_rel_peaks (unsigned char *rel_vec, long *rel_peaks, int rel_peak_count); void zpqsort (int first, int last, unsigned char *rel_vec, long *rel_peaks); void show_text (long target, int how, long *top_valuep, long *bottom_valuep); /* actual program: */ void main (argc, argv) int argc; char *argv[]; { int value, influence, done, r; long n, peaknum, target, top_value, bottom_value; FILE *open_file(); int ccommand(); char querystring[MAXSTRLEN], tmpstring[MAXSTRLEN], *qp; KEY_REC query_keyrec, this_keyrec, prev_keyrec; PTR_REC p, get_ptr_rec (); void exit(), fill_rel_shape(), getfirstquerykey(), lookupquerykey(); /* get UNIX-like argc & argv command-line arguments for the Mac user */ #ifdef THINK_C argc = ccommand(&argv); #endif if (argc != 2) { printf ("Sorry, wrong number of command line arguments!\n"); printf (" (need the name of an indexed database docfile)\n"); exit(0); } quantum = default_quantum; value = default_value; influence = default_influence; fill_rel_shape (value, influence); /* for starters, demand that index *.k and *.p files are named in the default * way, in the same directory as doc file --- allow overrides & graceful failure * later! */ doc_file = open_file (argv[1], ""); key_file = open_file (argv[1], ".k"); ptr_file = open_file (argv[1], ".p"); fseek (doc_file, 0L, 2); doc_length = ftell (doc_file); fseek (key_file, 0L, 2); max_key_recnum = ftell (key_file) / sizeof(KEY_REC); fseek (ptr_file, 0L, 2); max_ptr_recnum = ftell (ptr_file) / sizeof(long); printf ("Welcome to RelRank v. 0.0002, July 1993!\n"); printf ("This is free software under the GNU GPL.\n"); printf (" (c)1993 Mark Zimmermann\n\n\n"); printf ("Database \"%s\" contains:\n", argv[1]); printf ("%9ld characters\n%9ld total words\n%9ld distinct words\n\n", doc_length, max_ptr_recnum, max_key_recnum); /* put the guts here for now ... clean up & move out to functions later... * ...but it is ugly for now --- sincere apologies! */ rel_vec_length = 1 + doc_length / quantum; if ((rel_vec = (unsigned char *)calloc (rel_vec_length, 1)) == NULL) { printf ("Sorry, unable to allocate relevance vector length %ld!\n", rel_vec_length); exit (1); } printf ("Query terms (to OR together): "); fgets (querystring, MAXSTRLEN, stdin); while (strlen (querystring) != 1) { /* an empty querystring as entered is "\n", of length 1, and triggers 'quit' */ qp = querystring; while (*qp != '\0') { /* move down query string with query pointer qp, which points to the next word to be looked up in the index */ qp = getquerykey (qp, &query_keyrec); this_keyrec = query_keyrec; lookupquerykey (&this_keyrec, &prev_keyrec); /* look up in *.k file */ if (this_keyrec.kkey[0] == ' ') printf ("%.*s not found!\n", KEY_LENGTH, query_keyrec.kkey); else { printf ("%.*s occurs %ld times\n", KEY_LENGTH, this_keyrec.kkey, this_keyrec.ccount - prev_keyrec.ccount); printf ("Adding into relevance vector...\n"); for (n = prev_keyrec.ccount; n < this_keyrec.ccount; ++n) { p = get_ptr_rec (n); if (p == doc_length) printf ("\007Sorry, illegal ptr_rec fetch error! n=%ld\007\n", n); else add_to_rel_vec (p, rel_vec); } } } printf ("Finding peak relevances...\n"); rel_peak_count = find_rel_peaks (rel_vec, rel_peaks); printf ("...%d peaks found; sorting peaks...\n", rel_peak_count); sort_rel_peaks (rel_vec, rel_peaks, rel_peak_count); /* leave this out for now... maybe good for testing? For optional user display?? printf ("Most relevant areas of the database are:\n"); printf (" peak value\n"); for (n = 0; n < rel_peak_count; ++n) printf ("%6ld %3d\n", quantum * rel_peaks[n], rel_vec[rel_peaks[n]]); */ /* main user interaction loop for this query */ printf (" *** press to display query results ***\n"); fgets (tmpstring, MAXSTRLEN, stdin); peaknum = 0; target = quantum * rel_peaks[peaknum]; show_text (target, CENTER, &top_value, &bottom_value); printf (" *** %ld%% *** relevance %d-%d-%d *** peak %ld/%d ***\n", top_value * 100 / doc_length, rel_vec[top_value/quantum], rel_vec [rel_peaks[peaknum]], rel_vec[bottom_value/quantum], peaknum + 1, rel_peak_count); for (done = 0;;) { fgets (tmpstring, MAXSTRLEN, stdin); switch (tmpstring[0]) { case 'x': case 'X': /* exit entire program? */ printf ("Really exit RelRank now? "); fgets (tmpstring,MAXSTRLEN, stdin); if (tmpstring[0] == 'y' || tmpstring[0] == 'Y') exit (0); break; case 'q': case 'Q': /* finished with this query */ done = 1; break; case ' ': case 'n': case 'N': /* next page for this item */ target = bottom_value; show_text (target, TOP, &top_value, &bottom_value); printf (" *** %ld%% *** relevance %d-%d ***\n", top_value * 100 / doc_length, rel_vec[top_value/quantum], rel_vec[bottom_value/quantum]); break; case 'p': case 'P': /* previous page for this item */ target = top_value; show_text (target, BOTTOM, &top_value, &bottom_value); printf (" *** %ld%% *** relevance %d-%d ***\n", top_value * 100 / doc_length, rel_vec[top_value/quantum], rel_vec[bottom_value/quantum]); break; case '\n': case '+': /* subsequent peak */ ++peaknum; if (peaknum >= rel_peak_count) { printf ("\007 ***no more peaks! ***\007\n"); --peaknum; } else { target = quantum * rel_peaks[peaknum]; show_text (target, CENTER, &top_value, &bottom_value); printf (" *** %ld%% *** relevance %d-%d-%d *** peak %ld/%d ***\n", top_value * 100 / doc_length, rel_vec[top_value/quantum], rel_vec [rel_peaks[peaknum]], rel_vec[bottom_value/quantum], peaknum + 1, rel_peak_count); } break; case '-': /* prior peak */ --peaknum; if (peaknum < 0) { printf ("\007 *** already on first peak! ***\007\n"); ++peaknum; } else { target = quantum * rel_peaks[peaknum]; show_text (target, CENTER, &top_value, &bottom_value); printf (" *** %ld%% *** relevance %d-%d-%d *** peak %ld/%d ***\n", top_value * 100 / doc_length, rel_vec[top_value/quantum], rel_vec [rel_peaks[peaknum]], rel_vec[bottom_value/quantum], peaknum + 1, rel_peak_count); } break; case 'h': case 'H': case '?': /* help! */ printf ("\nQUERY RESULT DISPLAY COMMANDS:\n"); printf (" x,X exit RelRank program\n"); printf (" q,Q quit this query\n"); printf (" n,N, page down\n"); printf (" p,P page up\n"); printf (" +, next relevance peak\n"); printf (" - previous relevance peak\n"); printf (" h,H display this help\n"); break; default: printf ("\007 *** unrecognized command! (%c=%d) ***\007\n", tmpstring[0], tmpstring[0]); break; } if (done) break; } printf ("New query (n)? OR more terms (o)? AND new terms with old (a)?\n"); fgets (tmpstring, MAXSTRLEN, stdin); switch (tmpstring[0]) { case 'o': case 'O': case '|': /* OR some more terms in without clearing rel_vec */ printf ("Terms to OR into current query: "); fgets (querystring, MAXSTRLEN, stdin); break; case 'a': case 'A': case '&': /* AND a new query set into the current rel_vec */ if ((rel_vec2 = (unsigned char *)calloc (rel_vec_length, 1)) == NULL) { printf ("\007Sorry, insufficient memory for query ANDing!\n"); printf (" (needed %ld bytes)\007\n", rel_vec_length); printf ("We now return you to your previous query\n"); querystring[0] = '\0'; break; } printf ("Terms to AND into current query: "); fgets (querystring, MAXSTRLEN, stdin); qp = querystring; while (*qp != '\0') { qp = getquerykey (qp, &query_keyrec); this_keyrec = query_keyrec; lookupquerykey (&this_keyrec, &prev_keyrec); if (this_keyrec.kkey[0] == ' ') printf ("%.*s not found!\n", KEY_LENGTH, query_keyrec.kkey); else { printf ("%.*s occurs %ld times\n", KEY_LENGTH, this_keyrec.kkey, this_keyrec.ccount - prev_keyrec.ccount); printf ("Adding into new relevance vector...\n"); for (n = prev_keyrec.ccount; n < this_keyrec.ccount; ++n) { p = get_ptr_rec (n); if (p == doc_length) printf ("\007Sorry, illegal ptr_rec fetch error! n=%ld\007\n", n); else add_to_rel_vec (p, rel_vec2); } } } printf ("ANDing old and new relevance vectors together...\n"); for (n = 0; n < rel_vec_length; ++n) { /* normalize by dividing by 'value' (max single-term peak) */ r = rel_vec[n] * rel_vec2[n] / value; if (r > 255) { printf ("WARNING! Overflow in relevance vector calculation!\n"); printf (" at element %ld with rel_vec=%d and rel_vec2=%d\n", n, rel_vec[n], rel_vec2[n]); r = 255; } rel_vec[n] = r; } free (rel_vec2); /* again, do ugly kludge here for now ... set querystring to " " */ querystring[0] = '\0'; break; case 'n': case 'N': /* do a completely new query */ default: free (rel_vec); if ((rel_vec = (unsigned char *)calloc (rel_vec_length, 1)) == NULL) { printf ("Sorry, unable to allocate relevance vector length %ld!\n", rel_vec_length); exit (1); } printf ("\nTerms for new query: "); fgets (querystring, MAXSTRLEN, stdin); break; } } exit (0); } /* utility to open a file by appending fn and ft together for convenience... * used to get *.k and *.p files associated with a docfile.... */ FILE *open_file (fn, ft) char *fn, *ft; { FILE *fopen(), *fp; char fname[MAXSTRLEN]; void exit(); sprintf (fname, "%s%s", fn, ft); if ((fp = fopen (fname, "rb")) == NULL) { printf ("\007Sorry, fatal error in attempt to open file \"%s\"!\007\n", fname); exit (1); } return (fp); } /* take the word pointed to by qp and turn it to upper case, * strip trailing whitespace, pad with blanks, and put it into * query_keyrec.kkey .... advance qp to first char of next word, or * leave qp pointing to the '\0' at the end of querystring.... */ char *getquerykey (qp, query_keyrecp) char *qp; KEY_REC *query_keyrecp; { int i; char c; /* skip any leading whitespace */ while (isspace(*qp)) ++qp; for (i = 0; i < KEY_LENGTH; ++i) query_keyrecp->kkey[i] = ' '; for (i = 0; i < KEY_LENGTH; ++i) { c = *qp++; if (!isalnum(c)) break; if (islower(c)) c = toupper(c); query_keyrecp->kkey[i] = c; } /* skip any trailing whitespace */ while (isspace (*qp)) ++qp; return (qp); } /* look up the word in this_keyrec.kkey in the *.k key file and put it, * and previous, keyrecs in this_keyrec and prev_keyrec; if the key is not * found, null keyrecs will be returned.... */ void lookupquerykey (this_keyrecp, prev_keyrecp) KEY_REC *this_keyrecp, *prev_keyrecp; { unsigned long keyrecnum; keyrecnum = find_key (this_keyrecp->kkey); get_key_rec (this_keyrecp, keyrecnum); if (keyrecnum == 0) get_key_rec (prev_keyrecp, max_key_recnum); else get_key_rec (prev_keyrecp, keyrecnum - 1); return; } /* function to get the 'n'th KEY_REC from key_file and store it in the * KEY_REC space pointed to by recp ... note that if an illegal value * of 'n' is requested, a 'null' KEY_REC is produced.... */ void get_key_rec (keyrecp, n) KEY_REC *keyrecp; unsigned long n; { int i; if (n >= max_key_recnum) { for (i = 0; i < KEY_LENGTH; ++i) keyrecp->kkey[i] = ' '; keyrecp->ccount = 0; } else if (fseek (key_file, sizeof (KEY_REC) * n, 0) != 0) { printf ("\007Sorry, fatal error in fseek() getting key record #%ld!\007\n", n); exit(1); } else if (fread ((char *)keyrecp, sizeof(KEY_REC), 1, key_file) == 0) { printf ("\007Sorry, fatal error in fread() getting key record #%ld!\007\n", n); exit(1); } return; } /* routine to get the nth pointer record (an unsigned long integer) from ptr_file; * the pointer record gives the actual offset in the database (document) file of * the nth word in the expanded index.... Note that if an illegal value of n is * requested, the value returned is equal to doc_length, as a sign that something * funny has happened.... */ PTR_REC get_ptr_rec (n) long n; { PTR_REC p; if (n >= max_ptr_recnum) { p = doc_length; } if (fseek (ptr_file, sizeof (PTR_REC) * n, 0) != 0) { printf ("\007Sorry, fatal error in fseek() getting pointer record #%ld!\007\n", n); exit(1); } else if (fread ((char *)&p, sizeof (PTR_REC), 1, ptr_file) == 0) { printf ("\007Sorry, fatal error in fread() getting pointer record #%ld!\007\n", n); exit(1); } return (p); } /* function to find a target key ... does a simple binary search * (see K&R p.54).... returns keyrecnum, or on failure max_key_recnum.... */ long find_key (targetkey) char targetkey[]; { int c, diff; long low, high, mid; KEY_REC this_item; low = 0; high = max_key_recnum - 1; while (low <= high) { mid = (low + high) / 2; get_key_rec (&this_item, mid); diff = zstrcmp (targetkey, this_item.kkey); if (diff < 0) high = mid - 1; else if (diff > 0) low = mid + 1; else break; } if (diff != 0) mid = max_key_recnum; return (mid); } /* my function to compare two strings and give a result as to which is * alphabetically earlier. Note that this is almost the same as strncmp() * with the fixed value of KEY_LENGTH as the maximum comparison distance, * except that I must be sure to mask the characters to make them positive * (since we want to be able to handle the non-ASCII funny letters in * the Apple character set properly/consistently). If the masking isn't * done, then inconsistent results can occur with those non-ASCII chars! */ int zstrcmp (s1, s2) char *s1, *s2; { int n = KEY_LENGTH; for (; --n && ((*s1 & 0xFF) == (*s2 & 0xFF)); s1++, s2++) if (!*s1) break; return ((*s1 & 0xFF) - (*s2 & 0xFF)); } /* add into a relevance vector the relevance-function for a single word located at * "p" in the database ... logical "OR" operation.... * * This relevance-function is stored in the global array rel_shape[]. The array * starts with index 0, has its peak at index rel_shape_peak, and ends before the * element rel_shape_end.... * * METHOD: calculate the range of locations in the relevance * vector to be added into, and the offset (index) to start at in rel_shape[], then * just do the fetching & adding. * * SAFETY HINT: check to avoid overflows, and warn * thereof (or maybe offer the option to re-scale the data in real time? someday??) */ void add_to_rel_vec (p, rel_vec) PTR_REC p; unsigned char rel_vec[]; { long int i, peakindex, i0, i1, j; int r; peakindex = p / quantum; i0 = peakindex - rel_shape_peak; i1 = i0 + rel_shape_end; j = 0; if (i0 < 0) { j = -i0; i0 = 0; } if (i1 > rel_vec_length) i1 = rel_vec_length; for (i = i0; i < i1; ++i, ++j) { r = rel_vec[i] + rel_shape[j]; if (r > 255) { printf ("WARNING! Overflow in relevance vector calculation!\n"); printf (" in add_to_rel_vec() with p=%ld, i=%ld, rel_vec[i]=%d-->%d\n", p, i, rel_vec[i], r); r = 255; } rel_vec[i] = r; } return; } /* default way to initialize the relevance shape function rel_shape[], a global * array of unsigned char values ... use a triangular peak of height value and * half-width influence (bytes); quantize everything with the chosen quantum * of resolution.... Also set rel_shape_peak and rel_shape_end values here, * globals where rel_shape_peak is the center (and highest point) of the shape, * and rel_shape_end is one greater than the index of the last element to use.... */ void fill_rel_shape (value, influence) int value, influence; { int i; if (quantum <= 0 || value <= 0 || value > 255 || 2*influence/quantum+1 >= MAXRELSHAPELEN) { printf ("\007Sorry, fatal error in fill_rel_shape()!\007\n"); printf (" (quantum=%d, value=%d, influence=%d)\n", quantum, value, influence); exit (1); } rel_shape_peak = influence / quantum - 1; rel_shape_end = 2 * rel_shape_peak + 1; for (i = 0; i <= rel_shape_peak; ++i) rel_shape[i] = ((1+i) * value * quantum) / influence; for (i = rel_shape_peak+1; i < rel_shape_end; ++i) rel_shape[i] = rel_shape[2*rel_shape_peak-i]; return; } /* find the peaks of the relevance vector rel_vec and count them; store peaks * (local maxima) in rel_peaks[], but don't do more than MAXRELPEAKCOUNT of them... * return the number of peaks found and stored in rel_peaks[] ... take special * care to handle first and last elements nicely! */ int find_rel_peaks (rel_vec, rel_peaks) unsigned char *rel_vec; long *rel_peaks; { int c; long n; c = 0; if (rel_vec[0] >= rel_vec[1] && rel_vec[0] > 0) rel_peaks[c++] = 0; for (n = 1; n < rel_vec_length-1; ++n) if (rel_vec[n] > rel_vec[n-1] && rel_vec[n] >= rel_vec[n+1]) { rel_peaks[c++] = n; if (c == MAXRELPEAKS) { printf ("Warning! --- rel_peaks array full at %d items!\n", c); printf (" Peaks beyond %ld bytes in db not used!\n", n * quantum); break; } } if (rel_vec[rel_vec_length-1] > rel_vec[rel_vec_length-2] && c < MAXRELPEAKS) rel_peaks[c++] = rel_vec_length-1; return (c); } /* sort the peaks of the relevance vector into descending order based on * their height, so that most relevant items come first.... */ void sort_rel_peaks (rel_vec, rel_peaks, rel_peak_count) unsigned char *rel_vec; long *rel_peaks; int rel_peak_count; { zpqsort (0, rel_peak_count, rel_vec, rel_peaks); return; } /* my quicksort to sort the rel_peaks ptr array ... * sort elements "first" through "last" */ void zpqsort (first, last, rel_vec, rel_peaks) int first, last; unsigned char *rel_vec; long *rel_peaks; { int i, j; long tmp; while (last - first > 1) { i = first; j = last; for (;;) { while (++i < last && rel_vec[rel_peaks[i]] > rel_vec[rel_peaks[first]]) ; while (--j > first && rel_vec[rel_peaks[j]] < rel_vec[rel_peaks[first]]) ; if (i >= j) break; tmp = rel_peaks[i]; rel_peaks[i] = rel_peaks[j]; rel_peaks[j] = tmp; } tmp = rel_peaks[first]; rel_peaks[first] = rel_peaks[j]; rel_peaks[j] = tmp; if (j - first < last - (j + 1)) { zpqsort (first, j, rel_vec, rel_peaks); first = j + 1; } else { zpqsort (j + 1, last, rel_vec, rel_peaks); last = j; } } } /* display a screen full (MAXLINES) of text --- with target byte at TOP, CENTER, * or BOTTOM as 'how' parameter requests; for CENTERED, try to back up in the * text a distance MAXLINES/2 lines, counting either MAXLINELEN chars or a '\n' * as a line-equivalent ... and set the values of top_value and bottom_value * for the screen, so that paging up/down can be supported.... * * note that a little silliness here is needed to handle the use of '\r' for Macintosh * users in place of '\n', and the presence of '\r' before '\n' for PC users ... ugh! * Try to do this more or less right most of the time by outputting literally what * the file contains and by counting '\r' or '\r\n' as a single '\n'....ugh! * * this needs to be tested on diverse machines and perhaps rewritten someday?! * * there may be a nonfatal bug for small files with lots of newlines?! * * do this with pointers instead of array indices? can't imagine that efficiency * is a question here; clarity & correctness is more critical, I contend.... :-) * * filter out control characters from output too, for compatibility with binary * files?! make this optional someday?? */ void show_text (target, how, top_valuep, bottom_valuep) long target; int how; long *top_valuep, *bottom_valuep; { char txtbuf[MAXLINES*MAXLINELEN]; int c, i; long n, tp, tp0, b0; if (how == TOP) b0 = target; else if (how == CENTER) b0 = target - MAXLINES * MAXLINELEN / 2; else if (how == BOTTOM) b0 = target - MAXLINES * MAXLINELEN; if (b0 < 0) b0 = 0; if (b0 >= doc_length) { printf ("\007 *** already at bottom of file! ***\007\n"); return; } if (fseek (doc_file, b0, 0) != 0) { printf ("\007Sorry, fatal error in fseek() in show_text(), b0=%ld!\007\n", b0); exit(1); } if ((n = fread (txtbuf, sizeof (char), MAXLINES*MAXLINELEN, doc_file)) == 0) { printf ("\007Sorry, fatal error in fread() in show_text()!\007\n"); exit(1); } /* we have read n chars into txtbuf beginning at docfile byte b0; * now we must set the text buffer ptr tp to the target word's index in the buffer, * and, if CENTER or BOTTOM is requested, back up ... i counts chars * on a given line, and c counts down newlines (or groups of MAXLINELEN chars in i) * with checks for '\r' or '\r\n' as equivalents... even if TOP is the * request, try to back up 1 line, for overlap and context.... */ if (how == TOP) c = 1; else if (how == CENTER) c = MAXLINES / 2; else if (how == BOTTOM) c = MAXLINES; for (tp0 = target - b0, i = 0; tp0 > 0; --tp0) { if (txtbuf[tp0] == '\n' || txtbuf[tp0] == '\r' || ++i == MAXLINELEN) { --c; i = 0; if (tp0 > 0 && txtbuf[tp0] == '\n' && txtbuf[tp0-1] == '\r') --tp0; /* for PC users who have '\r\n' pairs */ } if (c == 0) break; } /* now print out MAXLINES lines, starting with txtbuf[tp0], counting up in c */ for (tp = tp0, i = c = 0; tp < n; ++tp) { if (iscntrl(txtbuf[tp])) putchar (' '); else putchar (txtbuf[tp]); if (txtbuf[tp] == '\n' || txtbuf[tp] == '\r' || ++i == MAXLINELEN) { ++c; i = 0; printf ("\n"); /* this should work on any system, UNIX or Mac or PC?! */ if (txtbuf[tp] == '\r' && txtbuf[tp+1] == '\n') ++tp; /* for PC users... */ if (c == MAXLINES) break; } } /* now set top_value and bottom_value properly... */ *top_valuep = tp0 + b0; *bottom_valuep = tp + b0; }