Keyword-in-Context

The titles of all articles in the first ten volumes of the Journal of the ACM were sorted to prepare a concordance in which there was one line for every word of every title. However, certain words like “THE” and “EQUATION” were felt to be sufficiently uninformative that they were left out of the index. These special words and their frequency of occurrence are shown in the internal nodes of Fig. 15.

Fig. 15. An optimum binary search tree for a KWIC indexing application.

Notice that a title such as “On the solution of an equation for a certain new problem” would be so uninformative, it wouldn’t appear in the index at all!

The idea of KWIC Indexing is due to H. P. Luhn, Amer. Documentation 11 (1960), 288–295. (See W. W. Youden, JACM 10 (1963), 583–646, where the full KWIC index appears. page )

When preparing a KWIC index file for sorting, we might want to use a binary search tree in order to test whether or not each particular word is to be indexed. The other words fall between two of the unindexed words, with the frequencies shown in the external nodes of Fig. 15; thus, exactly 277 words that are alphabetically between “PROBLEMS” and “SOLUTION” appeared in the JACM titles during 1954–1963.

~

KNUTH, Donald Ervin, 1997. The art of computer programming. v. 3. Sorting and Searching. 2nd ed. Reading, Mass: Addison-Wesley. ISBN 0-201-89685-0, p. 439–442.