KWIC Indexing

The idea of Keyword-in-Context (KWIC) indexing is due to H. P. Luhn, Amer. Documentation 11 (1960), 288–295.

See W. W. Youden, JACM 10 (1963), 583–646, where the full KWIC index appears. page

~

KNUTH, Donald Ervin, 1997. The art of computer programming. v. 3. Sorting and Searching. 2nd ed. Reading, Mass: Addison-Wesley. ISBN 0-201-89685-0, p. 439–442.

KWIC index, 439–442, 446, 494.

Fig. 15. An optimum binary search tree for a KWIC indexing application. (p. 440)

When preparing a KWIC index file for sorting, we might want to use a binary search tree in order to test whether or not each particular word is to be indexed. The other words fall between two of the unindexed words, with the frequencies shown in the external nodes of Fig. 15; thus, exactly 277 words that are alphabetically between “PROBLEMS” and “SOLUTION” appeared in the JACM titles during 1954–1963.

If we consider the KWIC Indexing application of Fig. 15 instead of the 31 commonest English words, the Trie loses its advantage because of the nature of the data. For example, a trie requires 12 iterations to distinguish between COMPUTATION and COMPUTATIONS. In this case it would be better to build the trie so that words are scanned from right to left instead of from left to right. (TAOCP, v. 3, p. 494)

p. 446: Construct a binary tree with n + 1 leaves, in such a way that σk corresponds to the Path from the root to [k] for 0 ≤ k ≤n, where ‘0’ denotes a left branch and ‘1’ denotes a right branch. Also, if σk−1 has the form αk0βk and σk has the form αk1γk for some αk, βk, and γk, let the internal node (k) correspond to the Path αk. Thus we would have

TAOCP, v. 3, p. 446

in the example above.

Note: (3) is a (k); [3] is a [k]. The circles and squares are not shown in the font.