Text Classification

The self-organizing map has already found appreciation for document classification in the information retrieval community. The map display is a highly effective and intuitive metaphor for orientation in the information space established by a document collection.

In this paper we discuss ways of using self-organizing maps for document classification. Furthermore, we argue in favor of paying more attention to the fact that document collections lend themselves naturally to a hierarchical structure defined by the subject matter of the documents. We take advantage of this fact by using a hierarchically organized neural network, built up from a number of independent self-organizing maps, to enable the true establishment of a Document Taxonomy.

As a highly convenient side effect of using such an architecture, the time needed for training is reduced substantially and the user is provided with an even more intuitive metaphor for visualization. Since the single layers of self-organizing maps represent different aspects of the document collection at different levels of detail, the neural network shows the document collection in a form comparable to an atlas where the user may easily select the most appropriate degree of granularity depending on the actual focus of interest during the exploration of the document collection.

Keywords: Document Classification; Hierarchical Feature Maps; Self-Organizing Maps; Vectorspace Model

MERKL, Dieter, 1998. Text classification with self-organizing maps: Some lessons learned. Neurocomputing. 6 November 1998. Vol. 21, no. 1, p. 61–77. DOI 10.1016/S0925-2312(98)00032-0.

⇒ Classification ⇒ Text Classification ⇒ Self-Organizing Map ⇒ Document Classification ⇒ Information Retrieval Community ⇒ Document Taxonomy

# Introduction

In recent years we have witnessed an ever-increasing flood of miscellaneous written information available in computer-accessible form, culminating in the advent of massive digital libraries. Powerful methods for organizing, exploring, and searching collections of text documents are thus needed to deal with that mass of information.

⇒ Information Retrieval Community

Classical methods developed by the information retrieval community for searching documents are based on keywords assigned either manually or automatically by indexing the full text of the various documents.

These methods may be enhanced with proximity search functionality and with keyword combination according to Boolean algebra.
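As a rough illustration of keyword-based retrieval with Boolean combination, the following Python sketch builds a tiny inverted index and combines the resulting document sets with set operations. The toy documents, the `build_index` helper, and the whitespace tokenization are assumptions made for illustration only; proximity search is not covered.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical toy collection, not taken from the paper.
docs = {
    "d1": "self-organizing maps for document classification",
    "d2": "keyword search in document collections",
    "d3": "neural network maps",
}
index = build_index(docs)

both = index["document"] & index["maps"]       # document AND maps
either = index["keyword"] | index["maps"]      # keyword OR maps
without = index["document"] - index["maps"]    # document AND NOT maps
```

A query only succeeds here if the user already knows which terms occur in the collection.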

Other widely used approaches rely instead on document similarity measures based on a vector-space representation of the various texts.
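A minimal sketch of such a similarity measure, assuming raw term-frequency vectors and cosine similarity (the excerpt does not say which weighting or measure is meant). The helper names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector (term -> count)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy collection, not taken from the paper.
docs = {
    "d1": "neural network training with self-organizing maps",
    "d2": "self-organizing maps for document classification",
    "d3": "boolean keyword search in digital libraries",
}
query = tf_vector("self-organizing maps")

# Rank documents by decreasing similarity to the query.
ranking = sorted(
    docs,
    key=lambda d: cosine_similarity(query, tf_vector(docs[d])),
    reverse=True,
)
```

Unlike a Boolean match, the result is a graded ranking, so partially matching documents are ordered rather than excluded.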

Still missing, however, are tools providing assistance for explorative search in document collections. Explorative search may be characterized as the struggle to uncover useful information when the user is unaware of appropriate keywords that could guide the search process towards relevant information. The reason for such a situation is twofold. First, the user often has only limited insight into what is actually contained in the text collection and thus has only vague expectations of what might be found. Second, a usually convenient characteristic of natural language, namely that the same fact of reality may be described in a number of different ways, turns out to be a hindrance in locating relevant information, because the same piece of information may be represented by different sets of keywords. This is often referred to as the vocabulary problem in information retrieval literature [4].

⇒ Vocabulary Problem

[…]

Fig. 4. Architecture of a hierarchical feature map. ⇒ Hierarchical Feature Maps ⇒ MultilayerGraphs.jl

[…]

p. 71 A valuable property of the hierarchical feature map is the substantial speed-up of the training process as compared to conventional self-organizing maps. An explanation that goes beyond the obvious reduction of the input data dimensions emerges from an investigation into the general properties of the self-organizing training process. In self-organizing maps, the units that are subject to adaptation are selected by means of the neighborhood kernel.

It is common practice that, at the start of the training process, almost the whole map is affected by the presentation of an input pattern. With this strategy, the map is forced to establish initial clusters of similar input items at the outset of learning. By reducing the width of the neighborhood kernel in the course of learning, the training process is able to learn ever finer distinctions within the clusters while the overall topology of cluster arrangement is maintained.
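The training scheme described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: the Gaussian neighborhood kernel, the linear decay of learning rate and kernel width, the parameter defaults, and the toy two-dimensional data are all assumptions.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid position of the unit whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda u: sum((w - v) ** 2 for w, v in zip(weights[u], x)),
    )

def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0   # wide kernel: most of the map adapts at first
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):   # shuffled pass over the data
            progress = step / total_steps
            lr = lr0 * (1.0 - progress)                  # decaying learning rate
            sigma = max(sigma0 * (1.0 - progress), 0.5)  # shrinking kernel width
            winner = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                grid_dist2 = (r - winner[0]) ** 2 + (c - winner[1]) ** 2
                h = math.exp(-grid_dist2 / (2.0 * sigma ** 2))  # neighborhood kernel
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights

# Two well-separated toy clusters (hypothetical data).
data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
        [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
som = train_som(data, rows=3, cols=3)
```

Every unit is visited for every input pattern, and early in training nearly all of them move, which is one way to see why training a single large map is costly.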

The flip side of the coin, however, is that units along the boundary between two clusters tend to be modified occasionally as belonging to either one of these clusters. This interference is the reason for the time-consuming self-organizing process.

Such interference is dramatically reduced in hierarchical feature maps. This reduction is due to the architecture of this neural network. The topology of the high-level categories is represented in the first layer of the hierarchy. Each of its sub-categories is then independently organized within a separate map at a lower level of the hierarchy. These maps in turn are free from having to represent the overall structure, as this structure is already determined by the architecture of the hierarchical feature map.

In summary, much computational effort is saved due to the fact that the overall structure of clusters is determined by the architecture of the neural network rather than by its ⇒ Learning Rule.
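A two-layer version of this idea can be sketched as below, under several assumptions: the map sizes are fixed by hand, each child map is trained only on the inputs assigned to its parent unit, and the reduction of input dimensions at lower layers mentioned in the excerpt is not modeled. The functions `train_hierarchy` and `locate`, the compact `train_som`, and the toy data are hypothetical.

```python
import math
import random

def bmu(weights, x):
    """Grid position of the unit whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda u: sum((w - v) ** 2 for w, v in zip(weights[u], x)),
    )

def train_som(data, rows, cols, epochs=30, seed=0):
    """Compact SOM with an assumed Gaussian kernel and linear decay."""
    rng = random.Random(seed)
    weights = {(r, c): [rng.random() for _ in data[0]]
               for r in range(rows) for c in range(cols)}
    total, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):
            decay = 1.0 - step / total
            sigma = max(max(rows, cols) / 2.0 * decay, 0.5)
            win = bmu(weights, x)
            for (r, c), w in weights.items():
                d2 = (r - win[0]) ** 2 + (c - win[1]) ** 2
                gain = 0.5 * decay * math.exp(-d2 / (2.0 * sigma ** 2))
                for i in range(len(w)):
                    w[i] += gain * (x[i] - w[i])
            step += 1
    return weights

def train_hierarchy(data, top_shape=(2, 2), child_shape=(2, 2)):
    """First layer sees all data; each child map sees only its own subset."""
    top = train_som(data, *top_shape)
    groups = {}
    for x in data:
        groups.setdefault(bmu(top, x), []).append(x)
    children = {unit: train_som(members, *child_shape)
                for unit, members in groups.items()}
    return top, children

def locate(top, children, x):
    """Return (top-layer unit, child-map unit) for an input vector."""
    unit = bmu(top, x)
    child = children.get(unit)   # units that attracted no data have no child map
    return unit, (bmu(child, x) if child else None)

# Hypothetical toy data.
data = [[0.0, 0.0], [0.1, 0.1], [0.0, 1.0], [0.1, 0.9],
        [1.0, 0.0], [0.9, 0.1], [1.0, 1.0], [0.9, 0.9]]
top, children = train_hierarchy(data)
positions = [locate(top, children, x) for x in data]
```

Each child map is small and is trained on only a fraction of the data, so no unit in one sub-category is ever adapted by inputs from another.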

⇒ Hierarchical Feature Maps: The utilization of hierarchical feature maps, however, has its limitations, too. Since the architecture of the neural network, in terms of the size of the various self-organizing maps and the depth of the hierarchy, has to be defined prior to training, some knowledge concerning the structure of the document archive is definitely necessary in order to guarantee a suitable document classification.

[…]
