Community Discovery

Community discovery in complex networks is an interesting problem with a number of applications, especially in the knowledge extraction task in social and information networks. However, many large networks often lack a particular community organization at a global level. In these cases, traditional graph partitioning algorithms fail to let the latent knowledge embedded in modular structure emerge, because they impose a top-down global view of a network. We propose here a simple local-first approach to community discovery, able to unveil the modular organization of real complex networks. This is achieved by democratically letting each node vote for the communities it sees surrounding it in its limited view of the global system, i.e. its ego neighborhood, using a label propagation algorithm; finally, the local communities are merged into a global collection. We tested this intuition against the state-of-the-art overlapping and non-overlapping community discovery methods, and found that our new method clearly outperforms the others in the quality of the obtained communities, evaluated by using the extracted communities to predict the metadata about the nodes of several real world networks. We also show how our method is deterministic, fully incremental, and has a limited time complexity, so that it can be used on web-scale real networks.

~

COSCIA, Michele, ROSSETTI, Giulio, GIANNOTTI, Fosca and PEDRESCHI, Dino, 2012. DEMON: a local-first discovery method for overlapping communities. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. Beijing China: ACM. 12 August 2012. p. 615–623. ISBN 978-1-4503-1462-6. DOI 10.1145/2339530.2339630. pdf

Complex network analysis has emerged as one of the most exciting domains of data analysis and mining over the last decade. One of the most prolific sub field is community discovery in complex network, or CD in short. The concept of a “Community” in a (web, social, or informational) network is intuitively understood as a set of individuals that are very similar, or close, to each other, more than to anybody else outside the community [6].

[6] Michele Coscia, Fosca Giannotti, and Dino Pedreschi. A classification for community discovery methods in complex networks. Statistical Analysis and Data Mining, 4(5):512–546, 2011.

This has often been translated in network terms into finding sets of nodes densely connected to each other and sparsely connected with the rest of the network. Community discovery can be seen as a network variant of traditional data clustering.

To efficiently detect these structures is very useful for a number of applications, ranging from targeted vaccinations and outbreak prevention [23], to viral marketing [16] and to many web data analysis tasks such as finding tribes in online information exchanges [12, 25], data compressing, clustering [4] and sampling [14].

The classical problem definition of community discovery finds a very intuitive counterpart for small networks, where the denser areas are easily identifiable by visual inspection, while the problem becomes much harder for medium and large scale networks. At the global level, very little can be said about the modular structure of the network, because on larger scales the organization of the system becomes simply too complex.

[…]

On the contrary, human eyes are good in finding denser areas in simple networks, i.e., the structure of cohesive groups of nodes that emerge considering a local fragment of an otherwise big network. But what does local mean? Commonsense goes that people are good at identifying the reasons why they know the people they know; therefore, each node has presumably an incomplete, yet clear, vision of the social communities it is part of, and that surrounds it.

The consequences of exploiting this idea for the CD problem is effectively illustrated by Figure 1(b).

Figure 1: The real world example of the “local vs global” structure intuition.

Here, we chose one of the 15k nodes from the previous example and extracted what we call its “ego minus ego” network, i.e. its ego network in which the ego node has been removed, together with all its attached edges. Suddenly, everything around the ego makes sense and some groups can be easily spotted. These groups correspond to the high school and university friends, mates from different workplaces and the members of an online community (we know all these details because the chosen ego is one of the authors of this paper). The ego is part of all these communities and knows that particular subsets of its neighborhood are part of these communities too. Probably, different egos have different perspectives over the same neighbors and it is the union of all these perspectives that creates an optimal partition of the network. In other words: if node A and node B are considered in the same communities by all the nodes connected to both A and B, then they should be grouped in the same community. This is achieved by a democratic bottom-up mining approach: in turn, each node gives the perspective of the communities surrounding it and then all the different perspectives are merged together in an overlapping structure.