In the last of a series of talks, Cris Moore of the Santa Fe Institute spoke about the challenges of “false positives” in community detection.
He started by illustrating the psychological origin of this challenge: people evolved to detect patterns. In the wilderness, it’s better, he argued, to find a false-positive non-tiger than a false-negative actual tiger.
So we tend to see patterns that aren’t there.
In my complex networks class, this point was demonstrated when a community-detection algorithm found a “good” partition of communities in a random graph. This is surprising because a random network doesn’t have communities.
The algorithm found a “good” identification of communities because it was looking for a good identification of communities. That is, it randomly selected a bunch of possible ways to divide the network into communities and simply returned the best one – without any thought as to what that result might mean.
Moore showed illustrated a similar finding by showing a graph whose nodes were visually clustered into two communities. One side was colored blue and the other side colored red. Only 11% of the edges crossed between the red community and the blue community. Arguably, it looked like a good identification of communities.
But then he showed another identification of communities. Red nodes and blue nodes were all intermingled, but still – only 11% of the edges connected the two communities. Numerically, both identifications were equally “good.”
This usually is a sign that you’re doing something wrong.
Informally, the challenge here is competing local optima – eg, a “glassy” surface. There are multiple “good” solutions which produce disparate results.
So if you set out to find communities, you will find communities – whether they are there or not.
Moore used a belief propagation model to consider this point further. Imagine a network where every node is trying to figure out what community it is in, and where every node tells its neighbors what communities it thinks its going to be in.
Interestingly, node i‘s message to node j with the probability that i is part of community r will be based on the information i receives from all its neighbors other than j. As you may have already gathered, this is an iterative process, so factoring j‘s message to i into i‘s message to j would just create a nightmarish feedback loop of the two nodes repeating information to each other.
Moore proposed a game: someone gives you a network telling you the average degree, k, and probability of connections to in- and out-of community, and they ask you to label the community of each node.
Now, imagine using this believe propagation model to identify the proper labeling of your nodes.
If you graph the “goodness” of the communities you find as a function of λ – where λ is (kin – kout)/2k, or the second eigenvector of the matrix representing the in- and out-community degrees of the nodes – you will find the network undergoes a phase transition at the square root of the average degree.
Moore interpreted the meaning of that phase transition – if you are looking for two communities, there is a trivial, globally fixed point at λ = 1/(square root of k). His claim – an unsolved mathematical question – is that below that fixed point communities cannot be detected. It’s not just that it’s hard for an algorithm to detect communities, but rather that the information is theoretically unknowable.
If you are looking for more than two communities, the picture changes somewhat. There is still the trivial fixed point below which communities are theoretically undetectable, but there is also a “good” fixed point where you can start to identify communities.
Between the trivial fixed point and the good fixed point, finding communities is possible, but – in terms of computational complexity – is hard.
Moore added that the good fixed point has a narrow basin of attraction – so if you start with random nodes transmitting their likely community, you will most like fall into a trivial fixed point solution.
This type of glassy landscape leads to the type of mis-identification errors discussed above – where there are multiple seemingly “good” identifications of communities within a network, but each of which vary wildly in how they assign nodes.