Category Archives: Network Analysis

Gender Representation in Comic Books

For one of my classes, I have spent this semester cleaning and analyzing data from the Grand Comics Database (GCD) with an eye towards assessing gender representation in English-language superhero comics.

Starting with GCD’s records of over 1.5 million comics from around the world, I identified the 66,000 individual comic book titles that fit my criteria. For the characters appearing in those comics, I hand-coded gender wherever a character had a self-identified male or female gender.

From this, I built a bipartite network – comic books on one side and comic book characters on the other. A comic and a character are linked if the character appeared in that comic. The resulting network has around 66,000 comic titles, 10,000 characters, and a total of nearly 300,000 links between the two sides.
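For the curious, here is a minimal sketch of how a bipartite network like this can be stored. It is not the code behind the analysis: the appearance records and the `build_bipartite` helper are hypothetical and only illustrate the two-sided structure.

```python
from collections import defaultdict

# Hypothetical (comic, character) appearance records; not taken from the GCD.
appearances = [
    ("Comic A", "Wonder Woman"),
    ("Comic A", "Batman"),
    ("Comic B", "Batman"),
    ("Comic B", "Superman"),
    ("Comic B", "Wonder Woman"),
]

def build_bipartite(pairs):
    """Return two adjacency maps: comic -> characters and character -> comics."""
    comic_to_chars = defaultdict(set)
    char_to_comics = defaultdict(set)
    for comic, character in pairs:
        comic_to_chars[comic].add(character)
        char_to_comics[character].add(comic)
    return comic_to_chars, char_to_comics

comic_to_chars, char_to_comics = build_bipartite(appearances)

# Each (comic, character) pair is one link between the two sides.
n_links = sum(len(chars) for chars in comic_to_chars.values())
```

Keeping both maps makes it cheap to ask either "who appears in this comic?" or "which comics feature this character?".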

From the bipartite network, I examined the projections onto each type of node. For example, the visualization below contains only characters, linking two characters if they appeared in the same issue. Nodes here are colored by publisher:

[Figure: character projection of the network, with nodes colored by publisher]
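The character projection described above can be sketched as follows. The example casts are made up, and weighting each link by the number of shared comics is an assumption on my part rather than something the visualization requires.

```python
from collections import defaultdict
from itertools import combinations

def project_onto_characters(comic_to_chars):
    """Link two characters if they share a comic; weight = number of shared comics."""
    weights = defaultdict(int)
    for chars in comic_to_chars.values():
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(chars), 2):
            weights[(a, b)] += 1
    return dict(weights)

# Hypothetical comics and casts, for illustration only.
example = {
    "Comic A": {"Wonder Woman", "Batman"},
    "Comic B": {"Batman", "Superman", "Wonder Woman"},
}
character_links = project_onto_characters(example)
```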

The character network is heavily biased towards men; nearly 75% of the characters are male. Since the dataset includes comics from the 1930s to the present, this imbalance can be better assessed over time. Using the publication year of each comic, we can look at what percentage of all characters in a given year were male or female:

[Figure: percentage of male and female characters by publication year]
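A yearly breakdown like this boils down to a grouped count. Here is a minimal sketch, assuming each record is a hypothetical (year, gender) pair for one character appearance. Whether to count appearances or distinct characters is a choice the sketch does not settle.

```python
from collections import Counter, defaultdict

def gender_share_by_year(records):
    """Return {year: {gender: percentage of that year's records}}."""
    counts = defaultdict(Counter)
    for year, gender in records:
        counts[year][gender] += 1
    shares = {}
    for year, by_gender in sorted(counts.items()):
        total = sum(by_gender.values())
        shares[year] = {g: 100.0 * n / total for g, n in by_gender.items()}
    return shares

# Made-up records, for illustration only.
records = [(1940, "M"), (1940, "M"), (1940, "M"), (1940, "F"),
           (2005, "M"), (2005, "F")]
shares = gender_share_by_year(records)
```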

While comics were very gender-skewed through the 1970s, in recent years, the balance has gotten a little better, though male characters still dominate. If anyone knows what spiked the number of female characters in the early 2000s, please let me know. I looked at a couple of things, but couldn’t identify the driving force behind that shift. It’s possible it just represents some inaccuracies in the original data set.

If you prefer, we can also look at the various eras of comic books to see how gender representation changed over time:

[Figure: gender representation by comic book era]

I was particularly interested in applying a rudimentary version of the Bechdel test to this dataset. Unfortunately, I didn’t have the data to apply the full test, which asks whether two women (i) appear in the same scene, and (ii) talk to each other about (iii) something other than a man. But I could look at raw character counts for the titles in my dataset:

[Figure: raw character counts for the titles in the dataset]

I then looked at additional attributes of those titles which pass this rudimentary version of the Bechdel test. For example, when were they published? Below are two different ways of bucketing the publication years: first by accepted comic book eras and the second by uniform time blocks. Both approaches show that having two female characters in comic books started out rare but has become more common, coinciding roughly with the overall growth of female representation in comic books.
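The rudimentary check is easy to express in code. This is only a sketch: the `has_two_women` helper, the gender lookup, and the titles are all hypothetical, and it tests only for the presence of two female characters, not the scene and conversation criteria of the full test.

```python
def has_two_women(characters, gender_of):
    """True if at least two distinct female characters appear in the cast."""
    women = {name for name in characters if gender_of.get(name) == "F"}
    return len(women) >= 2

# Hypothetical hand-coded genders and casts, for illustration only.
gender_of = {"Wonder Woman": "F", "Batgirl": "F", "Batman": "M"}
titles = {
    "Title A": ["Wonder Woman", "Batgirl", "Batman"],
    "Title B": ["Wonder Woman", "Batman"],
}
passing = [title for title, cast in titles.items() if has_two_women(cast, gender_of)]
```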

[Figure: titles with two female characters by publication period, bucketed by comic book era and by uniform time blocks]

Finally, I could also look at the publishers of these comic books. My own biases gave me a suspicion of what I might find, but rationally I wasn’t at all sure what to expect. But now you can see, Marvel published an overwhelming number of the “Bechdel passed” comics in my dataset.

[Figure: “Bechdel passed” comics by publisher]

To be fair, this graphic doesn’t account for anything more general about Marvel’s publishing habits. Marvel is known for its ensemble casts, for example, so perhaps they have more comics with two women simply because they have more characters in their comics.

This turns out to be partly true, but not quite enough to account for Marvel’s dominance in this area. About half of all comics with more than two characters of any gender are published by Marvel, while DC contributes about a third.


Proprietary Platform Challenges in Big Data Analysis

Today I had the opportunity to attend a great talk by Jürgen Pfeffer, Assistant Research Professor at Carnegie Mellon’s School of Computer Science. Pfeffer talked broadly about the methodological challenges of big-data social science research.

Increasingly, he argued, social scientists are reliant on data collected and curated by third-party – often private – sources. As researchers, we are less intimately connected with our data, less aware of the biases that went into its collection and cleaning. Rather, in the era of social media and big data, we turn on some magical data source and watch the data flow in.

Take, for instance, Twitter – a platform whose prevalence and open API make it a popular source for scraping big datasets.

In a 2013 paper with Fred Morstatter, Huan Liu, and Kathleen M. Carley, Pfeffer assessed the representativeness of Twitter’s streaming API. As the authors explain:

The “Twitter Streaming API” is a capability provided by Twitter that allows anyone to retrieve at most a 1% sample of all the data by providing some parameters…The methods that Twitter employs to sample this data is currently unknown.

Using Twitter’s “Firehose” – an expensive service that allows for 100% access – the researchers compared the data provided by Twitter’s API to representative samples collected from the Firehose.

In news disturbing for computational social scientists everywhere, they found that “the Streaming API performs worse than randomly sampled data…in that case of top hashtag analysis, the Streaming API sometimes reveals negative correlation in the top hashtags, while the randomly sampled data exhibits very high positive correlation with the Firehose data.”
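To make "negative correlation in the top hashtags" concrete, here is a small sketch of one common way to compare two rankings: Kendall's tau, computed over the hashtags both sources share. The counts are made up, the `kendall_tau` helper is my own, and this is not a reproduction of the paper's exact procedure.

```python
from itertools import combinations

def kendall_tau(x_counts, y_counts):
    """Kendall's tau-a between two count dictionaries over their shared keys."""
    tags = sorted(set(x_counts) & set(y_counts))
    n = len(tags)
    if n < 2:
        raise ValueError("need at least two shared hashtags")
    score = 0
    for a, b in combinations(tags, 2):
        product = (x_counts[a] - x_counts[b]) * (y_counts[a] - y_counts[b])
        if product > 0:
            score += 1   # both sources order this pair the same way
        elif product < 0:
            score -= 1   # the sources disagree on this pair
    return score / (n * (n - 1) / 2)

# Made-up hashtag counts, for illustration only.
firehose = {"a": 100, "b": 80, "c": 60, "d": 40}
faithful_sample = {"a": 10, "b": 8, "c": 6, "d": 4}
skewed_sample = {"a": 1, "b": 3, "c": 5, "d": 7}
```

A value near 1 means the sample ranks hashtags the way the full data does; a negative value means the sample's "top" hashtags are closer to the full data's bottom ones.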

In one particularly telling example, the team compared the raw counts from both the API and the Firehose of tweets about “Syria”. The API data shows high initial interest, tapering off around Christmas and seemingly starting to pick up again mid-January. You may be prepared to draw conclusions from this data: people are busy over the holidays, they are not on Twitter or not attentive to international issues at this time. It seems reasonable that there might be a lull.

But the Firehose data tell a different story: the API initially provides a good sample of the full dataset, but then as the API shows declining mentions, the Firehose shows a dramatic rise in mentions.


Rather than indicating a change in user activity, the decline in the streaming data is most likely due to a change in Twitter’s sampling methods. But since neither the methods nor announcements of changes to the methods are publicly available, it’s impossible for a researcher to properly know.

While these results are disconcerting, Pfeffer was quick to point out that all is not lost. Bias in research methods is an old problem; indeed, bias is inherent to the social science process. The real goal isn’t to eradicate all bias, but rather to be aware of its existence and influence. To, as his talk was titled, know your data and know your methods.


Civic Studies and Network Science

I had the delightful opportunity today to return to my former place of employment, Tufts University’s Jonathan M. Tisch College of Citizenship and Public Service, for a conversation about civic studies.

The “intellectual component of civic renewal, which is the movement to improve societies by engaging their citizens,” civic studies is the field that set me on this path towards a Ph.D. Civic studies puts citizens (of all legal statuses) at the fore, bringing together facts, values, and strategies to answer the question, “What should we do?”

Ultimately, this is the question that I hope to help answer, as a person and as a scholar.

So, perhaps you can appreciate my former colleagues’ confusion when they learned that my first semester coursework is in physics and math.

These are not, I suppose, the first fields one thinks of when looking to empower people to improve their communities. I am not convinced that bias is well founded, but regardless, civic studies did primarily grow out of the social sciences and has its academic home closest to that realm.

So if my interest is in civic studies, how did I end up in network science?

I hope to some day have a clear and compelling answer to that question – though it’s complicated by the fact that both fields are new and most people aren’t familiar with either of them.

The most obvious connection between civic studies and network science is around social networks. Civic studies is an inherently social field – as indicated by the “we” in “What should we do?” Questions of who is connected – and who is not – are critical.

For example, in Doug McAdam’s excellent book Freedom Summer, he documents the critical role of the strong social network of white, northern college students who participated in Freedom Summer. These students brought the problems of Mississippi to the attention of the white mainstream, and these students went on to use the organizing skills they learned in the summer of 1964 to fuel the radical movements of the 1960s.

But networks also offer other insight into civic questions. Personally, I am particularly interested in network analysis of deliberation – exploring the exchange of ideas during deliberation and exploring how one’s own network of ideas influences the way they draw on supporting arguments.

More broadly, networks can be seen throughout the civic world: not only are there networks of people and ideas, there are networks of institutions, networks of power, and the physical network of spaces that shape our world.

Networks and civics, I think, are closer than one might think.


Six Degrees of Wonder Woman: Part 2

As I mentioned previously, for one of my classes I am constructing a network of superheroes with an eye towards gender diversity in this medium.

Using data from the Grand Comics Database, I filtered down their 1.5 million unique stories to look specifically at English language comic books tagged as being in the “superhero” genre.

Each comic book record includes a list of the characters appearing in that comic book, but, unfortunately, the database doesn’t include information on characters’ identified gender. So I went through and added this information to the data set.

More generally, I also wanted to identify the unique identity of each person under a given mantle – a non-trivial task.

In the end, I arrived at the superhero social network below. Female characters are indicated by green and make up 28% of the network. Yellow nodes indicate male characters.


Nodes are sized by degree (number of connections to other characters), and you can see from the above that male characters have, on average, a higher degree than female characters.
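Comparing average degree by gender takes only a few lines. This is a sketch on a made-up toy network, not the real data, and `mean_degree_by_gender` is a hypothetical helper.

```python
from collections import Counter, defaultdict

def mean_degree_by_gender(edges, gender_of):
    """Average number of connections per character, grouped by gender."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    degrees = defaultdict(list)
    for node, gender in gender_of.items():
        degrees[gender].append(degree[node])  # isolated nodes count as degree 0
    return {g: sum(values) / len(values) for g, values in degrees.items()}

# Toy network, for illustration only.
edges = [("Batman", "Superman"), ("Batman", "Wonder Woman"), ("Batman", "Gordon")]
gender_of = {"Batman": "M", "Superman": "M", "Gordon": "M", "Wonder Woman": "F"}
means = mean_degree_by_gender(edges, gender_of)
```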

Since the above visualization is not very helpful, I’ve included a visualization of the top 50 nodes (by degree), below. The top 5 men and top 5 women are labeled – I had to split it up because Wonder Woman was the only woman in the top 10. If you’re wondering, the yellow node off to the top left is one Commissioner James Gordon.


Stay tuned for future analysis!



Behavioral Responses to Social Dilemmas

I had the opportunity today to attend a talk by Yamir Moreno, of the University of Zaragoza in Spain. A physicist by training, Moreno has more recently been studying game theory and human behavior, particularly in a complex systems setting.

In research published in 2012, Moreno and his team had people of various age groups play a typical prisoner’s dilemma game: a common scenario where an individual’s best move is to defect, but everyone suffers if everyone defects. The best outcome is for everyone to cooperate, but that can be hard to achieve since individuals have incentives to defect.
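The tension in the game is easiest to see with numbers. Below is a minimal two-player sketch; the payoff values are conventional textbook ones that I am assuming for illustration, not the ones used in Moreno's experiment, which was played with multiple neighbors.

```python
# PAYOFF[(my_move, their_move)] is my payoff; "C" = cooperate, "D" = defect.
# Assumed illustrative values with temptation > reward > punishment > sucker.
PAYOFF = {
    ("C", "C"): 3,  # reward for mutual cooperation
    ("C", "D"): 0,  # sucker's payoff
    ("D", "C"): 5,  # temptation to defect
    ("D", "D"): 1,  # punishment for mutual defection
}

def best_response(their_move):
    """The move that maximizes my payoff against a fixed move by the other player."""
    return max("CD", key=lambda mine: PAYOFF[(mine, their_move)])
```

Whatever the other player does, defecting pays more for the individual, yet both players end up better off under mutual cooperation than under mutual defection.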

Playing in groups of 4 over several rounds, players were arranged on different network structures – in one version of the game players sat on a traditional lattice, while in another they sat on a scale-free network.

As you might expect from a prisoner’s dilemma, when a person’s neighbors cooperated that person was more likely to cooperate in later rounds. When a person’s neighbors defected, that person was more likely to defect in later rounds.

Interestingly, in this first version of the experiment, Moreno found little difference between the lattice and scale-free structure.

Postulating that this was due to the static nature of the network, Moreno devised a different experiment: players were placed in an initial network structure, but they had the option to cut or add connections to other people. New connections were always reciprocal, with both parties having to agree to the connection.

He then ran this experiment over several different parameters, with some games allowing players to see each other’s past actions and other games having no memory.

In the setting where people could see past actions, cooperation was significantly higher – about 20-30% more than expected otherwise. People who chose to defect were cut out of these networks and ultimately weren’t able to benefit from their defecting behavior.

I found this particularly interesting because earlier in the talk I had been thinking of Habermas. As interpreted by Gordon Finlayson, Habermas thought the result of standard social theory was “a false picture of society as an aggregate of lone individual reasoners, each calculating the best way of pursuing their own ends. This picture squares with a pervasive anthropological view that human beings are essentially self-interested, a view that runs from the ancient Greeks, through early modern philosophy, and right up to the present day. Modern social theory, under the influence of Hobbes or rational choice theory, thinks of society in similar terms. In Habermas’ eyes, such approaches neglect the crucial role of communication and discourse in forming social bonds between agents, and consequently have an inadequate conception of human association.”

More plainly – it is a critical feature of the Prisoner’s Dilemma that players are not allowed to communicate.

If they could communicate, Habermas offers, they would form communities and associate very differently than in a communications-free system.

Moreno’s second experiment didn’t include communication per se – players didn’t deliberate about their actions before taking them. But in systems with memory, a person’s actions became part of the public record – information that other players could take into account before associating with them.

In Moreno’s account, the only way for cooperators to survive is to form clusters. On the other hand, in a system with memory, a defector must join those communities as a cooperative member in order to survive.


Glassy Landscapes and Community Detection

In the last of a series of talks, Cris Moore of the Santa Fe Institute spoke about the challenges of “false positives” in community detection.

He started by illustrating the psychological origin of this challenge: people evolved to detect patterns. In the wilderness, it’s better, he argued, to find a false-positive non-tiger than a false-negative actual tiger.

So we tend to see patterns that aren’t there.

In my complex networks class, this point was demonstrated when a community-detection algorithm found a “good” partition of communities in a random graph. This is surprising because a random network doesn’t have communities.

The algorithm found a “good” identification of communities because it was looking for a good identification of communities. That is, it randomly selected a bunch of possible ways to divide the network into communities and simply returned the best one – without any thought as to what that result might mean.
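Here is a small sketch of that effect, under my own assumptions: a random graph with no planted communities, a "quality" score equal to the fraction of edges crossing between two equal halves, and a search that just tries random splits and keeps the best. The function names and parameters are hypothetical.

```python
import random
from itertools import combinations

def random_graph(n, p, rng):
    """Connect each pair of nodes independently with probability p."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

def crossing_fraction(edges, side):
    """Fraction of edges whose endpoints fall on different sides."""
    crossing = sum(1 for u, v in edges if side[u] != side[v])
    return crossing / len(edges)

def best_random_bipartition(edges, n, trials, rng):
    """Try random equal splits; return (best, average) crossing fraction."""
    fractions = []
    for _ in range(trials):
        nodes = list(range(n))
        rng.shuffle(nodes)
        half = set(nodes[: n // 2])
        side = [node in half for node in range(n)]
        fractions.append(crossing_fraction(edges, side))
    return min(fractions), sum(fractions) / len(fractions)

rng = random.Random(0)
edges = random_graph(40, 0.1, rng)
best, average = best_random_bipartition(edges, 40, 2000, rng)
```

The best split always looks at least as good as the typical one, simply because it was selected for looking good, even though the graph has no community structure to find.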

Moore illustrated a similar finding by showing a graph whose nodes were visually clustered into two communities. One side was colored blue and the other side colored red. Only 11% of the edges crossed between the red community and the blue community. Arguably, it looked like a good identification of communities.

But then he showed another identification of communities. Red nodes and blue nodes were all intermingled, but still – only 11% of the edges connected the two communities. Numerically, both identifications were equally “good.”

This usually is a sign that you’re doing something wrong.

Informally, the challenge here is competing local optima – e.g., a “glassy” surface. There are multiple “good” solutions which produce disparate results.

So if you set out to find communities, you will find communities – whether they are there or not.

Moore used a belief propagation model to consider this point further. Imagine a network where every node is trying to figure out what community it is in, and where every node tells its neighbors what communities it thinks it’s going to be in.

Interestingly, node i’s message to node j with the probability that i is part of community r will be based on the information i receives from all its neighbors other than j. As you may have already gathered, this is an iterative process, so factoring j’s message to i into i’s message to j would just create a nightmarish feedback loop of the two nodes repeating information to each other.

Moore proposed a game: someone gives you a network, telling you the average degree, k, and the probabilities of in-community and out-of-community connections, and they ask you to label the community of each node.

Now, imagine using this belief propagation model to identify the proper labeling of your nodes.

If you graph the “goodness” of the communities you find as a function of λ – where λ is (k_in – k_out)/(2k), or the second eigenvalue of the matrix representing the in- and out-community degrees of the nodes – you will find the network undergoes a phase transition at one over the square root of the average degree.

Moore interpreted the meaning of that phase transition – if you are looking for two communities, there is a trivial, globally fixed point at λ = 1/(square root of k). His claim – an unsolved mathematical question – is that below that fixed point communities cannot be detected. It’s not just that it’s hard for an algorithm to detect communities, but rather that the information is theoretically unknowable.
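As a quick numerical sketch of the two-community threshold, the check below compares λ with one over the square root of the average degree. One assumption is mine and not from the talk: the two communities are equal in size, so the average degree k is the mean of the in- and out-community degrees.

```python
import math

def is_detectable(k_in, k_out):
    """Compare lambda with 1/sqrt(k) for two equal-sized communities.

    Assumption (labeled loudly): k = (k_in + k_out) / 2.
    """
    k = (k_in + k_out) / 2
    lam = (k_in - k_out) / (2 * k)
    return lam > 1 / math.sqrt(k)
```

For example, with an average degree of 3, splitting it as k_in = 5 and k_out = 1 gives λ of about 0.67, above the threshold of about 0.58, while k_in = 4 and k_out = 2 gives λ of about 0.33, below it.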

If you are looking for more than two communities, the picture changes somewhat. There is still the trivial fixed point below which communities are theoretically undetectable, but there is also a “good” fixed point where you can start to identify communities.

Between the trivial fixed point and the good fixed point, finding communities is possible, but – in terms of computational complexity – is hard.

Moore added that the good fixed point has a narrow basin of attraction – so if you start with random nodes transmitting their likely community, you will most likely fall into a trivial fixed point solution.

This type of glassy landscape leads to the type of mis-identification errors discussed above – where there are multiple seemingly “good” identifications of communities within a network, but which vary wildly in how they assign nodes.



When I started my Ph.D. program somebody warned me that being an interdisciplinary scholar is not a synonym for being mediocre at many things. Rather, choosing an interdisciplinary path means having to work just as hard as your disciplinary colleagues, but doing this equally well across multiple disciplines.

I suspect that comment doesn’t really do justice to the challenges faced by scholars within more established disciplines, but I can definitely attest to the fact that working across disciplines can be a challenge.

Having worked in academia for many years, I’d been prepared for this on a bureaucratic level. My program is affiliated with multiple departments and multiple colleges at Northeastern. No way is that going to go smoothly. Luckily, due to some amazing colleagues, I’ve hardly had to deal with the bureaucratic issues at all. In fact, I’ve been quite impressed to find that I experience the department as a well-integrated part of the university. No small feat!

But there remain scholarly challenges to being interdisciplinary.

This morning, I was reading through computer science literature on argument detection and sentiment analysis. This relatively young field has already developed an extensive literature, building off the techniques of machine learning to automatically process large bodies of text.

A number of articles included reflections on how people communicate. If someone says, “but…” that probably means they are about to present a counter argument. If someone says, “first of all…” they are probably about to present a detailed argument.

These sorts of observations are at the heart of sentiment analysis. Essentially, the computer assigns meaning to a statement by looking for patterns of key words and verbal indicators.
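In its crudest form, that kind of pattern matching looks something like the sketch below. The marker list and labels are my own toy assumptions drawn from the two examples above, not a method from any of the papers, and real systems are far more elaborate.

```python
import re

# Toy cue phrases and the rough signal each is assumed to carry.
MARKERS = {
    "but": "possible counterargument",
    "first of all": "possible detailed argument",
}

def tag_markers(sentence):
    """Return (marker, label) pairs for each cue phrase found as whole words."""
    text = sentence.lower()
    found = []
    for marker, label in MARKERS.items():
        if re.search(r"\b" + re.escape(marker) + r"\b", text):
            found.append((marker, label))
    return found
```

The word-boundary check keeps "butter" from counting as "but", which is about as much linguistic insight as this sketch contains.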

I was struck by how divorced these rules of speech patterns were from any social science or humanities literature. Computer scientists have been thinking about how to teach a computer to detect arguments and they’ve established their own entire literature attempting to do so. They’ve made a lot of great insights as they built the field, but – at least from the little I read today – there is something lacking from being so siloed.

Philosophers have, in a manner of speaking, been doing “argument detection” for a lot longer than computer scientists. Surely, there is something we can learn from them.

And this is the real challenge of being interdisciplinary. As I dig into my field(s), I’m struck by the profound quantity of knowledge I am lacking. Each time I pick up a thread it leads deeper and deeper into a literature I am excited to learn – but the literatures I want to study are divergent.

I have so much to learn in the fields of math, physics, computer science, political science, sociology, philosophy, and probably a few other fields I’ve forgotten to name. Each of those topics is a rich field in its own right, but I have to find some way of bringing all those fields together. Not just conceptually but practically. I have to find time to learn all the things.

It’s a bit like standing in the middle of a forest – wanting not just to find the nearest town, but to explore the whole thing.

Typical academia, I suppose, is like a depth first search – you choose your direction and you dig into it as deep as possible.

Being an interdisciplinary scholar, on the other hand, is more of a breadth first search – you have to gain a broad understanding before you can make any informed comments about the whole.


It’s No Longer Our Policy to Put Out Fires

There’s a great scene in The West Wing about a fire in Yellowstone. “When something catches on fire, it’s no longer our policy to put it out?”

The scene was based off a real incident of fire management strategy. In 1988, Yellowstone suffered the largest wildfire recorded in its history, burning 30% of the total acreage of the park. The fires called into question the National Park Service’s “let it burn” strategy.

Implemented in 1972, this policy let natural fires run their course and remains policy today. As per a 2008 order from the director of the National Park Service, “Wildland fire will be used to protect, maintain, and enhance natural and cultural resources and, as nearly as possible, be allowed to function in its natural ecological role.”

The “let it burn” strategy may have had an impact on the Yellowstone fires, but as a 1998 article in Science argued, the problem may have been that they hadn’t implemented the policy soon enough.

Using network analysis to model the spread of forest fires, the researchers conclude, “the best way to prevent the largest forest fires is to allow the small and medium fires to burn.”

This is because forest fires follow a power law distribution: small fires are more frequent and large fires are rare. Most fires won’t reach 1988 magnitude and will burn themselves out before doing much damage. Allowing these fires to burn mitigates the risk of larger fires – because large fires are more likely in a dense forest.

This logic generalizes to other systems.

A 2008 paper by Adilson E. Motter argued that cascade failures can be mitigated by intentionally removing select nodes and edges after an initial failure.

Cascade failures are typically caused when “the removal of an even small fraction of highly loaded nodes (due to attack or failure)…trigger global cascades of overload failures.” The classic example is a 2003 blackout of the northeast which was triggered by one seemingly unimportant failure. But that one failure led to other failures which led to other failures, and soon a large swath of the U.S. had lost power.

Motter argues that such cascades can be mitigated by acting immediately after the initial failure – intentionally removing those nodes which put more of a strain on the system in order to protect those nodes that can handle greater loads.

This strategy is not entirely unlike the “let it burn” policy of the park service. Cutting off weak nodes protects the whole and mitigates the risk of larger, catastrophic events.



Phase Transitions in Random Graphs

Yesterday, I attended a great talk on Phase Transitions in Random Graphs, the second lecture by visiting scholar Cris Moore of the Santa Fe Institute.

Now, you may be wondering, “Phase Transitions in Random Graphs”? What does that even mean?

Well, I’m glad you asked.

First, “graph” is the technical math term for a network. So we’re talking about networks here, not about random bar charts or something. The most common random graph is the Erdős–Rényi model developed by Paul Erdős and Alfréd Rényi. (Interestingly, a similar model was developed separately and simultaneously by Edgar Gilbert who gets none of the credit, but that is a different post.)

The Erdős–Rényi model is simple: you have a set number of vertices and you connect two vertices with an edge with probability p.
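In code, the model is just one coin flip per pair of vertices. A minimal sketch, with `erdos_renyi` as a hypothetical helper name:

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=None):
    """Return the edge list of a random graph: each pair is linked with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]
```

With p = 0 no pair is ever connected, and with p = 1 every one of the n(n – 1)/2 possible pairs is.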

Imagine you are a really terrible executive for a new airline company: there are a set number of airports in the world, and you randomly assign direct flights between cities. If you don’t have much start-up capital, you might have a low probability of connecting two cities – resulting in a random smattering of flights. So maybe a customer could fly between Boston and New York or between San Francisco and Chicago, but not between Boston and Chicago. If your airline has plenty of capital, though, you might have a high probability of flying between two cities, resulting in a connected route network allowing a customer to fly from anywhere to anywhere.

The random network is a helpful baseline for understanding what network characteristics are likely to occur “by chance,” but as you may gather from the example above – real networks aren’t random. A new airline would presumably have a strategy for deciding where to fly – focusing on a region and connecting to at least a few major airports.

A phase transition in a network is similar conceptually to a phase transition in a physical system: ice undergoes a phase transition to become a liquid and can undergo another phase transition to become a gas.

A random network undergoes a phase transition when it goes from having lots of disconnected little bits to having a large component.

But when/why does this happen?

Let’s imagine a random network with nodes connected with probability p. In this network, p = k/n where k is a constant and n is the number of nodes in the network. We would then expect each node to have an average degree of k.

So if I’m a random node in this network, I can calculate the average size of the component I’m in. I am one node, connected to k nodes. Since each of those nodes is also connected to k nodes, that makes k^2 nodes connected back to me. This continues outwards as a geometric series: 1 + k + k^2 + … For k < 1, the geometric series formula tells us that this sum converges to 1 / (1 – k).
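A quick numerical check of that series, as a sketch: add up 1 + k + k^2 + … for a finite number of generations and compare with 1 / (1 – k). The function name and the cutoff of 200 generations are arbitrary choices of mine.

```python
def expected_component_size(k, generations=200):
    """Partial sum 1 + k + k^2 + ... over the given number of generations."""
    return sum(k ** g for g in range(generations))
```

For k = 0.5 the sum settles at 2, matching 1 / (1 – 0.5). As k creeps toward 1, the formula blows up, which is the hint that something dramatic happens there.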

So we would expect something wild and crazy to happen when k = 1.

And it does.

This is called the “critical point” of a random network. It is at this point that a network goes from a random collection of disconnected nodes and small components to having a large component. This is the random network’s phase transition.
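You can watch the transition happen in a small simulation. The sketch below builds a random network with p = k/n and reports what fraction of nodes sit in the largest component, using a union-find structure to track components. The network size and seed are arbitrary choices for illustration.

```python
import random
from collections import Counter
from itertools import combinations

def largest_component_fraction(n, k, seed=0):
    """Fraction of nodes in the largest component of a random graph with p = k/n."""
    rng = random.Random(seed)
    p = k / n
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the component's root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            root_u, root_v = find(u), find(v)
            if root_u != root_v:
                parent[root_u] = root_v  # merge the two components

    sizes = Counter(find(x) for x in range(n))
    return max(sizes.values()) / n

below = largest_component_fraction(1000, 0.5)
above = largest_component_fraction(1000, 2.0)
```

Below the critical point the largest component is a sliver of the network; above it, a single component swallows most of the nodes.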


Mobile Log Data

I had the opportunity today to attend a talk by Jeffrey Boase of the University of Toronto. Boase has done extensive work around mobile log data – having research participants install apps that gather their (anonymized) call data and engaging participants in short, mobile-based surveys.

The motivation for this work can be seen in part from his earlier research – while 40% of mobile phone use studies base their findings on self-reported data, this data correlates only moderately with the server log data. In other words, self-reported data has notable validity issues while log data provides a much more accurate picture.

Of course, phone records of call time and duration lack the context needed to make useful inferences. So Boase works to supplement log data with more traditional data collection techniques.

A research participant, for example, may complete a daily survey asking them to self-report data on how they know a certain person in their address book. Researchers can also probe further, not only getting at familial and social relationships but also asking whether a participant enjoys discussing politics with someone.

By using this survey data in concert with log data, Boase can build real-time social networks and track how they change.

His current work, the E-Rhythms Project, seeks to provide a rich understanding of mobile phone based peer bonding during adolescence and its consequences for social capital using an innovative data collection technique that triangulates smartphone log data, on-screen survey questions, and in-depth interviews.