Monthly Archives: September 2016

Node Overlap Removal by Growing a Tree

I recently read Lev Nachmanson, Arlind Nocaj, Sergey Bereg, Leishi Zhang, and Alexander Holroyd’s article on “Node Overlap Removal by Growing a Tree,” which presents a really interesting method.

Using a minimum spanning tree to deal with overlapping nodes seems like a really innovative technique. It made me wonder how the authors came up with this approach!

As outlined in the paper, the algorithm begins with a Delaunay triangulation on the node centers – more information on Delaunay triangulations here – but its essentially a maximal planar subdivision of the graph: eg, you draw triangles connecting the centers of all the nodes.

From here, the algorithm finds the minimal spanning tree, where the cost of an edge is defined so that greater node overlap the lower the cost. The minimal spanning tree, then, find the maximal overlaps in the graph. The algorithm then “grows” the tree: increasing the cost of the tree by lengthening edges. Starting at the root, the lengthening propagates outwards. The algorithm repeats no overlaps exist on the edge of the triangulation.

Impressively, this algorithm runs in O(|V|) time per iteration, making it a fast as well as an effective algorithm.

On the Rights of a Black Man

I was struck by a comment from today’s coverage of the shooting death of an unarmed black man. To be clear, this was coverage of the death of an unarmed black man – whose name has not yet been released – in San Diego; not the recent shooting of Keith Scott in Charolette, or of Terence T. Crutcher in Tulsa.

In San Diego, a woman called 911 to get help for her mentally ill brother. Details are contested, but police shot and killed the man they’d been called to help.

In an interview this morning, a woman protesting the murder said: “Because he was black he automatically had no rights.”

That was a profound statement.

Because he was black, he automatically had no rights.

Regardless of whether or not you agree with that statement, the mere perception of that reality should be disturbing. And, incidentally, if you don’t agree with that statement, it is worth noting that it is factually indisputable that many, many unarmed black men have been killed by police under questionable circumstances.

We are a country that prides itself on individual rights, inalienable rights. Rights that can never, ever, be taken away from us.

Unless you are black.

Because he was black, he automatically had no rights.

Presidential Debate Quick Take

While I could endlessly pontificate about last night’s presidential debate, there’s not much I could add that hasn’t already been said by the many, many pundits and posters covering this race.

So I decided for today to do a very quick analysis of the debate transcript, as provided by the New York Times.

The transcript captures three speakers – Clinton, Trump, and moderator Lester Holt; and three interactions – crosstalk, laughter, and applause. The audience was under clear instructions to neither laugh nor applaud, but they did so anyway, getting, I think, a bit rowdier as the night went on.

The transcript watched 5 instances of audience laughter – 4 in response to Clinton and 1 in response to Trump (“I also have a much better temperament than she has, you know?”). Of the 12 instances of applause, 4 were in response to the moderator, 3 were in response to Clinton, and 5 were in response to Trump.

For crosstalk, the meaning is a little less clear – crosstalk is marked after 4 Trump comments, 3 Holt comments, and 1 Clinton comment…but this doesn’t explicitly indicate who was the actual interrupter.

While some have argued that Holt did an insufficient job of keeping time, Clinton and Trump did have about equal coverage – at least in terms of word count. Clinton spoke slightly less, using a total of 2403 words to Trump’s 2951. Interestingly, Clinton used more unique words – 788 to Trump’s 730.

And if you’re wondering, Lester Holt spoke a total of 1431 words, 481 of which were unique.

Using a simple log-likelihood technique, we can look at which words are most distinctive by speaker. That is, by comparing the frequency of words in one speaker’s text to the full transcript, we can see which words are over represented in that subsample.

In the role of moderator, for example, we see that Holt was much more likely to use words like “Mr”, “question”, “segment” and “minutes.”

Typically, you’d use log-likelihood on a much larger corpus, but it can still be fun for a single debate transcript.

Among Clinton’s most distinctive words were: “right”, “war”, and “country”

Among Trump’s most distinctive words were: “business”, “new”, and “judgment”. (Note that “bigly” does not appear in the transcript, since he actually said “big league”.)

This is a very rudimentary text analysis, but its still interesting to think about what we can learn from these simple assessments.

Big Tent Social Justice

I’ve been trained to think like a marketer, and I tend, at times, to think of social justice efforts through this lens too.

That is, if you’re trying to bring about behavior change among a large portion of the population, what communication strategies and tactics do you use to bring about this change? This way of thinking is somewhat distasteful given the manipulative reputation of marketing as a profession, but I find it useful nonetheless.

From this perspective, the strategy of a social justice movement would be to appeal to the largest possible number of people – to welcome everyone under a “big tent” vision of the cause. If this is your goal, then the strategy becomes relatively straightforward: create messages with broad appeal, take actions which generate sympathy, in all things go for the broadest reach and broadest appeal possible.

This is all very reasonable from a marketing perspective.

However, there’s a problem with this approach: the bigger your tent, the more diluted your vision. The more you try to please a broad group of people, the more you will have to relax your core stance.

This balance applies to any issue, not just social justice. Robert Heinlein used to argue that it was impossible to make a decision if more than 3 people are involved. Any time you have a large number of people in one place, the number of things they can really, deeply agree to will be minimal.

If you’re a marketer trying to maximize your profits, find the right balance takes skill but is relatively straightforward: appeal to the largest number of people possible while also creating a coherent brand identity. There’s a trade-off between the two, but no real sacrifice either way.

The calculation is more complex when it comes to social justice: just how much are you willing to let go?

This is an important question with a non-trivial answer: appeal to many people and you increase your chances of accomplishing something – but you also make it more likely that what you accomplish will be a toothless, meaningless shadow of your original goal.

There are varied opinions on which side of this spectrum it’s better to be on, and there’s no easy answer. When doing nothing is disastrous, is it better to accomplish something ineffective or to accomplish nothing at all?

Perhaps doing something is better than doing nothing; or perhaps an empty victory only serves to alleviate the sense that something needs to be done – making it virtually impossible for any real change to occur.

I don’t have an answer to this question – certainly not a generalizable one which could be applied to any issue at any time. But I do think that both arguments are reasonable – that we must appreciate the efforts of all who strive towards social justice and to value their input and perspective – even when we disagree.

Representing the Structure of Data

To be perfectly honest, I had never thought much about graph layout algorithms. You hit a button in Gephi or call a networkx function, some magic happens, and you get a layout. If you don’t like the layout generated, you hit the button again or call a different function.

In one of my classes last year, we generated our own layouts using eigenvectors of the Laplacian. This gave me a better sense of what happens when you use a layout algorithm, but I still tended to think of it as a step which takes place at the end of an assignment; a presentation element which can make your research accessible and look splashy.

In my visualization class yesterday, we had a guest lecture by Daniel Weidele, PhD student at University of Konstanz and researcher at IBM Watson. He covered fundamentals of select network layout algorithms but also spoke more broadly about the importance of layout. A network layout is more than a visualization of a collection of data, it is the final stage of a pipeline which attempts to represent some phenomena. The whole modeling process abstracts a phenomena into a concept, and the represents that concept as a network layout.

When you’re developing a network model for a phenomenon, you ask questions like “who is your audience? What are the questions we hope to answer?” Daniel pointed out that you should ask similar questions when evaluating a graph layout; the question isn’t just “does this look good?” You should ask: “Is this helpful? What does it tell me?”

If there are specific questions you are asking your model, you can use a graph layout to get at the answers. You may, for example, ask: “Can I predict partitioning?”

This is what makes modern algorithms such as stress optimization so powerful – it’s not just that they produce pretty pictures, or even that they layouts appropriately disambiguate nodes, but they actually represent the structure of the data in a meaningful way.

In his work with IBM Watson, Weidele indicated that a fundamental piece of their algorithm design process is building algorithms based on human perception. For a test layout, try to understand what a human likes about it, try to understand what a human can infer from it – and then try to understand the properties and metrics which made that human’s interpretation possible.

Large Graph Layout Algorithms

Having previously tried to use force-directed layout algorithms on large networks, I was very intrigued by

Stefan Hachul and Michael Junger’s article Large Graph-Layout Algorithms at Work: An Experimental Study. In my experience, trying to generate a layout for a large graph results in little more than a hairball and the sense that one really ought to focus on just a small subgraph.

With the recent development of increasingly sophisticated layout algorithms, Hachul and Junger compare the performance of several classical and more recent algorithms. Using a collection graphs – some relatively easy to layout and some more challenging – the authors compare the runtime and aesthetic output.

All the algorithms strive for the same aesthetic properties: uniformity of edge length, few edge crossings, non-overlapping nodes and edges, and the display of symmetries – which makes aesthetic comparison measurable.

Most of the algorithms performed well on the easier layouts. The only one which didn’t was their benchmark Grid-Variant Algorithm (GVA), a spring-layout which divides the drawing area into a grid and only calculates the repulsive forces acting between nodes that are placed relatively near to each other.

For the harder graphs, they found that the Fast Multipole Multilevel Method (FM3) often produced the best layout, though it is slower than High-Dimensional Embedding (HDE) and the Algebraic Multigrid Method (ACE), which can both produce satisfactory results. Ultimately, Hachul and Junger recommend as practical advice: “first use HDE followed by ACE, since they are the fastest methods…if the drawings are not satisfactory or one supposes that important details of the graph’s structure are hidden, use FM3.”

What’s interesting about this finding is that HDE and ACE both rely solely on linear algebra rather than the physical analogies of force-directed layouts. FM3, on the other hand – notably developed by Hachul and Junger – is force-directed.

In ACE, the algorithm minimizes the quadratic form of the Laplacian (xTLx), finding the eigenvectors of L that are associated with the two smallest eigenvalues. Using an algebraic multigrid algorithm to calculate the eigenvectors makes the algorithm among the fastest tested for smaller graphs.

By far the fastest algorithm was HDE, which takes a really interesting, two-step approach. First approximating a high-dimensional k-clustering solution and then projecting those clusters into 2D space by calculating the eigenvectors of the covariance matrix from the clusters. The original paper describing the algorithm is here.

Finally, the slower but more aesthetically reliable FM3 algorithm improves upon classic force-direct approaches by relying on an important assumption: in large graphs, you don’t necessarily have to see everything. In this algorithm, “subgraphs with a small diameter (called solar systems) are collapsed” resulting in a final visualization which captures the structure of the large network with the visual ease of a smaller network.

Family Migration

My mother is big genealogy enthusiast. Her enthusiasm, however, is unmatched by many in our family. I mean, we’re glad someone‘s doing the research, but we don’t seem to get that same spark of awe which for her is so inherent to the process.

So she wrote this thing, exploring what we learn from genealogy; the individual effects of the great sweeps of history.

I share it here with permission.


Much of what we can know of our ancestors has to be suppositioned by our knowledge of history and prevalent customs of their times.  The lives of most of these people did not make a big splash.  Many were illiterate and, as a result, did not leave much documentation of their experiences.  Life was frequently short due to lack of medical discoveries that are so much a part of our current lives.  But I would suspect that they lived more in the moment than we do now.  So much was out of their control, they had little choice but to focus on the things that were occurring, leaving the rest in the hands of God. Given the difficulties of sustaining life, as an aggregate, what they accomplished was remarkable.

Each era had its own agenda, but the commonalities remained the same for several hundred years of our country’s history. People married young, reproduced prolifically, and attempted to provide for their children.  Many of the accomplishments were a result of this drive. Starting with the Pilgrims and the indentured servants of the South, our ancestors were seeking an environment that would provide opportunities to build lives for themselves and their progeny.  The hardships they endured were more than many of us could even consider, but the conditions that existed in their current lives spurred them forward. Whether it was religious convictions or abject poverty, they all made the leap.  They got on a boat with little real knowledge of what they would meet and forged our future. The heroes are the many people who will never appear on a family tree because they died in the process of trying to reach this goal.  It was the sheer numbers of people who would make this attempt that made our ancestors successful. It created a movement that would be reproduced in 1700’s and the 1800’s and would populate this country with Europeans.

Family life could have many variables, but the notion of childhood is very recent idea. Children were loved and cherished, but expected to participate in sustaining livelihood from a very early age. Frequently their mothers, and sometimes their fathers, were nothing but children themselves. Many people had multiple marriages due to the loss of a spouse. You could not remain unmarried once you had begun family building. It was impossible to continue the care of your children and the growth of your farm without two adults, and, in many cases, young adults to keep up the work. The family home would be small and bursting with people.  Every space would be a bedroom at night.  This caused the older children at a young age to look to new options. Friends and relatives who had moved further west would encourage these young people to join them.  Single men and women would be told that there were many marriageable people waiting to find mates.  Young couples would frequently be intermarried with sisters or brothers of a specific family with the thought that when they emigrated they would remain together. The reality of this westward movement was the disintegration of the original family unit.

Communication would be difficult. In this era of cell phones and e-mail, it is hard to imagine the absolute impossibility of keeping in touch.  Frequently, several siblings would end up in the same location over time.  The first arrivals would help their siblings establish themselves, and then they set to the task of recreating a new network of relations. The ones left at home were usually so young that they had little feeling for their older siblings.  If the family was large enough, another move would be made by the younger children resulting in a separate conglomeration in a new location. Frequently an aged parent would die in the home of a child in the second migration.

There were, of course, some people who remained in the same location generation after generation.  When you reconstruct the families through genealogy, they are our cousins.  Our ancestors were the ones on the move.

Design Aesthetic and Chart Junk

In my visualization class today, we had a guest lecture by Michelle Borkin, another Northeastern professor who works in the field of information and scientific visualization. She gave us a great overview of the foundational design aesthetics of Edward Tufte.

Whether you know him by name or not, you may be familiar with some of his principles. He writes extensively about “graphical integrity,” highlighting the importance of clearly labeling of data and cautioning against distorted or misleading axes. But, perhaps more fundamentally, the Tufte-ian mantra seems to be summed in one word: simplify.

Tufte advocates for removing as much extraneous ink as possible. Non-data ink should be minimized as much of possible; clearing away the clutter and letting the data speak for themselves.

Generally, his arguments make sense – there’s no need to create a 3D bar-chart just because Microsoft Office says that you can. But in this day of infographics and data journalism, Tufte’s style can seem rather…dull.

This has led to a great debate over chart junk: a topic so real it has its own wikipedia page. “Chart junk” refers to any element of a visualization which doesn’t explicitly need to be there – elements which may make the visualization more interesting, but which don’t directly convey the data. The term was actually coined by Tufte, who, as you may have guessed, was adamantly anti-chart junk.

Recent research, though, has shown that “chart junk” isn’t necessarily inherently bad. Infographics and other visualizations designed for broad public consumption may not have the precision of a scientific visualizations, but they are more memorable and impactful.

Is chart junk okay? The answer, I guess, depends entirely on the audience, the task, and the context.

The Effects of Interactive Latency on Exploratory Visual Analysis

In their paper, Zhicheng Liu and Jeffrey Heer explore “The Effects of Interactive Latency on Exploratory Visual Analysis” – that is, how user behavior changes with system response time. As the authors point out, while it seems intuitively ideal to minimize latency, effects vary by domain.

In strategy games, “latency as high as several seconds does not significantly affect user performance,” most likely because tasks which “take place at a larger time scale,” such as “understanding game situation and conceiving strategy” play a more important role in affecting the outcome of a game. In a puzzle game, imposed latency caused players to solve the puzzle in fewer moves – spending more time mentally planning their moves.

These examples illustrate perhaps the most interesting aspect of latency: while it’s often true that time delays will make users bored or frustrated, that is not the only dimension of effect. Latency can alter the way a user thinks about a problem; consciously or unconsciously shifting strategies to whatever seems more time effective.

Liu and Heer focus on latency effecting “knowledge discovery with visualizations,” a largely unexplored area. One thing which makes this domain unique is that “unlike problem-solving tasks or most computer games, exploratory visual analysis is open-ended and does not have a clear goal state.”

The authors design an experimental setup in which participants are asked to explore two different datasets and “report anything they found interesting, including salient patterns in the visualizations, their interpretations, and any hypotheses based on those patterns.” Each participant experienced an additional 500ms latency in one of the datasets. They recorded participant mouse clicks, as well as 9 additional “application events,” such as zoom and color slider, which capture user interaction with the visualization.

The authors also used a “think aloud protocol” to capture participant findings. As the name implies, a think aloud methodology asks users to continually describe what they are thinking as they work. A helpful  summary of the benefits and downsides of this methodology can be found here.

Liu and Heer find that latency does have significant effects: latency decreased user activity and coverage of the dataset, while also “reducing rates of observation, generalization and hypothesis.” Additionally, users who experienced the latency earlier in the study had “reduced rates of observation and generalization during subsequent analysis sessions in which full system performance was restored.”

This second finding lines up with earlier research which found that a delay of 300ms in web searches reduced the number of searches a user would perform – a reduction which would persist for days after latency was restored to previous levels.

Ultimately, the authors recommend “taking a user-centric approach to system optimization” rather than “uniformly focusing on reducing latency” for each individual visual operation.

Democratic Distributions

Gaussian, Poisson, and other bell-shaped distributions are some times called “democratic.” This colloquial term is intended to indicate an important feature: an average value is a typical value.

Compare this to heavy-tailed distributions which follow generally the so-called 80/20 rule: 80% of your business comes from 20% of your clients, 80% of the wealth is controlled by 20% of the population. Indeed, this principle was originally illustrated by Italian economist Vilfredo Pareto when he demonstrated that 80% of the land in Italy was owned by 20% of the population.

In these distributions, an average value is not typical: the average household income doesn’t mean much when a small group of people are vastly more wealthy than the rest. This skew can be shown mathematically: in a bell curve, the variance – which measures the spread of a distribution – is well defined, while it diverges for a heavy-tailed distribution.

Yet while heavy-tailed distributions are clearly not democratic, I’m still struck by the use of the term for normal distributions. I’m not sure I’d call those distributions democratic either.

I’m particularly intrigued by the use of the word “democratic” to nod to the idea of things being the same. Indeed, such bell-shaped distributions are known primarily for being statistically homogeneous.

That’s starting to border on some Harrison Bergeron imagery, with a Handicapper General tasked with making sure that no outliers are too intelligent or too pretty.

That’s not democratic at all. Not really.

This, of course, leads me to the question: what would a “democratic” distribution really look like?

I don’t have a good answer for that, but this does raise an broader point about democracy: most real-world systems are heavy-tailed. Properties like hight and weight follow normal distributions, but power, money, and fame are heavy-tailed.

So the real question isn’t what a democratic distribution looks like; it is how do we design a democratic system in a complex system that is inherently undemocratic?