System Overview

In the last few years, word clouds have become a standard tool for abstracting, visualizing, and comparing text documents. For example, word clouds were used in 2008 to contrast the speeches of then-US presidential candidates Obama and McCain. A word cloud of a given document consists of the most important (or most frequent) words in that document. Each word is printed in a given font and scaled by a factor roughly proportional to its importance (the same is done with the names of towns and cities on geographic maps, for example). The printed words are arranged without overlap and tightly packed into some shape (usually a rectangle).

Practical tools such as Wordle, with its high-quality design, graphics, style, and functionality, popularized word cloud visualizations as an appealing way to summarize the content of a webpage, a research paper, or a political speech. Although such tools are popular, most of them share a potential shortcoming: they do not visualize the relationships between the words in any way, because the placement of each word is completely independent of its context. Humans, however, are natural pattern-seekers and cannot help but perceive two words placed next to each other in a word cloud as being related in some way. In linguistics and in natural language processing, if a pair of words often appears together in a sentence, this is seen as evidence that the two words are semantically linked. When visualizing a text with a word cloud, it therefore makes sense to place such related pairs of words close to each other, which helps the reader visually identify the major topics in the input text.
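The co-occurrence idea can be illustrated with a minimal Python sketch. It is not the system's actual similarity computation: the function name `sentence_cooccurrence`, the naive sentence splitting, and the lowercase letter-only tokenization are all assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def sentence_cooccurrence(text):
    """Count, for each unordered word pair, the sentences containing both words.

    Hypothetical helper: sentences are split naively on '.', '!' and '?',
    and words are lowercase runs of ASCII letters.
    """
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text):
        # Deduplicate and sort so each pair has one canonical (a, b) order.
        words = sorted(set(re.findall(r"[a-z]+", sentence.lower())))
        counts.update(combinations(words, 2))
    return counts


counts = sentence_cooccurrence(
    "Graphs have edges. Graphs have vertices. Trees are graphs."
)
print(counts[("graphs", "have")])
```

Pairs with higher counts are the ones a semantics-preserving layout would try to place close together. A real pipeline would typically normalize these raw counts (for example, by the frequency of each word) before using them as similarities.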

Word clouds generated from titles of papers from FOCS, 1993-2013. Left: the result produced by the Wordle tool, in which word placement, orientation, and colors are chosen arbitrarily. Right: a semantics-preserving word cloud, in which semantically related words are drawn together and colored according to the automatically extracted clusters.

Word Cloud Generation

The system creates word clouds from several sources of textual data. The simplest source is a text document entered by a user. Users may also specify the URL of a webpage or a link to a PDF document, in which case the word cloud is constructed from the extracted text. Another option is to specify a link to a YouTube video or a Reddit discussion. In this scenario, the system parses all comments on the video or discussion and produces a "comment cloud".
Before constructing a word cloud, the input text is preprocessed using the following steps:

Once terms, rankings, and similarities are extracted, we create an edge-weighted graph in which vertices correspond to the terms and edge weights correspond to the computed similarities. For each word, an axis-aligned rectangle is created with height proportional to its rank. Then one of the algorithms for constructing contact representations of the graph is applied. The result is a set of non-overlapping positions for the rectangles, with semantically related words laid out close to each other.
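The graph and rectangle construction described above might be sketched as follows. This is an illustrative sketch under stated assumptions, not the system's code: the names `Rect`, `build_graph`, and `make_rectangles` are hypothetical, as are the parameters `base_height` and `char_aspect`. The text specifies only that height is proportional to rank, so the width rule here (scaling with word length) is an assumption. The layout step itself, the contact-representation algorithm, is not shown.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    word: str
    width: float
    height: float


def build_graph(terms, similarities):
    """Build an undirected edge-weighted graph as an adjacency dict.

    `similarities` maps (term_a, term_b) pairs to a similarity weight.
    Every term becomes a vertex, even if it has no edges.
    """
    graph = {term: {} for term in terms}
    for (a, b), weight in similarities.items():
        graph[a][b] = weight
        graph[b][a] = weight
    return graph


def make_rectangles(ranks, base_height=10.0, char_aspect=0.6):
    """Create one axis-aligned rectangle per word.

    Height is proportional to the word's rank, as in the text. The width
    rule (per-character aspect ratio times word length) is an assumption.
    """
    rects = {}
    for word, rank in ranks.items():
        height = base_height * rank
        width = height * char_aspect * len(word)
        rects[word] = Rect(word, width, height)
    return rects


ranks = {"graph": 1.0, "tree": 0.5, "planar": 0.25}
similarities = {("graph", "tree"): 0.8, ("graph", "planar"): 0.4}
graph = build_graph(ranks, similarities)
rects = make_rectangles(ranks)
```

A layout algorithm would then take `graph` and `rects` as input and assign each rectangle a position, trying to make rectangles joined by heavy edges touch or lie near each other.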

Algorithms and Implementation

The system relies on a geometric model for drawing word clouds in which semantically related words are placed close to each other. The formal model is described in a series of research papers:

Source code for the entire system is available on GitHub.