Semantic Word Cloud Visualization

System Overview

In the last few years, word clouds have become a standard tool for abstracting, visualizing, and comparing text documents. For example, Word clouds were used in 2008 to contrast the speeches of then US presidential candidates Obama and McCain. A word cloud of a given document consists of the most important (or most frequent) words in that document. Each word is printed in a given font and scaled by a factor roughly proportional to its importance (the same is done with the names of towns and cities on geographic maps, for example). The printed words are arranged without overlap and tightly packed into some shape (usually a rectangle).

Many practical tools, like Wordle, with its high quality design, graphics, style and functionality popularized word cloud visualizations as an appealing way to summarize the content of a webpage, a research paper, or a political speech. While similar tools are popular, most of them have a potential shortcoming: They do not visualize the relationships between the words in any way, as the placement of the words is completely independent of their context. But humans, as natural pattern-seekers, cannot help but perceive two words that are placed next to each other in a word cloud as being related in some way. In linguistics and in natural language processing if a pair of words often appears together in a sentence, then this is seen as evidence that this pair of words is linked semantically. When visualizing the given text with a word cloud, it makes sense to place such related pair of words close to each other. It helps to visually identify major topics in the input text.

Word clouds generated from titles of papers from FOCS, 1993-2013. left: The result produced by the Wordle tool: word placement, orientation, and colors are chosen arbitrarily; right: Semantics-preserving word cloud: semantically related words are drawn together and colored according to the automatically extracted clusters.

Word Cloud Generation

The system creates word clouds using several sources of textual data. The simpliest source is a text document entered by a user. Users may also specify the URL of a webpage or the link to a PDF document. In this case, a word cloud is constructed based on the extracted text. Another option is to specify the link to a YouTube video or a Reddit discussion. For the scenario, the system parses all comments for the video and produces a "comment cloud".
Before constructing a word cloud, the input text is preprocessed using the following steps:

Term Extraction: The input text is first split into sentences, which are then tokenized into a collection of words using Apache OpenNLP. Common stop-words such as "a", "the", "is" are removed from the collection. The remaining words are grouped by their stems using the Porter Stemming Algorithm, so that related words like "dance", "dancer", and "dancing" are reduced to their root, "danc". The most common variation of the word is used in the final word cloud.
Ranking: In the next step we rank the words in order of relative importance. We use three different ranking functions, depending on word usage in the input text. Each ranking function orders words by their assigned weight (rank). Term Frequency is the most basic ranking function and one used in most traditional word cloud visualizations. Even after removing common stop-words, term frequency tends to rank highly many semantically meaningless words. Term Frequency-Inverse Document Frequency addressed this problem by normalizing the frequency of a word by its frequency in a larger text collection. The third ranking function is based on the LexRank algorithm. The algorithm is a graphbased method for computing relative importance of textual units using eigenvector centrality.
Similarity Computation: Given the ranked list of words, we calculate a matrix of pairwise similarities so that related words receive high similarity values. We use three similarity functions depending on the input text: Cosine Similarity, Jaccard Similarity, and Lexical Similarity. In all cases for all pairs of words, the similarity function produces a value between 0, indicating that a pair of words is not related, and 1, indicating that words are very similar.

Once terms, rankings, and similarities are extracted, we create an edge-weghted graph in which vertices correspond to the terms and weights correspond to the computed similarities. For each word, an axis-aligned rectangle is created with height proportional to its rank. Then one of the algorithms for constructing contact representations of the graph is applied. The result is a set of non-overlapping positions for the rectangles with semantically related words laid out close to each other.

Algorithms and Implementation

The system relies on a geometric model behind drawing word clouds in which sementically related words are placed close to each other. The formal model is described in a series of research papers:

Source code for the entire system is available on Github