Semantics-preserving word cloud visualization tool. A word cloud consists of the most important words in a document. Each word is printed in a given font and scaled by a factor roughly proportional to its importance. The printed words are arranged without overlap and tightly packed into a rectangular shape. Your word clouds can be tweaked with various fonts, layouts, and color schemes. Read the description for more details.
Our clouds are semantics-aware. In Wordle (and other similar tools), the placement of the words is completely independent of their context, and the coloring is random. In contrast, when visualizing a given text with a word cloud, it is possible to automatically identify groups of semantically related words, that is, the major topics of the input text. Our system places similar words close to each other and assigns them the same color, which simplifies visual analysis of the text. See an example of a random (left) and semantic (right) word placement.
The input text is parsed and tokenized into a collection of words. Common stop-words ("a", "the", "is") are removed, and the remaining words are grouped by their stems. The words are then ranked by their importance in the text, and a font size is calculated for each word. Next, semantic similarities between pairs of words are computed, based on co-occurrence in the same sentences. Similar words are then grouped together, and each group is assigned its own color. Finally, we lay out the words with an algorithm that employs a theoretical model for computing semantic word clouds.
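The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the stop-word set, the regex tokenizer, and the font-size scaling constants are all illustrative stand-ins.

```python
import re
from collections import Counter

# Illustrative stop-word list; the real tool uses a much larger one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "in", "to"}

def extract_words(text, max_words=50):
    """Tokenize, drop stop-words, and count term frequencies."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(max_words)

def font_size(count, max_count, min_pt=10, max_pt=56):
    """Scale a word's font size roughly proportionally to its frequency."""
    return min_pt + (max_pt - min_pt) * count / max_count

words = extract_words("The gene regulates the protein; the protein binds the gene.")
top, top_count = words[0]
print(top, font_size(top_count, top_count))  # most frequent word gets max size
```

A real pipeline would additionally group words by their stems and rank them with one of the functions described later, but the flow is the same: tokenize, filter, count, scale.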
A cloud is generated as an SVG file, which can be downloaded via the link at the top-right corner of the cloud. You may also download PNG and PDF versions of the file. To convert it to another format, use a vector graphics editor such as Inkscape.
Yes! You may drag and reposition the words on the screen. You can even remove words by right-clicking on them and using the popup menu. It is also possible to re-layout the cloud by pressing the Apply New Options button.
You may enter any text or the URL of a blog or webpage. The maximum size for pasted text is 500 kilobytes, though the practical limit depends on your browser and computer.
We can create a cloud from several sources of input text:
The tool supports Unicode and many languages. You may choose the language of your source text in the Advanced Options section. The chosen language affects how the input text is parsed: splitting into sentences, tokenization into words, and stop-word removal. Please note that some features are not supported for non-English languages; for example, TF-IDF ranking works only for English, as it utilizes the Brown corpus.
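To make the language dependence concrete, here is a hedged sketch of language-aware parsing. The tiny per-language stop-word sets and the naive regex-based sentence splitter are illustrative only; the actual tool uses proper per-language resources.

```python
import re

# Illustrative per-language stop-word sets (the real lists are much larger).
STOP_WORDS = {
    "en": {"a", "an", "the", "is", "and"},
    "de": {"der", "die", "das", "ist", "und"},
}

def split_sentences(text):
    """Naive sentence splitter on terminal punctuation."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def tokenize(sentence, lang="en"):
    """Lowercase word tokens with the chosen language's stop-words removed."""
    tokens = re.findall(r"\w+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS.get(lang, set())]

sents = split_sentences("Das ist ein Wort. Und das ist ein Satz!")
print([tokenize(s, lang="de") for s in sents])
```

Switching the `lang` parameter changes which words are filtered out, which is exactly why choosing the wrong language in Advanced Options degrades the resulting cloud.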
Given the list of words, we calculate a matrix of pairwise similarities so that related words receive high similarity values. We use three similarity functions: Cosine Similarity, Jaccard Similarity, and Lexical Similarity.
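Two of these similarity functions are easy to illustrate over sentence co-occurrence. In the sketch below (an illustration, not the tool's code), each word is represented by the set of sentence indices in which it appears; Jaccard similarity compares the sets directly, while cosine similarity treats them as binary occurrence vectors.

```python
import math

def jaccard(sents_a, sents_b):
    """|A ∩ B| / |A ∪ B| over the two words' sentence sets."""
    union = sents_a | sents_b
    return len(sents_a & sents_b) / len(union) if union else 0.0

def cosine(sents_a, sents_b):
    """Cosine of the words' binary sentence-occurrence vectors."""
    if not sents_a or not sents_b:
        return 0.0
    return len(sents_a & sents_b) / math.sqrt(len(sents_a) * len(sents_b))

# Hypothetical data: "gene" occurs in sentences 0,1,2; "protein" in 1,2,3.
gene, protein = {0, 1, 2}, {1, 2, 3}
print(jaccard(gene, protein), cosine(gene, protein))  # 0.5 and 2/3
```

Words that never share a sentence get similarity 0 and end up far apart in the layout; words that co-occur often get values near 1 and are drawn close together.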
We rank the words by their importance in the input text, using three different ranking functions. Term Frequency is the most basic ranking function and the one used in most traditional word cloud visualizations; however, it tends to rank many semantically meaningless words highly. Term Frequency-Inverse Document Frequency addresses this problem by normalizing the frequency of a word by its frequency in a larger text collection. The third ranking function is based on the LexRank algorithm, a graph-based method for computing the relative importance of textual units using eigenvector centrality.
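The contrast between the first two rankings can be shown in a short sketch. This is an assumption-laden illustration: the `background` document-frequency table stands in for a large reference collection such as the Brown corpus, and the exact TF-IDF weighting formula used by the tool may differ.

```python
import math
from collections import Counter

def tf_rank(tokens):
    """Plain term frequency."""
    return Counter(tokens)

def tf_idf_rank(tokens, background, n_docs):
    """Down-weight words that are also frequent in the background collection."""
    tf = Counter(tokens)
    return {w: c * math.log(n_docs / (1 + background.get(w, 0)))
            for w, c in tf.items()}

tokens = ["gene", "gene", "protein", "said"]
background = {"said": 900, "gene": 2}   # hypothetical document frequencies
scores = tf_idf_rank(tokens, background, n_docs=1000)
print(max(scores, key=scores.get))
```

Under plain TF, a common word like "said" can rank alongside topical words; TF-IDF pushes it down because it appears in almost every background document.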
We provide several algorithms for computing a layout of the words:
We cluster the words according to their semantic meaning, and then use different colors for the computed clusters. Thus, semantically related groups of words (e.g., gene, protein, disease, metabolism) are likely to receive the same color. This is an intuitive way to visually identify the major topics in the input text. To identify the clusters, we employ a modularity-based algorithm.
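Modularity-based clustering searches for the partition of the word-similarity graph that maximizes Newman's modularity Q. The sketch below only computes Q for a given partition (the toy graph and partition are illustrative); an actual clustering algorithm would greedily search for the partition with the highest score.

```python
def modularity(edges, communities):
    """Newman modularity of an undirected graph given as (u, v) edge pairs:
    sum over communities of (intra-edges / m) - (community degree / 2m)^2."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    for group in communities:
        group = set(group)
        intra = sum(1 for u, v in edges if u in group and v in group)
        deg = sum(degree[n] for n in group)
        q += intra / m - (deg / (2 * m)) ** 2
    return q

# Two topic-like clusters of words joined by a single bridging edge.
edges = [("gene", "protein"), ("protein", "disease"), ("gene", "disease"),
         ("font", "color"), ("font", "layout"), ("color", "layout"),
         ("disease", "font")]
good = [{"gene", "protein", "disease"}, {"font", "color", "layout"}]
print(modularity(edges, good))
```

Splitting the graph into its two natural groups yields a clearly positive Q, whereas lumping every word into one community gives Q = 0, which is why the maximizing partition recovers the topics.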
The system is based on the geometric problem behind drawing word clouds, which is described in a series of research papers:
Yes! Source code for the entire system is available on GitHub. The system also comes as a command-line tool: download cloudy.jar and run "java -jar cloudy.jar [options] [input file]" (without quotes). The available options can be printed by running "java -jar cloudy.jar -?".
The tool is supported by a group of researchers and developers at the University of Arizona.
Email firstname.lastname@example.org with questions and suggestions.