Frequently Asked Questions

General

Basic usage

Deeper questions

Troubleshooting


General Questions

Q: What is this?

Semantics-preserving word cloud visulization tool. A word cloud consists of the most important words in that document. Each word is printed in a given font and scaled by a factor roughly proportional to its importance. The printed words are arranged without overlap and tightly packed into rectangular shape. Your word clouds can be tweaked with various fonts, layouts, and color schemes. Read the description for more details.

How is it different from Wordle?

Our clouds are semantics-aware. In Wordle (or other similar tool), the placement of the words is completely independent of their context and the coloring is random. However, when visualizing the given text with a word cloud, it is possible to automatically identify groups of semantically related words, or the major topics in the input text. Our system places similar words close to each other and assigns them the same color. This simplifies visual analysis of the text. See an example of a random (left) and semantic (right) word placement.

How do you make a word cloud?

The input text is parsed and tokenized into a collection of words. The common stop-words ("a", "the", "is") are removed, while the remaining words are grouped by their stems. Then the words are ranked in order of their importance in the text and the font size for every word is calculated. Next semantic similarities between pairs of the words are computed, based on co-occurrence in the same sentences. Then similar words are grouped together and the groups get different colors. Finally, we layout the words with an algorithm, which employs a theoretical model for computing semantic word clouds.

Basic usage

Q: Can I save the word cloud as jpeg/gif/png/svg/pdf?

A cloud is generated as SVG file, which can be downloaded by using a link at the top-right corner of the cloud. You may also download PNG and PDF versions of the file. In order to convert it to another format, use a vector graphics editor, like Inkscape.

Q: Is there a way to edit the cloud once it is created?

Yes! You may drag and reposition the words on the screen. Actually, you can even remove words, by right-clicking on them and using the popup menu. It is also possible to re-layout the cloud by pressing Apply New Options button.

Q: Can you visualize a webpage or a blog?

You may enter any text, URL of a blog or a webpage. Maximum size for pasting is 500 kilobytes but mostly depends on your browser and computer.

Q: What about my document? How about Twitter? Reddit? Google? YouTube?

We can create a cloud using several sources of the input text:

Q: Does it work for non-English languages?

The tool supports Unicode and many languages. You may choose the language of your source text in the Advanced Options section. The chosen language affects the way how the input text is parsed: splitting into sentences, tokenization into words, and stop-word removal. Please note that some of the features are not supported for non-English languages; for example, the TF-IDF ranking can only work for English, as it utilizes Brown corpus.

Deeper questions

Q: How do you compute similarities between the words?

Given the list of words, we calculate a matrix of pairwise similarities so that related words receive high similarity values. We use three similarity functions: Cosine Similarity, Jaccard Similarity, and Lexical Similarity.

Q: How do you determine sizes of the words?

We rank the words in order of their importance in the input text. We use three different ranking functions. Term Frequency is the most basic ranking function and one used in most traditional word cloud visualizations. Term frequency tends to rank highly many semantically meaningless words. Term Frequency-Inverse Document Frequency addresses this problem by normalizing the frequency of a word by its frequency in a larger text collection. The third ranking function is based on the LexRank algorithm, which is a graph-based method for computing relative importance of textual units using eigenvector centrality.

Q: How do you position the words?

We provide several algorithms for computing a layout of the words:

Note that the last two algorithms always produce provably near-optimal results, though sometimes the clouds are not as pleasant as the clouds generated with heuristics.

Q: How do you color the words?

We cluster the words according to their semantic meaning, and then use different colors for the computed clusters. Thus, semantically related groups of words (e.g., gene, protein, disease, metabolism) are likely to have the same coloring. This is an intuitive way to visually identify major topics in the input text. To identify the clusters, we employ the modularity-based algorithm.

Q: I want to know more about your algorithms!

The system relies on a geometric problem behind drawing word clouds, which is described in a series of research papers:

Q: May I see the source code?

Yes! Source code for the entire system is available on Github. The system also comes as a command-line tool: download cloudy.jar and invoke "java -jar cloudy.jar [options] [input file]" (without quotes). The available options can be printed by running "java -jar cloudy.jar -?".

Troubleshooting

Q: Who are the people behind the system?

The tool is supported by a group of researchers and developers at the University of Arizona.

Q: How can I contact you?

Email at team1@cs.arizona.edu for questions and suggestions.