Semantics-preserving word cloud visualization tool. A word cloud consists of the most important words in a document. Each word is printed in a given font and scaled by a factor roughly proportional to its importance. The printed words are arranged without overlap and tightly packed into a rectangular shape. Your word clouds can be tweaked with various fonts, layouts, and color schemes. Read the description for more details.
Our clouds are semantics-aware. In Wordle (and other similar tools), the placement of the words is completely independent of their context, and the coloring is random. In contrast, when visualizing a given text with a word cloud, it is possible to automatically identify groups of semantically related words, that is, the major topics of the input text. Our system places similar words close to each other and assigns them the same color, which simplifies visual analysis of the text. See an example of a random (left) and semantic (right) word placement.
The input text is parsed and tokenized into a collection of words. Common stop-words ("a", "the", "is") are removed, and the remaining words are grouped by their stems. The words are then ranked by their importance in the text, and a font size is calculated for each word. Next, semantic similarities between pairs of words are computed, based on co-occurrence in the same sentences. Similar words are then grouped together, and each group is assigned its own color. Finally, we lay out the words with an algorithm that employs a theoretical model for computing semantic word clouds.
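The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the stop-word set, the regex tokenizer, and the font-size scaling constants are all illustrative stand-ins.

```python
import re
from collections import Counter

# Illustrative stop-word list; the real tool uses a much larger one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "in", "to"}

def extract_words(text, max_words=50):
    """Tokenize, drop stop-words, and count term frequencies."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(max_words)

def font_size(count, max_count, min_pt=10, max_pt=56):
    """Scale a word's font size roughly proportionally to its frequency."""
    return min_pt + (max_pt - min_pt) * count / max_count

words = extract_words("The gene regulates the protein; the protein binds the gene.")
top, top_count = words[0]
print(top, font_size(top_count, top_count))  # most frequent word gets max size
```

A real pipeline would additionally group words by their stems and rank them with one of the functions described later, but the flow is the same: tokenize, filter, count, scale.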
A cloud is generated as an SVG file, which can be downloaded via the link at the top-right corner of the cloud. You may also download PNG and PDF versions of the file. To convert it to another format, use a vector graphics editor such as Inkscape.
Yes! You may drag and reposition the words on the screen. You can even remove words by right-clicking on them and using the popup menu. It is also possible to re-layout the cloud by pressing the Apply New Options button.
You may enter any text or the URL of a blog or webpage. The maximum size for pasted text is 500 kilobytes, though the practical limit depends on your browser and computer.
We can create a cloud from several sources of input text:
The tool supports Unicode and many languages. You may choose the language of your source text in the Advanced Options section. The chosen language affects how the input text is parsed: splitting into sentences, tokenization into words, and stop-word removal. Please note that some features are not supported for non-English languages; for example, TF-IDF ranking works only for English, as it utilizes the Brown corpus.
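To make the language dependence concrete, here is a hedged sketch of language-aware parsing. The tiny per-language stop-word sets and the naive regex-based sentence splitter are illustrative only; the actual tool uses proper per-language resources.

```python
import re

# Illustrative per-language stop-word sets (the real lists are much larger).
STOP_WORDS = {
    "en": {"a", "an", "the", "is", "and"},
    "de": {"der", "die", "das", "ist", "und"},
}

def split_sentences(text):
    """Naive sentence splitter on terminal punctuation."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def tokenize(sentence, lang="en"):
    """Lowercase word tokens with the chosen language's stop-words removed."""
    tokens = re.findall(r"\w+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS.get(lang, set())]

sents = split_sentences("Das ist ein Wort. Und das ist ein Satz!")
print([tokenize(s, lang="de") for s in sents])
```

Switching the `lang` parameter changes which words are filtered out, which is exactly why choosing the wrong language in Advanced Options degrades the resulting cloud.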
Given the list of words, we calculate a matrix of pairwise similarities so that related words receive high similarity values. We use three similarity functions: Cosine Similarity, Jaccard Similarity, and Lexical Similarity.
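Two of these similarity functions are easy to illustrate over sentence co-occurrence. In the sketch below (an illustration, not the tool's code), each word is represented by the set of sentence indices in which it appears; Jaccard similarity compares the sets directly, while cosine similarity treats them as binary occurrence vectors.

```python
import math

def jaccard(sents_a, sents_b):
    """|A ∩ B| / |A ∪ B| over the two words' sentence sets."""
    union = sents_a | sents_b
    return len(sents_a & sents_b) / len(union) if union else 0.0

def cosine(sents_a, sents_b):
    """Cosine of the words' binary sentence-occurrence vectors."""
    if not sents_a or not sents_b:
        return 0.0
    return len(sents_a & sents_b) / math.sqrt(len(sents_a) * len(sents_b))

# Hypothetical data: "gene" occurs in sentences 0,1,2; "protein" in 1,2,3.
gene, protein = {0, 1, 2}, {1, 2, 3}
print(jaccard(gene, protein), cosine(gene, protein))  # 0.5 and 2/3
```

Words that never share a sentence get similarity 0 and end up far apart in the layout; words that co-occur often get values near 1 and are drawn close together.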
We rank the words by their importance in the input text, using three different ranking functions. Term Frequency is the most basic ranking function and the one used in most traditional word cloud visualizations; however, it tends to rank many semantically meaningless words highly. Term Frequency-Inverse Document Frequency addresses this problem by normalizing the frequency of a word by its frequency in a larger text collection. The third ranking function is based on the LexRank algorithm, a graph-based method for computing the relative importance of textual units using eigenvector centrality.
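The contrast between the first two rankings can be shown in a short sketch. This is an assumption-laden illustration: the `background` document-frequency table stands in for a large reference collection such as the Brown corpus, and the exact TF-IDF weighting formula used by the tool may differ.

```python
import math
from collections import Counter

def tf_rank(tokens):
    """Plain term frequency."""
    return Counter(tokens)

def tf_idf_rank(tokens, background, n_docs):
    """Down-weight words that are also frequent in the background collection."""
    tf = Counter(tokens)
    return {w: c * math.log(n_docs / (1 + background.get(w, 0)))
            for w, c in tf.items()}

tokens = ["gene", "gene", "protein", "said"]
background = {"said": 900, "gene": 2}   # hypothetical document frequencies
scores = tf_idf_rank(tokens, background, n_docs=1000)
print(max(scores, key=scores.get))
```

Under plain TF, a common word like "said" can rank alongside topical words; TF-IDF pushes it down because it appears in almost every background document.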
We provide several algorithms for computing a layout of the words:
We cluster the words according to their semantic meaning, and then use different colors for the computed clusters. Thus, semantically related groups of words (e.g., gene, protein, disease, metabolism) are likely to receive the same color. This is an intuitive way to visually identify the major topics in the input text. To identify the clusters, we employ a modularity-based algorithm.
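Modularity-based clustering searches for the partition of the word-similarity graph that maximizes Newman's modularity Q. The sketch below only computes Q for a given partition (the toy graph and partition are illustrative); an actual clustering algorithm would greedily search for the partition with the highest score.

```python
def modularity(edges, communities):
    """Newman modularity of an undirected graph given as (u, v) edge pairs:
    sum over communities of (intra-edges / m) - (community degree / 2m)^2."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    for group in communities:
        group = set(group)
        intra = sum(1 for u, v in edges if u in group and v in group)
        deg = sum(degree[n] for n in group)
        q += intra / m - (deg / (2 * m)) ** 2
    return q

# Two topic-like clusters of words joined by a single bridging edge.
edges = [("gene", "protein"), ("protein", "disease"), ("gene", "disease"),
         ("font", "color"), ("font", "layout"), ("color", "layout"),
         ("disease", "font")]
good = [{"gene", "protein", "disease"}, {"font", "color", "layout"}]
print(modularity(edges, good))
```

Splitting the graph into its two natural groups yields a clearly positive Q, whereas lumping every word into one community gives Q = 0, which is why the maximizing partition recovers the topics.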
The system is based on the geometric problem behind drawing word clouds, which is described in a series of research papers:
Yes! Source code for the entire system is available on GitHub. The system also comes as a command-line tool: download cloudy.jar and run "java -jar cloudy.jar [options] [input file]" (without quotes). The available options can be printed by running "java -jar cloudy.jar -?".
The tool is supported by a group of researchers and developers at the University of Arizona.
Email firstname.lastname@example.org with questions and suggestions.