In the last few years, word clouds have become a standard tool for abstracting, visualizing, and comparing
text documents. For example, word clouds were used in 2008 to contrast the speeches of the then-US
presidential candidates Obama and McCain. A word
cloud of a given document consists of the most important (or most frequent) words in that document.
Each word is printed in a given font and scaled by a factor roughly proportional to its importance (the
same is done with the names of towns and cities on geographic maps, for example). The printed words
are arranged without overlap and tightly packed into some shape (usually a rectangle).
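The scaling rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: importance is taken to be raw word frequency, the scaling is linear, and the function name `font_sizes` and the parameters `top_n`, `min_pt`, and `max_pt` are hypothetical rather than part of any particular tool.

```python
from collections import Counter
import re


def font_sizes(text, top_n=5, min_pt=10.0, max_pt=48.0):
    """Map the top_n most frequent words to font sizes in points."""
    # Crude tokenization: lowercase runs of ASCII letters (an assumption).
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words).most_common(top_n)
    if not counts:
        return {}
    top = counts[0][1]  # frequency of the most frequent word
    # Linear scaling: the most frequent word gets max_pt, the others get
    # proportionally less, but never less than min_pt.
    return {word: max(min_pt, max_pt * count / top) for word, count in counts}


sizes = font_sizes("a a a a b b c", top_n=3)
```

Real tools often use a non-linear (for example, logarithmic or square-root) scale so that a few very frequent words do not dwarf all others; the linear rule is used here only because it matches the "roughly proportional" description.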
Practical tools such as Wordle, with its high-quality design, graphics, style, and functionality, popularized
word cloud visualizations as an appealing way to summarize the content of a webpage, a research paper,
or a political speech. Although such tools are popular, most of them share a potential
shortcoming: they do not visualize the relationships between the words in any way, as the placement of
the words is completely independent of their context. But humans, as natural pattern-seekers, cannot help
but perceive two words that are placed next to each other in a word cloud as being related in some way.
In linguistics and in natural language processing, if a pair of words often appears together in a sentence,
this is seen as evidence that the two words are linked semantically. When visualizing a given
text with a word cloud, it therefore makes sense to place such related pairs of words close to each other.
Doing so helps the reader visually identify the major topics in the input text.
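As a rough illustration of the co-occurrence idea, the sketch below counts, for every pair of words, the number of sentences in which both appear. It is not the system's actual similarity measure: the function name `cooccurrence_counts` is hypothetical, and the naive sentence splitting and tokenization are simplifying assumptions.

```python
from collections import Counter
from itertools import combinations
import re


def cooccurrence_counts(text):
    """Count, for each unordered word pair, the sentences containing both words."""
    counts = Counter()
    # Naive sentence split on '.', '!' and '?' (an assumption; real text
    # needs a proper sentence tokenizer).
    for sentence in re.split(r"[.!?]+", text):
        # Distinct words of the sentence, sorted so each pair has one canonical order.
        words = sorted(set(re.findall(r"[a-z]+", sentence.lower())))
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts


counts = cooccurrence_counts(
    "Graphs have edges. Graphs have vertices. Trees have edges."
)
```

Pairs with high counts, such as `("graphs", "have")` here, are candidates for being placed near each other. In practice, very common words like "have" would be filtered out first, and the raw counts would typically be normalized before being used as a similarity score.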
Word clouds generated from the titles of papers from FOCS, 1993-2013. Left: the result produced by the Wordle tool, in which word placement, orientation, and colors are chosen arbitrarily. Right: a semantics-preserving word cloud, in which semantically related words are drawn together and colored according to the automatically extracted clusters.
The system creates word clouds from several sources of textual data. The simplest source is a text
document entered by a user. Users may also specify the URL of a webpage or a link to a PDF document; in this case,
a word cloud is constructed from the extracted text. Another option is to specify a link to a YouTube video or a Reddit discussion.
In this scenario, the system parses all comments for the video or discussion and produces a "comment cloud".
Before constructing a word cloud, the input text is preprocessed using the following steps:
The system relies on a geometric model for drawing word clouds in which semantically related words are placed close to each other. The formal model is described in a series of research papers: