Cloud Words
I have thrown together a simple proof of concept (POC) in CakePHP for displaying cloud words. What are cloud words? They are word frequencies displayed in a cloud, similar to a tag cloud. An example of a tag cloud can be found in the right column of this blog. The POC may be tested here.
What does it do?
Cloud Words allows plain text and MS Word files to be uploaded and analyzed. MS Word files are converted to plain text before the analysis. The analysis simply counts how many times each word occurs in the file. A raw word count is not very useful on its own, because the most common words dominate the list and say little about the contents of a document. “The” is the single word found most often in documents.
To make the word count more useful, some filtering must be done. For this POC I apply only a single filter, though for real-life usage I would recommend additional filters. The filter used in this POC removes the 200 most common words from the word count. The filtered count is saved and associated with the file.
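The counting and filtering step can be sketched in a few lines. This is a minimal Python illustration, not the POC's CakePHP code: the function name `count_words` and the short `COMMON_WORDS` set are assumptions standing in for the real list of 200 common words.

```python
import re
from collections import Counter

# Hypothetical stand-in for the POC's list of the 200 most common words.
COMMON_WORDS = {"the", "of", "and", "a", "to", "in", "is", "that", "it", "for"}

def count_words(text, stop_words=COMMON_WORDS):
    """Count lowercased words in text, skipping any word in stop_words."""
    # Letters only, with an optional apostrophe part such as "don't".
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(w for w in words if w not in stop_words)

counts = count_words(
    "The policy covers the retention of records. Records are kept for a year."
)
```

With this sample sentence, "records" is counted twice and "the" is dropped entirely, which is the effect the filter is meant to have.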
The clouds are created from the word counts. A composite cloud, which sums the counts from all the files, displays the top 100 words. Each file’s cloud displays its top 50 words. In a cloud, the order of the words is randomized and the size of each word is derived mathematically from its count.
Clicking a word shows all the files in which that word is found. The files are listed in order of how often the word occurs in each, and the count is displayed.
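One possible way to build such a cloud is sketched below in Python. The post does not say how size is related to count, so the logarithmic scaling between `min_px` and `max_px` is an assumption, as are the names `composite` and `build_cloud`.

```python
import math
import random
from collections import Counter

def composite(per_file_counts):
    """Sum the word counts of every file into one Counter."""
    total = Counter()
    for counts in per_file_counts.values():
        total.update(counts)
    return total

def build_cloud(counts, top_n=50, min_px=12, max_px=48, seed=None):
    """Return (word, font_size_px) pairs for the top_n words, in random order.

    Assumes every count is a positive integer. Sizes use log scaling,
    which is an assumed choice, not the POC's documented formula.
    """
    top = counts.most_common(top_n)
    if not top:
        return []
    lo = math.log(min(c for _, c in top))
    hi = math.log(max(c for _, c in top))
    span = (hi - lo) or 1.0  # all counts equal: every word gets min_px
    cloud = [
        (word, round(min_px + (math.log(c) - lo) / span * (max_px - min_px)))
        for word, c in top
    ]
    random.Random(seed).shuffle(cloud)  # randomize the display order
    return cloud

# Hypothetical per-file counts, for illustration only.
files = {
    "policy.txt": Counter({"year": 8, "records": 4}),
    "letter.txt": Counter({"year": 2, "letter": 1}),
}
cloud = build_cloud(composite(files), top_n=100, seed=1)
```

Log scaling keeps one very frequent word from shrinking everything else to the minimum size. A linear scale would be an equally reasonable choice for documents with flatter distributions.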
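The lookup behind a word click might look like the following Python sketch. The function name `files_containing` and the sample data are hypothetical. The POC presumably reads its saved counts from storage rather than from an in-memory dictionary.

```python
def files_containing(word, per_file_counts):
    """Return (filename, count) pairs for files containing word, highest count first."""
    hits = [
        (name, counts[word])
        for name, counts in per_file_counts.items()
        if counts.get(word, 0) > 0
    ]
    return sorted(hits, key=lambda item: item[1], reverse=True)

# Hypothetical saved counts, for illustration only.
saved_counts = {
    "letter.txt": {"year": 2, "letter": 1},
    "retention.txt": {"year": 40, "records": 25},
    "summary.txt": {"executive": 3},
}
listing = files_containing("year", saved_counts)
```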
Best comparison usage
This is a very simple solution, so it is easy to get poor or unbalanced results. For the best results, follow these suggestions:
- Documents should be of similar length
- Documents should have a similar format and purpose (for example, all letters or all reports)
- Each document should have a reasonable word distribution
To illustrate what I mean, I will review the test documents I used and how they affected the results. I used five documents: one 1-page letter, one 1-page conversational policy, one 2-page conversational policy, one 1-page executive summary, and one 6-page retention of records policy. The retention of records policy completely overwhelmed all the other documents. This was due partly to its length and partly to its repetition of key words such as “year”.
Recommended improvements
The word count process works well. Improvements can be made in cleaning the resulting word counts and making them more useful. This could be done through a few processes:
- Additional filters applied to the word count list
  - Aggregate plural and singular forms
  - Removal of MS Word formatting clues
  - User-added white list words – must include in count
  - User-added black list words – must exclude from count
- Statistical monitoring
  - Ensure each document has a reasonable word distribution
  - Statistically weight word count by document length
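A few of these ideas can be sketched in Python. Everything here is an assumption made for illustration: the names `apply_filters` and `per_thousand_words`, the deliberately naive plural rule, and the sample numbers. A real implementation would want a proper stemmer or lemmatizer for plurals.

```python
from collections import Counter

def apply_filters(counts, whitelist=(), blacklist=()):
    """Fold simple plurals into singulars and drop blacklisted words.

    Whitelisted words are never folded or removed.
    """
    whitelist, blacklist = set(whitelist), set(blacklist)
    cleaned = Counter()
    for word, n in counts.items():
        # Naive plural folding: "years" -> "year" only if "year" was also counted.
        if word.endswith("s") and word[:-1] in counts and word not in whitelist:
            word = word[:-1]
        if word in blacklist and word not in whitelist:
            continue
        cleaned[word] += n
    return cleaned

def per_thousand_words(counts, doc_length):
    """Weight counts by document length: occurrences per 1000 words."""
    if doc_length <= 0:
        return {}
    return {word: 1000 * n / doc_length for word, n in counts.items()}

# Hypothetical counts for a long policy document of 4000 words.
raw = Counter({"year": 30, "years": 10, "records": 20, "draft": 5})
cleaned = apply_filters(raw, blacklist={"draft"})
weighted = per_thousand_words(cleaned, doc_length=4000)
```

Weighting by length puts a six-page policy and a one-page letter on the same scale, so a long document no longer dominates the composite cloud simply because it contains more words.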
Some improvements to the cloud display could also be made. These include:
- Show list of word contexts when hovering over word
- Allow white list or key word list highlighting in cloud – different color perhaps
- Highlight words that are found in a comparison document. This could be a job description or call for papers.
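Two of these display ideas are easy to prototype. The Python sketch below assumes hypothetical helper names (`contexts`, `mark_shared`) and leaves out the actual hover and coloring, which would live in the page's HTML and CSS.

```python
import re

def contexts(text, word, width=3):
    """Return snippets showing word with up to width words on each side."""
    tokens = re.findall(r"[A-Za-z']+", text)
    snippets = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            snippets.append(" ".join(tokens[max(0, i - width): i + width + 1]))
    return snippets

def mark_shared(cloud_words, comparison_counts):
    """Pair each cloud word with True if it also occurs in the comparison document."""
    return [(w, comparison_counts.get(w, 0) > 0) for w in cloud_words]

sample = "Records are kept for one year. After a year the records are destroyed."
snippets = contexts(sample, "year", width=2)
flags = mark_shared(["year", "letter"], {"year": 3})
```

The snippets could feed the hover list, and the boolean flag could select a different color for words shared with a job description or call for papers.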






