Last year, we helped the City of Houston launch a cost-savings crowdsourcing program called Bright Ideas.
I was reading through some of the responses, and I got curious about the terminology, priorities, and general language used by City of Houston employees.
So on April 9, 2016, I grabbed a database dump of the Bright Ideas website and followed a modified version of these excellent text mining instructions. Here’s the GitHub repo.
Preparing the Data for Analysis
I used RStudio to prep and analyze the data. I started by removing punctuation throughout the document:
docs <- tm_map(docs, removePunctuation)
Then I converted the text to lowercase:
docs <- tm_map(docs, tolower)
Originally, I tried stemming the text, but it lost some important details along the way. So I eyeballed some of the problems and controlled for the obvious:
for (j in seq(docs)) {
  docs[[j]] <- gsub("\t", " ", docs[[j]])
  docs[[j]] <- gsub("costs", "cost", docs[[j]])
  docs[[j]] <- gsub("customer service", "customer_service", docs[[j]])
  docs[[j]] <- gsub("super bowl", "superbowl", docs[[j]])
  docs[[j]] <- gsub("city of houston", "COH", docs[[j]])
  docs[[j]] <- gsub("city", "COH", docs[[j]])
  docs[[j]] <- gsub("COHs", "COH", docs[[j]])
  docs[[j]] <- gsub("liabilities", "liability", docs[[j]])
  docs[[j]] <- gsub("employees", "employee", docs[[j]])
  docs[[j]] <- gsub("departments", "department", docs[[j]])
  docs[[j]] <- gsub("departmental", "department", docs[[j]])
  docs[[j]] <- gsub("services", "service", docs[[j]])
}
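One caveat with plain gsub patterns is that they also match inside longer words: a bare "city" pattern will hit "capacity" as well. Here is a minimal sketch of the same idea with word boundaries, using base R only. The `replacements` lookup and the `normalize_terms` helper are hypothetical names for illustration, not part of the original script, and the lookup covers only a few of the substitutions above:

```r
# Hypothetical lookup: pattern -> replacement, applied in order.
# Longer phrases come first so "city of houston" is handled before "city".
replacements <- c(
  "city of houston"  = "COH",
  "city"             = "COH",
  "customer service" = "customer_service",
  "costs"            = "cost",
  "employees"        = "employee"
)

# Hypothetical helper: replace whole words or phrases only.
normalize_terms <- function(text, lookup) {
  for (pattern in names(lookup)) {
    # \\b anchors the match at word boundaries, so "capacity" is left alone.
    text <- gsub(paste0("\\b", pattern, "\\b"), lookup[[pattern]], text, perl = TRUE)
  }
  text
}

sample_text <- "the city of houston has the capacity to cut costs"
normalize_terms(sample_text, replacements)

# For comparison, an unanchored pattern rewrites the inside of "capacity".
gsub("city", "COH", "capacity")
```

Whether the extra anchoring matters depends on how often words like "capacity" or "electricity" show up in the responses, so it is worth spot-checking the output either way.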
Finally, I removed all the stop words and eliminated unnecessary white space from the data:
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
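The counting step between cleaning and results isn't shown here, so this is a rough sketch of what it could look like. It uses base R on a toy character vector rather than the tm corpus. `cleaned`, `words`, and `word_counts` are hypothetical names, and the sample text is made up for illustration, not taken from the Bright Ideas data:

```r
# Toy stand-in for the cleaned documents (illustrative text only).
cleaned <- c(
  "review contract cost",
  "department cost review",
  "cost training"
)

# Split each document on whitespace and pool all the words.
words <- unlist(strsplit(cleaned, "\\s+"))
words <- words[nzchar(words)]  # drop empty strings left by stray spaces

# Tally each word and sort from most to least frequent.
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts, 3)
```

With a tm corpus, the same tally is usually taken from a document-term matrix (for example, column sums of `DocumentTermMatrix(docs)`), but the idea is the same: count, then sort.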
Here are the results:

Most popular words by word count.
After the top 15, individual word popularity levels off:
Here's a closer look at some of the subsegments:

Words occurring 35-49 times.

Words occurring 25-34 times.

Words occurring 17-24 times.
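The segments above are just frequency bins, and one way to build them is with `cut`. This is a sketch under assumptions: the counts in `freq` are made up for illustration and are not the Bright Ideas results, and the bin edges simply mirror the segment labels used in this post.

```r
# Made-up counts for illustration; these are not the actual results.
freq <- c(cost = 120, customer = 41, budget = 30, vendor = 19, misc = 5)

# Bin edges matching the segments in this post: 17-24, 25-34, 35-49, 50+.
# right = FALSE makes each bin closed on the left, e.g. [35, 50).
segment <- cut(
  freq,
  breaks = c(17, 25, 35, 50, Inf),
  labels = c("17-24", "25-34", "35-49", "50+"),
  right  = FALSE
)

# Words with fewer than 17 mentions fall outside every bin (NA)
# and are dropped when the names are grouped by segment.
split(names(freq), segment)
```

Using half-open bins keeps the segments from overlapping, so a word with exactly 50 mentions lands in "50+" and nowhere else.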
What It Means
Here are my takeaways:
- 50+ mentions segment: this is a cost-savings program, so it's pretty normal to see words like cost, department, service, review, and contract at the top of the list.
- 35-49 mentions segment: it's clear that operations are a priority. There were a lot of people- and performance-centric terms, like customer, training, program, staff, and data.
- 25-34 mentions segment: this segment skews toward strategy and planning, with terms like plan, process, information, model, performance, budget, and business. There was also a bit of IT jargon, such as SharePoint and ILMS (integrated land management system).
- 17-24 mentions segment: the words that jumped out here were spending and vendor. I'm pretty surprised those terms weren't more popular.
When the program ends, I will run another analysis that compares these findings to the final results.