Text Mining the City of Houston Bright Ideas Program

Last year, we helped the City of Houston launch a cost-savings crowdsourcing program called Bright Ideas.

I was reading through some of the responses, and I got curious about the terminology, priorities, and general language used by City of Houston employees.

So on April 9, 2016, I grabbed a database dump of the Bright Ideas website and followed a modified version of these excellent text mining instructions. Here’s the GitHub repo.

Preparing the Data for Analysis

I used RStudio to prep and analyze the data. I started by removing punctuation throughout the document:

docs <- tm_map(docs, removePunctuation)

Then I converted the text to lowercase:

docs <- tm_map(docs, content_transformer(tolower))

Originally, I tried stemming the text, but stemming lost some important details along the way. So I eyeballed the most common variants and normalized the obvious ones by hand:

for (j in seq(docs)) {
  # Replace tabs with spaces
  docs[[j]] <- gsub("\t", " ", docs[[j]])
  # Join multi-word terms so they are counted as a single token
  docs[[j]] <- gsub("customer service", "customer_service", docs[[j]])
  docs[[j]] <- gsub("super bowl", "superbowl", docs[[j]])
  # Collapse every reference to the city into one token
  docs[[j]] <- gsub("city of houston", "COH", docs[[j]])
  docs[[j]] <- gsub("city", "COH", docs[[j]])
  docs[[j]] <- gsub("COHs", "COH", docs[[j]])
  # Fold plurals and variants into a single form
  docs[[j]] <- gsub("costs", "cost", docs[[j]])
  docs[[j]] <- gsub("liabilities", "liability", docs[[j]])
  docs[[j]] <- gsub("employees", "employee", docs[[j]])
  docs[[j]] <- gsub("departments", "department", docs[[j]])
  docs[[j]] <- gsub("departmental", "department", docs[[j]])
  docs[[j]] <- gsub("services", "service", docs[[j]])
}
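One caveat with plain substring patterns like the ones above: a pattern such as "city" also matches inside longer words. The sketch below is not part of the original analysis. It uses a made-up sentence to show the difference between a plain match and a word-boundary match in base R:

```r
# Hypothetical sample sentence, not taken from the Bright Ideas data
text <- "the city should expand capacity for electricity audits"

# Plain substring match: also rewrites the "city" inside longer words
naive <- gsub("city", "COH", text)

# Word-boundary match: only the standalone word "city" is replaced
bounded <- gsub("\\bcity\\b", "COH", text, perl = TRUE)

print(naive)
print(bounded)
```

Whether this matters depends on how often words like "capacity" show up in the corpus, so it is worth a quick check before trusting the counts for COH.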

Finally, I removed all the stop words and stripped extra whitespace from the data:

docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
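Once the corpus is clean, the word counts can be tallied. Here is a minimal sketch of one common way to do that with the tm package, assuming a document-term matrix is used. The three sample texts are hypothetical stand-ins for the real Bright Ideas submissions, and the variable names are mine:

```r
library(tm)

# Hypothetical stand-in for the cleaned Bright Ideas text
texts <- c("reduce cost of service contract",
           "review department cost",
           "department service review")
docs <- VCorpus(VectorSource(texts))

# One row per document, one column per term
# (by default, terms shorter than three characters are dropped)
dtm <- DocumentTermMatrix(docs)

# Sum each term's column to get its total count, most frequent first
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

# Show the 15 most frequent terms
head(freq, 15)
```

A named vector like freq is all a bar chart of the most popular words needs.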

Here are the results:

Most popular words by word count.

After the top 15, individual word popularity levels off:

Top 100 words from Bright Ideas.

Here's a closer look at some of the subsegments:

Words occurring 35-49 times.

Words occurring between 25-34 times.

Words occurring 17-24 times.
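For anyone reproducing the segments, the ranges need to be inclusive and non-overlapping so that no word lands in two charts. Here is a rough base R sketch using cut(). The counts are invented for illustration and are not the actual Bright Ideas numbers:

```r
# Hypothetical term counts; the real values come from the corpus
freq <- c(cost = 80, customer = 49, training = 35, plan = 34,
          budget = 25, vendor = 24, spending = 17, misc = 5)

# Left-closed intervals: [17,25), [25,35), [35,50), [50,Inf)
segment <- cut(freq,
               breaks = c(17, 25, 35, 50, Inf),
               labels = c("17-24", "25-34", "35-49", "50+"),
               right = FALSE)

# Group the terms by segment; counts below 17 fall outside every bucket
by_segment <- split(names(freq), segment)
print(by_segment)
```

Setting right = FALSE makes each lower bound inclusive, so a word with exactly 35 mentions lands in the 35-49 group rather than the 25-34 group.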

What It Means

Here are my takeaways:

  • 50+ mentions segment: this is a cost-savings program, so it's no surprise to see words like cost, department, service, review, and contract at the top of the list.
  • 35-49 mentions segment: it's clear that operations are a priority. There were a lot of people- and performance-centric terms, like customer, training, program, staff, and data.
  • 25-34 mentions segment: this one skews towards strategy and planning, with terms like plan, process, information, model, performance, budget, and business. There was also a bit of IT jargon, such as SharePoint and ILMS (integrated land management system).
  • 17-24 mentions segment: the words that jumped out here were spending and vendor. I was surprised those terms didn't appear more often.

When the program ends, I will run another analysis that compares these findings to the final results.

Jeff Reichman

Jeff is passionate about data. He founded January Advisors and serves on the boards of two Houston nonprofits. Read his full bio on LinkedIn.