The past few months of work-life have been constantly busy with survey design and analysis. COVID has really changed the landscape for just about everything, and leaders of projects and departments want and need to implement change quickly, and survey research is helping to inform speedy decision making.

I’m here to demo the first R package that I’ve ever written that has been helping me quickly clean and tidy data from SurveyMonkey to prepare for analysis and reporting. This is also a continuation of my previous post about tidying the SurveyMonkey Double Header.

Background Info

doubleheadr will be useful if you:
* Download or inherit data from SurveyMonkey as .xlsx or .csv
* Commonly find yourself struggling to quickly tidy the header and sub-header, and intuitively rename columns
* Are fed up with your current process for the two steps above and are looking for a new approach

Getting survey data ready for analysis shouldn’t take more than a few lines of code (if any!), and that has been my personal objective ever since I started using R for survey analysis, reporting, and dashboard creation.

As a quick side note, if you have a SurveyMonkey account, you should absolutely check out Sam Firke’s surveymonkey package, which makes it super easy to pull data from SurveyMonkey’s API. One limitation is that the API is limited to 500 calls per day, (meaning you will max out with any survey having over 50k responses). It’s also common for analysts and data wranglers to inherit response files from other departments or clients, and that’s where doubleheadr can help out.

`doubleheadr` in Action

You’ll need devtools (to run install_github()), tidyverse, and janitor (also by Sam Firke) to run doubleheadr.

Step 1: install packages

I’ll assume you are familiar with installing from GitHub, and I’m working on being more concise.

library(devtools)
install_github('mattroumaya/doubleheadr')

Step 2: check out the demo file

library(tidyverse)
library(doubleheadr)

demo <- doubleheadr::demo

doubleheadr comes with a built-in demo data set, which helps illustrate the two functions available: clean_headr and trim_headr. The data set demo mimics a .xlsx export from SurveyMonkey and looks like this:

Respondent ID	Please provide your contact information:	...3	...4	...5	...6	...7	...8	...9	...10	...11	I wish it would have snowed more this winter.
NA	Name	Company	Address	Address 2	City/Town	State/Province	ZIP/Postal Code	Country	Email Address	Phone Number	Response
11385284375	Benjamin Franklin	Poor Richard's	NA	NA	Philadelphia	PA	19104	NA	benjamins@gmail.com	215-555-4444	Strongly disagree
11385273621	Mae Jemison	NASA	NA	NA	Decatur	Alabama	20104	NA	mjemison@nasa.gov	221-134-4646	Strongly agree
11385258069	Carl Sagan	Smithsonian	NA	NA	Washington	D.C.	33321	NA	stargazer@gmail.com	999-999-4422	Neither agree nor disagree

As analyzers of this data, we will want to paste the column names and values in the first row together. It’s a bit more complicated than that, which is detailed in my previous post, but we’re being concise here.

Step 3: `clean_headr`

We can call clean_headr on our data.frame or tibble object, which takes three arguments:
* dat: a data.frame object (in this case, inherited from LHS)
* rep_val: the repeated value as a character string. Our column names have a repeated value of ‘…’ (or ‘..’, or ‘.’)
* clean_names: this is a janitor function that will convert all column names to snake_case and will strip out any non-alphanumeric characters.

demo %>% 
  clean_headr(rep_val = '...', clean_names = TRUE) %>% 
  colnames()

 [1] "respondent_id"                                          
 [2] "please_provide_your_contact_information_name"           
 [3] "please_provide_your_contact_information_company"        
 [4] "please_provide_your_contact_information_address"        
 [5] "please_provide_your_contact_information_address_2"      
 [6] "please_provide_your_contact_information_city_town"      
 [7] "please_provide_your_contact_information_state_province" 
 [8] "please_provide_your_contact_information_zip_postal_code"
 [9] "please_provide_your_contact_information_country"        
[10] "please_provide_your_contact_information_email_address"  
[11] "please_provide_your_contact_information_phone_number"   
[12] "i_wish_it_would_have_snowed_more_this_winter_response"

We can compare this with clean_names = FALSE, which creates column names that are very similar to the actual survey questions that we’ve asked.

demo %>% 
  clean_headr(rep_val = '...', clean_names = FALSE) %>% 
  colnames()

 [1] "Respondent ID"                                           
 [2] "Please provide your contact information: Name"           
 [3] "Please provide your contact information: Company"        
 [4] "Please provide your contact information: Address"        
 [5] "Please provide your contact information: Address 2"      
 [6] "Please provide your contact information: City/Town"      
 [7] "Please provide your contact information: State/Province" 
 [8] "Please provide your contact information: ZIP/Postal Code"
 [9] "Please provide your contact information: Country"        
[10] "Please provide your contact information: Email Address"  
[11] "Please provide your contact information: Phone Number"   
[12] "I wish it would have snowed more this winter. Response"

Step 4: `trim_headr`

Whether we use clean_names or not, we will most likely want to shorten some of the column names so that our code is more legible. The demo data here is pretty conservative - if you have a verbose survey question, you can imagine how long your column names could be!

trim_headr makes it easy to shorten column names, and is really just composed of a concatenated gsub call on all of the column names. Let’s say we do use clean_names and we want to remove “please_provide_your_contact_” from our column names, so that we will still be left with shorter, descriptive names like information_name, information_company, etc. Let’s also shorten our last column name to snowed_more_this_winter.

demo %>% 
  clean_headr(rep_val = '...') %>% 
  trim_headr(c('please_provide_your_contact_', 'i_wish_it_would_have_', '_response')) %>% 
  colnames()

 [1] "respondent_id"               "information_name"           
 [3] "information_company"         "information_address"        
 [5] "information_address_2"       "information_city_town"      
 [7] "information_state_province"  "information_zip_postal_code"
 [9] "information_country"         "information_email_address"  
[11] "information_phone_number"    "snowed_more_this_winter"

Easy! Depending on the number of survey questions, getting to this point could take a really long time by first manually renaming columns and then deleting the first row. I’ve found this workflow to be more intuitive, and leaves me more time to focus on the data and creating effective dashboards and reports.

If you happen to try out doubleheadr and find it helpful, or more likely, find some bugs or inefficiencies, I would love to hear about it!

Background Info

doubleheadr in Action

Step 1: install packages

Step 2: check out the demo file

Step 3: clean_headr

Step 4: trim_headr

`doubleheadr` in Action

Step 3: `clean_headr`

Step 4: `trim_headr`