library(devtools)
install_github('mattroumaya/doubleheadr')
The past few months of work-life have been constantly busy with survey design and analysis. COVID has really changed the landscape for just about everything, and leaders of projects and departments want and need to implement change quickly, and survey research is helping to inform speedy decision making.
I’m here to demo the first R package that I’ve ever written that has been helping me quickly clean and tidy data from SurveyMonkey to prepare for analysis and reporting. This is also a continuation of my previous post about tidying the SurveyMonkey Double Header.
Background Info
doubleheadr
will be useful if you:
* Download or inherit data from SurveyMonkey as .xlsx
or .csv
* Commonly find yourself struggling to quickly tidy the header and sub-header, and intuitively rename columns
* Are fed up with your current process for the two steps above and are looking for a new approach
Getting survey data ready for analysis shouldn’t take more than a few lines of code (if any!), and that has been my personal objective ever since I started using R for survey analysis, reporting, and dashboard creation.
As a quick side note, if you have a SurveyMonkey account, you should absolutely check out Sam Firke’s surveymonkey
package, which makes it super easy to pull data from SurveyMonkey’s API. One limitation is that the API is limited to 500 calls per day, (meaning you will max out with any survey having over 50k responses). It’s also common for analysts and data wranglers to inherit response files from other departments or clients, and that’s where doubleheadr
can help out.
doubleheadr
in Action
You’ll need devtools
(to run install_github()
), tidyverse
, and janitor
(also by Sam Firke) to run doubleheadr
.
Step 1: install packages
I’ll assume you are familiar with installing from GitHub, and I’m working on being more concise.
Step 2: check out the demo file
library(tidyverse)
library(doubleheadr)
<- doubleheadr::demo demo
doubleheadr
comes with a built-in demo data set, which helps illustrate the two functions available: clean_headr
and trim_headr
. The data set demo
mimics a .xlsx
export from SurveyMonkey and looks like this:
Respondent ID | Please provide your contact information: | ...3 | ...4 | ...5 | ...6 | ...7 | ...8 | ...9 | ...10 | ...11 | I wish it would have snowed more this winter. |
---|---|---|---|---|---|---|---|---|---|---|---|
NA | Name | Company | Address | Address 2 | City/Town | State/Province | ZIP/Postal Code | Country | Email Address | Phone Number | Response |
11385284375 | Benjamin Franklin | Poor Richard's | NA | NA | Philadelphia | PA | 19104 | NA | benjamins@gmail.com | 215-555-4444 | Strongly disagree |
11385273621 | Mae Jemison | NASA | NA | NA | Decatur | Alabama | 20104 | NA | mjemison@nasa.gov | 221-134-4646 | Strongly agree |
11385258069 | Carl Sagan | Smithsonian | NA | NA | Washington | D.C. | 33321 | NA | stargazer@gmail.com | 999-999-4422 | Neither agree nor disagree |
As analyzers of this data, we will want to paste the column names and values in the first row together. It’s a bit more complicated than that, which is detailed in my previous post, but we’re being concise here.
Step 3: clean_headr
We can call clean_headr
on our data.frame
or tibble
object, which takes three arguments:
* dat
: a data.frame
object (in this case, inherited from LHS)
* rep_val
: the repeated value as a character string. Our column names have a repeated value of ‘…’ (or ‘..’, or ‘.’)
* clean_names
: this is a janitor
function that will convert all column names to snake_case and will strip out any non-alphanumeric characters.
%>%
demo clean_headr(rep_val = '...', clean_names = TRUE) %>%
colnames()
[1] "respondent_id"
[2] "please_provide_your_contact_information_name"
[3] "please_provide_your_contact_information_company"
[4] "please_provide_your_contact_information_address"
[5] "please_provide_your_contact_information_address_2"
[6] "please_provide_your_contact_information_city_town"
[7] "please_provide_your_contact_information_state_province"
[8] "please_provide_your_contact_information_zip_postal_code"
[9] "please_provide_your_contact_information_country"
[10] "please_provide_your_contact_information_email_address"
[11] "please_provide_your_contact_information_phone_number"
[12] "i_wish_it_would_have_snowed_more_this_winter_response"
We can compare this with clean_names = FALSE
, which creates column names that are very similar to the actual survey questions that we’ve asked.
%>%
demo clean_headr(rep_val = '...', clean_names = FALSE) %>%
colnames()
[1] "Respondent ID"
[2] "Please provide your contact information: Name"
[3] "Please provide your contact information: Company"
[4] "Please provide your contact information: Address"
[5] "Please provide your contact information: Address 2"
[6] "Please provide your contact information: City/Town"
[7] "Please provide your contact information: State/Province"
[8] "Please provide your contact information: ZIP/Postal Code"
[9] "Please provide your contact information: Country"
[10] "Please provide your contact information: Email Address"
[11] "Please provide your contact information: Phone Number"
[12] "I wish it would have snowed more this winter. Response"
Step 4: trim_headr
Whether we use clean_names
or not, we will most likely want to shorten some of the column names so that our code is more legible. The demo data here is pretty conservative - if you have a verbose survey question, you can imagine how long your column names could be!
trim_headr
makes it easy to shorten column names, and is really just composed of a concatenated gsub
call on all of the column names. Let’s say we do use clean_names
and we want to remove “please_provide_your_contact_” from our column names, so that we will still be left with shorter, descriptive names like information_name
, information_company
, etc. Let’s also shorten our last column name to snowed_more_this_winter
.
%>%
demo clean_headr(rep_val = '...') %>%
trim_headr(c('please_provide_your_contact_', 'i_wish_it_would_have_', '_response')) %>%
colnames()
[1] "respondent_id" "information_name"
[3] "information_company" "information_address"
[5] "information_address_2" "information_city_town"
[7] "information_state_province" "information_zip_postal_code"
[9] "information_country" "information_email_address"
[11] "information_phone_number" "snowed_more_this_winter"
Easy! Depending on the number of survey questions, getting to this point could take a really long time by first manually renaming columns and then deleting the first row. I’ve found this workflow to be more intuitive, and leaves me more time to focus on the data and creating effective dashboards and reports.
If you happen to try out doubleheadr
and find it helpful, or more likely, find some bugs or inefficiencies, I would love to hear about it!