Python Package: survey_tools
Convenient tools for survey research data analysis

Introduction
Having worked with survey data for over 6 years, I wanted to put a few simple tools into a Python package others could use. Some of these tools were inspired by functions in R, some by software at previous places of employment, and some exist to smooth over my personal pain points in survey data analysis.
On 15 August 2024, I published the first official release of survey_tools on GitHub [here] and PyPI [here]. Stars, feedback, and contributions are welcome on GitHub!
Quick Vignette
Installation
First, let’s install the package! In your terminal, you can install survey_tools through pip.
python -m pip install survey_tools
Import Packages + Load In Data
For this walkthrough, I’ll load a publicly available survey data set that I have worked on in the past: the American Family Survey.
import pandas as pd
import numpy as np
import survey_tools as st
link = 'https://csed.byu.edu/00000183-a4c5-d2da-abe3-feed7be30001/2021data'
data = pd.read_stata(link)
print(data.shape)
data.head()
(3000, 413)
| | caseid | weight | PAR006_treat | FAMTAX007_treat | s21_MSC001 | s21_MSC003 | s21_MSC003_b_1 | s21_MSC003_b_2 | s21_MSC003_b_3 | s21_MSC003_c | ... | votereg | ideo5 | newsint | religpew | pew_churatd | pew_bornagain | pew_religimp | pew_prayer | starttime | endtime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1492039695 | 0.698217 | Show rows: The coronavirus pandemic and Racial... | Treatment 1 ("pull parents away") | Not currently in a committed relationship | NaN | not selected | not selected | selected | NaN | ... | Yes | Very liberal | Most of the time | Protestant | Once a week | No | Very important | Once a day | 1.940257e+12 | 1.940258e+12 |
| 1 | 1492042119 | 1.195809 | Show rows: The coronavirus pandemic and Racial... | Treatment 2 ("encourage poverty") | Not currently in a committed relationship | NaN | selected | not selected | not selected | 2005 | ... | Yes | Conservative | Most of the time | Roman Catholic | Never | No | Somewhat important | Seldom | 1.940257e+12 | 1.940258e+12 |
| 2 | 1492870805 | 1.155043 | Show rows: The coronavirus pandemic and Racial... | Control | Married | 7 years | not selected | not selected | selected | NaN | ... | Yes | Moderate | Don't know | Nothing in particular | Never | No | Not at all important | Never | 1.940258e+12 | 1.940258e+12 |
| 3 | 1492850287 | 0.771161 | No extra rows on PAR006 | Treatment 2 ("encourage poverty") | Not currently in a committed relationship | NaN | not selected | selected | not selected | NaN | ... | Yes | Moderate | Most of the time | Roman Catholic | Seldom | No | Somewhat important | A few times a week | 1.940257e+12 | 1.940258e+12 |
| 4 | 1492863669 | 0.810394 | No extra rows on PAR006 | Treatment 1 ("pull parents away") | Currently in a committed relationship, but not... | NaN | selected | not selected | not selected | 2005 | ... | Don't know | Conservative | Some of the time | Nothing in particular | Seldom | No | Not at all important | A few times a week | 1.940257e+12 | 1.940258e+12 |
5 rows × 413 columns
get_names Function
Let’s try to find the education demographic variable.
st.get_names(data, r'[Ee][Dd]')
['s21_EMP005_fed',
's21_ED002',
's21_ED004',
's21_ED004_a',
's21_ED005_1',
's21_ED005_2',
's21_ED005_3',
's21_ED005_4',
's21_ED006_1',
's21_ED006_2',
's21_ED006_3',
's21_ED006_4',
's21_ED006_5',
's21_ED006_6',
's21_ED006_7',
's21_ED006_8',
's21_ED006_9',
's21_ED006_10',
's21_ED007_1',
's21_ED007_2',
's21_ED007_3',
's21_ED007_4',
's21_ED008',
's21_ED009',
's21_ED010_1',
's21_ED010_2',
's21_ED010_3',
's21_ED010_4',
's21_ED011_1',
's21_ED011_2',
's21_ED011_3',
's21_ED011_4',
'educ']
Looks like we have several variables containing ED, but there is an educ variable, which is likely what we’re after. The get_names function is useful for selecting groups of variables by regex for easier data manipulation.
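For readers who want the same kind of selection in plain pandas, here is a minimal sketch. It is not the package’s implementation: the helper name match_columns and the toy frame are hypothetical, and only the column names are borrowed from the output above.

```python
import re

import pandas as pd


def match_columns(df: pd.DataFrame, pattern: str) -> list[str]:
    """Return the column names of df that contain a match for the regex."""
    regex = re.compile(pattern)
    return [col for col in df.columns if regex.search(col)]


# Hypothetical empty frame that reuses a few column names from the survey.
toy = pd.DataFrame(columns=["caseid", "weight", "s21_ED002", "educ", "newsint"])

ed_cols = match_columns(toy, r"[Ee][Dd]")
print(ed_cols)
```

The resulting list can be used to subset the frame, e.g. toy[ed_cols]. pandas also offers DataFrame.filter(regex=...), which returns the matching columns directly.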
tabs Function for 1-way Summary
Let’s run a quick summary of this variable.
st.tabs(data, 'educ', dropna=False)
No HS 137
High school graduate 924
Some college 619
2-year 334
4-year 615
Post-grad 371
NaN 0
dtype: int64
This looks like what we are interested in. Because we specified dropna=False, NaN is included at the bottom of the table, showing that there are no missing values for this variable.
Note how we can also specify weights with the wts argument like below.
st.tabs(data, 'educ', dropna=False, wts="weight")
No HS 238.012949
High school graduate 929.096289
Some college 597.860835
2-year 320.430679
4-year 575.119248
Post-grad 339.480000
NaN 0.000000
dtype: float64
We can now see these as weighted counts. To see percentages instead, set the display argument. In this case, we want to normalize by column.
st.tabs(data, 'educ', dropna=False, wts="weight", display='column')
No HS 7.9
High school graduate 31.0
Some college 19.9
2-year 10.7
4-year 19.2
Post-grad 11.3
NaN 0.0
dtype: float64
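To make the arithmetic behind weighted counts and column percentages concrete, here is a small plain-pandas sketch. The responses and weights are made up for illustration, and this is an assumption about what the numbers mean, not a copy of how tabs computes them.

```python
import pandas as pd

# Made-up responses and weights, purely for illustration.
toy = pd.DataFrame({
    "educ": ["4-year", "No HS", "4-year", "Some college"],
    "weight": [1.0, 2.0, 0.5, 0.5],
})

# Weighted count: the sum of the weights within each category.
weighted_counts = toy.groupby("educ")["weight"].sum()

# Column percentage: each weighted count as a share of the total weight.
weighted_pct = (100 * weighted_counts / weighted_counts.sum()).round(1)

print(weighted_counts)
print(weighted_pct)
```

A respondent with weight 2.0 counts twice as much as one with weight 1.0, which is why "No HS" ends up with half of the total weight here despite being a single row.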
recode Function
I want to collapse these groups into two categories: No Bachelor’s Degree vs. Bachelor’s Degree or Higher.
data['educ_rc'] = st.recode(
    data,
    'educ',
    '"No HS"=0;'
    '"High school graduate"=0;'
    '"Some college"=0;'
    '"2-year"=0;'
    '"4-year"=1;'
    '"Post-grad"=1;'
)
st.tabs(data, 'educ_rc')
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/survey_tools.py:266: UserWarning:
Column dtype changed from CATEGORY to CATEGORY.
0 2014
1 986
dtype: int64
Typically, the survey data I work with stores answers as numeric codes rather than strings. The recode function is much more convenient in that case (see below).
data['educ_numbers'] = data.educ.cat.codes
data['educ_rc'] = st.recode(
    data,
    'educ_numbers',
    '0:3="No B";4:5="B+"'
)
st.tabs(data, 'educ_rc')
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/survey_tools.py:266: UserWarning:
Column dtype changed from INT8 to OBJECT.
No B 2014
B+ 986
dtype: int64
The recode function has a few special keywords. For example, lo stands for the lowest number (e.g. lo:10=1), and hi works the opposite way. NaN is also a keyword, for selecting or setting missing values in your variables (e.g. NaN=3).
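For comparison, a range recode like the one above can be approximated in plain pandas with pd.cut. This is only a sketch for integer codes, not how recode works internally; the codes are made up, and the open-ended lower edge is my stand-in for the lo keyword.

```python
import numpy as np
import pandas as pd

# Made-up numeric codes 0-5, standing in for data.educ.cat.codes.
codes = pd.Series([0, 3, 4, 5, 2])

# Bins are right-inclusive: (-inf, 3] -> "No B", (3, 5] -> "B+".
# The -inf lower edge plays the role that `lo` plays in a recode spec.
collapsed = pd.cut(codes, bins=[-np.inf, 3, 5], labels=["No B", "B+"])

print(collapsed.tolist())
```

Values outside every bin (for example a code of 6) come back as NaN, so it is worth tabulating the result with missing values shown before relying on it.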
I’ll quickly recode another variable, so we can compare.
data['newsint'] = data.newsint.cat.codes
data['newsint_rc'] = st.recode(data, 'newsint', "0='High';1:5='Low'")
st.tabs(data, 'newsint_rc')
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/survey_tools.py:266: UserWarning:
Column dtype changed from INT8 to OBJECT.
High 1510
Low 1490
dtype: int64
tabs Function for 2-way Tabulation
Let’s look at a crosstab break of news interest by education.
st.tabs(data, 'newsint_rc', 'educ_rc', wts="weight", display='column')
| | No B | B+ |
|---|---|---|
| High | 42.7 | 61.8 |
| Low | 57.3 | 38.2 |
From this, we can see those with a bachelor’s degree or higher are more interested in the news than those without.
Using the display argument, we could also summarize by row or cell.
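The difference between column, row, and cell percentages is easiest to see side by side. The sketch below uses pandas’ own pd.crosstab on made-up data rather than st.tabs, since this post only demonstrates display='column'; the helper name weighted_crosstab is hypothetical.

```python
import pandas as pd

# Made-up respondents and weights, purely for illustration.
toy = pd.DataFrame({
    "newsint_rc": ["High", "Low", "High", "Low"],
    "educ_rc": ["B+", "B+", "No B", "No B"],
    "weight": [2.0, 2.0, 1.0, 3.0],
})


def weighted_crosstab(df: pd.DataFrame, normalize: str) -> pd.DataFrame:
    """Weighted two-way table in percentages, normalized as requested."""
    table = pd.crosstab(
        df["newsint_rc"],
        df["educ_rc"],
        values=df["weight"],
        aggfunc="sum",
        normalize=normalize,
    )
    return (100 * table).round(1)


by_column = weighted_crosstab(toy, "columns")  # each column sums to 100
by_row = weighted_crosstab(toy, "index")       # each row sums to 100
by_cell = weighted_crosstab(toy, "all")        # the whole table sums to 100

print(by_column, by_row, by_cell, sep="\n\n")
```

Column percentages answer "within each education group, how many are high-interest?", row percentages answer "within each interest group, how many have a degree?", and cell percentages give each combination’s share of the whole sample.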
rake_weight Function
This survey data is already weighted in the weight column we’ve been using, but if your data isn’t already weighted, use the rake_weight function.
I’ll show a simple example below.
data['age'] = 2021 - data.birthyr
data['age_rc'] = st.recode(data, 'age', '0:30=1;31:45=2;46:65=3;66:120=4')
st.tabs(data, 'age_rc', display='column')
3 35.1
2 26.8
4 19.6
1 18.6
dtype: float64
data['gender_rc'] = st.recode(data, 'gender', '"Male"=1;"Female"=2')
st.tabs(data, 'gender_rc', display='column')
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/survey_tools.py:266: UserWarning:
Column dtype changed from CATEGORY to CATEGORY.
1 46.8
2 53.2
dtype: float64
Above you can see the unweighted tabulations of age and gender. Let’s weight them now.
true_props = pd.DataFrame({
    'Names': ['gender', 'gender', 'age_rc', 'age_rc', 'age_rc', 'age_rc'],
    'Levels': ['Male', 'Female', 1, 2, 3, 4],
    'Proportions': [0.5, 0.5, 0.2, 0.25, 0.35, 0.2],
})
data_w_new_wts = st.rake_weight(data, true_props, weight_nm='new_weight')
Variable: gender
Male 50.0
Female 50.0
dtype: float64
Variable: age_rc
3 35.0
2 25.0
4 20.0
1 20.0
dtype: float64
Iterations: 1
Max Weight: 1.1487352180792596
Min Weight: 0.876682464644851
The rake_weight function outputs a few statistics: max weight, min weight, and iterations. It also prints the weighted tabs so you can see how well your weights match the targets. You can disable this summary by setting qa=False.
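If you are curious what raking does, the sketch below implements the general idea (iterative proportional fitting) in plain pandas. It is my own simplified illustration, not the code inside rake_weight: the function name rake, its arguments, the targets dictionary format, and the toy data are all assumptions.

```python
import pandas as pd


def rake(df, targets, max_iter=50, tol=1e-6):
    """Iterative proportional fitting, one weight per respondent.

    targets maps a column name to {level: target proportion}.
    Returns the weights (mean 1) and the number of passes used.
    """
    weights = pd.Series(1.0, index=df.index)
    for iteration in range(1, max_iter + 1):
        biggest_change = 0.0
        for col, props in targets.items():
            # Current weighted share of each level of this variable.
            shares = weights.groupby(df[col]).sum() / weights.sum()
            # Scale every respondent by target share / current share.
            factors = df[col].map(pd.Series(props) / shares)
            biggest_change = max(biggest_change, (factors - 1).abs().max())
            weights = weights * factors
        # Stop once a full pass barely moves any weight.
        if biggest_change < tol:
            break
    return weights * len(weights) / weights.sum(), iteration


# Made-up sample that is too female and too old relative to the targets.
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female",
               "Female", "Male", "Female", "Female"],
    "age_grp": ["young", "old", "young", "old",
                "old", "young", "old", "young"],
})
targets = {
    "gender": {"Male": 0.5, "Female": 0.5},
    "age_grp": {"young": 0.4, "old": 0.6},
}

new_weights, n_iter = rake(toy, targets)
gender_share = new_weights.groupby(toy["gender"]).sum() / new_weights.sum()
age_share = new_weights.groupby(toy["age_grp"]).sum() / new_weights.sum()
print(gender_share, age_share, n_iter, sep="\n")
```

Each pass adjusts one variable at a time, which slightly disturbs the others, so the loop repeats until the adjustments become negligible. Checking the max and min weight afterwards, as the summary above does, is a quick way to spot respondents who carry an outsized share of the weight.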
As mentioned above, stars, feedback, and contributions are welcome for the survey_tools package!