Python Package: `survey_tools`

Convenient tools for survey research data analysis

Python

Software

Author

Alex Bass

Published

August 15, 2024

Introduction

Having worked with survey data for over 6 years, I wanted to put a few simple tools into a Python package that others could use. Some of these tools were inspired by functions in R, some by software I used at previous places of employment, and some were built to smooth over my personal pain points in survey data analysis.

On 15 August 2024, I published the first official release of survey_tools on GitHub [here] and PyPI [here]. Stars, feedback, and contributions are welcome on GitHub!

Quick Vignette

Installation

First, let’s install the package! From your terminal, install survey_tools with pip.

python -m pip install survey_tools

Import Packages + Load In Data

For our purposes here, I’ll load a survey data set that I have worked on in the past: the American Family Survey, which is publicly available.

import pandas as pd
import numpy as np
import survey_tools as st

link = 'https://csed.byu.edu/00000183-a4c5-d2da-abe3-feed7be30001/2021data'
data = pd.read_stata(link)
print(data.shape)
data.head()

(3000, 413)

	caseid	weight	PAR006_treat	FAMTAX007_treat	s21_MSC001	s21_MSC003	s21_MSC003_b_1	s21_MSC003_b_2	s21_MSC003_b_3	s21_MSC003_c	...	votereg	ideo5	newsint	religpew	pew_churatd	pew_bornagain	pew_religimp	pew_prayer	starttime	endtime
0	1492039695	0.698217	Show rows: The coronavirus pandemic and Racial...	Treatment 1 ("pull parents away")	Not currently in a committed relationship	NaN	not selected	not selected	selected	NaN	...	Yes	Very liberal	Most of the time	Protestant	Once a week	No	Very important	Once a day	1.940257e+12	1.940258e+12
1	1492042119	1.195809	Show rows: The coronavirus pandemic and Racial...	Treatment 2 ("encourage poverty")	Not currently in a committed relationship	NaN	selected	not selected	not selected	2005	...	Yes	Conservative	Most of the time	Roman Catholic	Never	No	Somewhat important	Seldom	1.940257e+12	1.940258e+12
2	1492870805	1.155043	Show rows: The coronavirus pandemic and Racial...	Control	Married	7 years	not selected	not selected	selected	NaN	...	Yes	Moderate	Don't know	Nothing in particular	Never	No	Not at all important	Never	1.940258e+12	1.940258e+12
3	1492850287	0.771161	No extra rows on PAR006	Treatment 2 ("encourage poverty")	Not currently in a committed relationship	NaN	not selected	selected	not selected	NaN	...	Yes	Moderate	Most of the time	Roman Catholic	Seldom	No	Somewhat important	A few times a week	1.940257e+12	1.940258e+12
4	1492863669	0.810394	No extra rows on PAR006	Treatment 1 ("pull parents away")	Currently in a committed relationship, but not...	NaN	selected	not selected	not selected	2005	...	Don't know	Conservative	Some of the time	Nothing in particular	Seldom	No	Not at all important	A few times a week	1.940257e+12	1.940258e+12

5 rows × 413 columns

`get_names` Function

Let’s try to find the education demographic variable.

st.get_names(data, r'[Ee][Dd]')

['s21_EMP005_fed',
 's21_ED002',
 's21_ED004',
 's21_ED004_a',
 's21_ED005_1',
 's21_ED005_2',
 's21_ED005_3',
 's21_ED005_4',
 's21_ED006_1',
 's21_ED006_2',
 's21_ED006_3',
 's21_ED006_4',
 's21_ED006_5',
 's21_ED006_6',
 's21_ED006_7',
 's21_ED006_8',
 's21_ED006_9',
 's21_ED006_10',
 's21_ED007_1',
 's21_ED007_2',
 's21_ED007_3',
 's21_ED007_4',
 's21_ED008',
 's21_ED009',
 's21_ED010_1',
 's21_ED010_2',
 's21_ED010_3',
 's21_ED010_4',
 's21_ED011_1',
 's21_ED011_2',
 's21_ED011_3',
 's21_ED011_4',
 'educ']

Looks like we have several variables with ED, but there is an educ variable which is likely what we’re after. This function is useful for selecting groups of variables based on regex for easier manipulation of data.
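If you are curious what regex-based column selection looks like in plain pandas, here is a minimal sketch. The helper name `names_matching` and the toy columns are hypothetical, and this is not necessarily how `get_names` is implemented; it only illustrates the idea.

```python
import re

import pandas as pd


def names_matching(df: pd.DataFrame, pattern: str) -> list:
    """Return the column names that contain a match for the regex `pattern`."""
    rx = re.compile(pattern)
    return [col for col in df.columns if rx.search(col)]


# Toy frame with a handful of column names (hypothetical, no rows needed).
toy = pd.DataFrame(columns=['caseid', 's21_ED002', 's21_ED004', 'educ', 'weight'])

ed_cols = names_matching(toy, r'[Ee][Dd]')
print(ed_cols)
```

The returned list can be passed straight to `toy[ed_cols]` to pull out the whole group of variables at once.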

`tabs` Function for 1-way Summary

Let’s run a quick summary of this variable.

st.tabs(data, 'educ', dropna=False)

No HS                   137
High school graduate    924
Some college            619
2-year                  334
4-year                  615
Post-grad               371
NaN                       0
dtype: int64

This looks like what we are interested in. Note that NaN is included at the bottom of the table because we specified dropna=False, which shows that there are no missing values for this variable.

We can also specify weights with the wts argument, as below.

st.tabs(data, 'educ', dropna=False, wts="weight")

No HS                   238.012949
High school graduate    929.096289
Some college            597.860835
2-year                  320.430679
4-year                  575.119248
Post-grad               339.480000
NaN                       0.000000
dtype: float64

We can now see these as weighted counts. To see percentages instead, set the display argument. Here, display='column' normalizes by column, so the percentages sum to 100.

st.tabs(data, 'educ', dropna=False, wts="weight", display='column')

No HS                    7.9
High school graduate    31.0
Some college            19.9
2-year                  10.7
4-year                  19.2
Post-grad               11.3
NaN                      0.0
dtype: float64
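To make the arithmetic behind weighted counts and percentages concrete, here is a small pandas-only sketch on made-up data. It assumes a weighted count is simply the sum of weights within each category; the toy values are hypothetical and this is not the package's own code.

```python
import pandas as pd

# Hypothetical toy data: one categorical variable and one weight column.
toy = pd.DataFrame({
    'educ': ['HS', 'HS', 'College', 'College', 'College'],
    'weight': [1.5, 0.5, 1.0, 2.0, 1.0],
})

# Weighted counts: sum the weights within each category.
weighted_counts = toy.groupby('educ')['weight'].sum()

# Percentages: each category's share of the total weight.
weighted_pct = (100 * weighted_counts / weighted_counts.sum()).round(1)

print(weighted_counts)
print(weighted_pct)
```

With all weights equal to 1, the weighted counts reduce to ordinary frequencies, which is a handy sanity check.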

`recode` Function

I want to collapse these groups into two categories: No Bachelor’s Degree vs. Bachelor’s Degree or Higher.

data['educ_rc'] = st.recode(
    data,
    'educ',
    '"No HS"=0;'
    '"High school graduate"=0;'
    '"Some college"=0;'
    '"2-year"=0;'
    '"4-year"=1;'
    '"Post-grad"=1;'
)

st.tabs(data, 'educ_rc')

/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/survey_tools.py:266: UserWarning:

Column dtype changed from CATEGORY to CATEGORY.

0    2014
1     986
dtype: int64

Typically, the survey data I work with stores answers as numeric codes rather than strings. The recode function is much more convenient in that case, because a whole range of codes can be recoded at once (see below).

data['educ_numbers'] = data.educ.cat.codes
data['educ_rc'] = st.recode(
    data, 
    'educ_numbers',
    '0:3="No B";4:5="B+"'
)

st.tabs(data, 'educ_rc')

/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/survey_tools.py:266: UserWarning:

Column dtype changed from INT8 to OBJECT.

No B    2014
B+       986
dtype: int64

Tip

The recode function has a few special keywords. For example, lo stands for the lowest number (e.g. lo:10=1), and hi works the opposite way. NaN is also a keyword for selecting or setting missing values in your variables (e.g. NaN=3).
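The range idea behind these rules can be sketched in plain pandas. The helper `recode_ranges` below is hypothetical and does not parse the package's rule strings; it takes `(low, high, new_value)` tuples instead, and I use `None` as my own stand-in for the lo and hi keywords.

```python
import numpy as np
import pandas as pd


def recode_ranges(values: pd.Series, rules: list) -> pd.Series:
    """Apply (low, high, new_value) rules; both bounds are inclusive.

    A bound of None means "lowest" (for low) or "highest" (for high)
    observed value. Values matched by no rule stay missing.
    """
    out = pd.Series(np.nan, index=values.index, dtype='object')
    for low, high, new_value in rules:
        low = values.min() if low is None else low
        high = values.max() if high is None else high
        out[values.between(low, high)] = new_value
    return out


codes = pd.Series([0, 1, 2, 3, 4, 5])
recoded = recode_ranges(codes, [(None, 3, 'No B'), (4, None, 'B+')])
print(recoded.tolist())
```

Open-ended bounds are useful when you do not know the extremes of a variable in advance, such as a top-coded age or income field.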

I’ll quickly recode another variable, so we can compare.

data['newsint'] = data.newsint.cat.codes
data['newsint_rc'] = st.recode(data, 'newsint', "0='High';1:5='Low'")
st.tabs(data, 'newsint_rc')

/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/survey_tools.py:266: UserWarning:

Column dtype changed from INT8 to OBJECT.

High    1510
Low     1490
dtype: int64

`tabs` Function for 2-way Tabulation

Let’s look at a crosstab break of news interest by education.

st.tabs(data, 'newsint_rc', 'educ_rc', wts = "weight", display='column')

	No B	B+
High	42.7	61.8
Low	57.3	38.2

From this, we can see those with a bachelor’s degree or higher are more interested in the news than those without.

Tip

Using the display argument, we could also summarize by row or cell.
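The difference between column, row, and cell percentages is easy to mix up, so here is a pandas-only sketch on made-up data. The toy values are hypothetical, and this illustrates the three normalizations rather than the package's internals.

```python
import pandas as pd

# Hypothetical toy data: two categorical variables and a weight column.
toy = pd.DataFrame({
    'newsint': ['High', 'Low', 'High', 'Low', 'High'],
    'educ': ['B+', 'B+', 'No B', 'No B', 'No B'],
    'weight': [3.0, 1.0, 1.0, 2.0, 1.0],
})

# Weighted cell totals: rows are news interest, columns are education.
cells = toy.groupby(['newsint', 'educ'])['weight'].sum().unstack(fill_value=0)

# Column percentages: each column sums to 100.
col_pct = 100 * cells / cells.sum(axis=0)

# Row percentages: each row sums to 100.
row_pct = 100 * cells.div(cells.sum(axis=1), axis=0)

# Cell percentages: all cells together sum to 100.
cell_pct = 100 * cells / cells.values.sum()

print(col_pct.round(1))
```

Column percentages answer "within each education group, how is news interest distributed?", which is the comparison made in the table above.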

`rake_weight` Function

This survey data is already weighted by the weight column we’ve been using, but if your data isn’t already weighted, use the rake_weight function.

I’ll show a simple example below.

data['age'] = 2021 - data.birthyr
data['age_rc'] = st.recode(data, 'age', '0:30=1;31:45=2;46:65=3;66:120=4')
st.tabs(data, 'age_rc', display='column')

3    35.1
2    26.8
4    19.6
1    18.6
dtype: float64

data['gender_rc'] = st.recode(data, 'gender', '"Male"=1;"Female"=2')
st.tabs(data, 'gender_rc', display='column')

/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/survey_tools.py:266: UserWarning:

Column dtype changed from CATEGORY to CATEGORY.

1    46.8
2    53.2
dtype: float64

Above you can see the unweighted tabulations of age and gender. Now let’s build weights that match a set of target proportions for each variable.

true_props = pd.DataFrame({
    'Names':['gender','gender','age_rc','age_rc','age_rc','age_rc',],
    'Levels':['Male', 'Female',1,2,3,4],
    'Proportions':[0.5,0.5,0.2,0.25,0.35,0.2],
})

data_w_new_wts = st.rake_weight(data, true_props, weight_nm='new_weight')

Variable:  gender
Male      50.0
Female    50.0
dtype: float64
Variable:  age_rc
3    35.0
2    25.0
4    20.0
1    20.0
dtype: float64

            Iterations: 1
            Max Weight: 1.1487352180792596
            Min Weight: 0.876682464644851

The rake_weight function prints a few statistics: the number of iterations, the max weight, and the min weight. It also prints the weighted tabs so you can check how well the new weights match your targets. You can disable this summary by setting qa=False.

As mentioned above, stars, feedback, and contributions are welcome for the survey_tools package!
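For readers new to raking, here is a minimal sketch of the general technique (iterative proportional fitting) in plain pandas. The function `rake`, its arguments, and the toy data are all hypothetical; I am assuming a textbook version of the algorithm, and rake_weight may differ in its details, such as stopping rule or weight scaling.

```python
import pandas as pd


def rake(df, targets, max_iter=100, tol=1e-6):
    """Rake weights so weighted shares match `targets`.

    `targets` maps column name -> {level: target proportion}.
    Returns weights rescaled to a mean of 1 (my choice of scaling).
    """
    w = pd.Series(1.0, index=df.index)
    for _ in range(max_iter):
        max_gap = 0.0
        for col, props in targets.items():
            target = pd.Series(props)
            # Current weighted share of each level of this variable.
            current = w.groupby(df[col]).sum() / w.sum()
            max_gap = max(max_gap, (current - target).abs().max())
            # Scale each row by target share / current share for its level.
            w = w * df[col].map(target / current)
        if max_gap < tol:  # every margin was already on target this pass
            break
    return w * len(w) / w.sum()


toy = pd.DataFrame({
    'gender': ['M', 'M', 'M', 'F', 'F', 'M', 'M', 'F'],
    'age': ['young', 'old', 'old', 'young', 'old', 'old', 'young', 'old'],
})
targets = {
    'gender': {'M': 0.5, 'F': 0.5},
    'age': {'young': 0.4, 'old': 0.6},
}

weights = rake(toy, targets)
print((weights.groupby(toy['gender']).sum() / weights.sum()).round(3))
print((weights.groupby(toy['age']).sum() / weights.sum()).round(3))
```

Each pass fixes one margin at a time, which slightly disturbs the others, so the loop repeats until all margins are on target at once. Very large or very small weights at the end are a sign that the sample and the targets are far apart.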
