Reading in data: Youth Tobacco Survey

Peter Ralph

2026-01-05

https://uodsci.github.io/ds435

Set-up:

import pandas as pd

The National Youth Tobacco Survey

The NYTS is a CDC-funded national survey, run since 1999, that measures rates of tobacco use among youth.

Datasets, by year, are currently available on this page.

We’re going to look at the 1999 version. Go read the documentation.

Download

We’ve got the option to download an MS Access or SAS file. We’ll use the SAS version. Get the files:

mkdir -p data
wget -P data https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/zip_files/1999_Codebook_Dataset_SAS.zip
unzip -d data data/1999_Codebook_Dataset_SAS.zip

and then read the PDF. (5-minute reading interlude)

orig_yts = pd.read_sas("data/nyts1999public.sas7bdat").convert_dtypes()
orig_yts

	STUDNTID	QN2	QN3	QN4A	QN4B	QN4C	QN4D	QN4E	QN4F	QN5	...	QN70	QN71	QN72	QN1	QN16	WT	PSU2	STRATUM2	RACE	YEAR
0	b'0000001'	1	4	<NA>	<NA>	<NA>	<NA>	<NA>	1	6	...	2	1	1	8	1	1.18729	12085	110	1	1999
1	b'0000002'	2	5	1	<NA>	<NA>	<NA>	<NA>	1	6	...	2	6	6	8	8	1.143755	12085	110	1	1999
2	b'0000003'	2	5	<NA>	<NA>	<NA>	<NA>	<NA>	1	6	...	2	4	1	7	7	1.143755	12085	110	1	1999
3	b'0000004'	2	5	<NA>	<NA>	<NA>	<NA>	<NA>	1	6	...	2	5	4	7	2	1.143755	12085	110	1	1999
4	b'0000005'	1	5	<NA>	<NA>	<NA>	<NA>	<NA>	1	6	...	2	6	3	8	1	1.143755	12085	110	1	1999
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
15053	b'0015054'	2	7	<NA>	1	<NA>	<NA>	<NA>	<NA>	2	...	2	4	1	9	9	0.704504	910061	902	4	1999
15054	b'0015055'	2	7	<NA>	1	<NA>	<NA>	<NA>	<NA>	2	...	2	1	1	9	1	0.704504	910061	902	4	1999
15055	b'0015056'	1	7	<NA>	1	<NA>	<NA>	<NA>	<NA>	2	...	2	3	1	9	1	0.704504	910061	902	4	1999
15056	b'0015057'	1	7	<NA>	1	<NA>	<NA>	<NA>	<NA>	2	...	2	1	1	9	1	0.704504	910061	902	4	1999
15057	b'0015058'	2	7	<NA>	1	<NA>	<NA>	<NA>	<NA>	2	...	2	1	1	9	1	0.704504	910061	902	4	1999

15058 rows × 83 columns
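Note that STUDNTID shows up as Python bytes (b'0000001'): read_sas leaves character columns undecoded unless it is given an encoding argument. Here is a minimal sketch of decoding after the fact, using a small made-up frame rather than the real file. The choice of "latin-1" is an assumption for illustration; for all-digit IDs the common encodings give the same result.

```python
import pandas as pd

# Toy stand-in for the STUDNTID column as read_sas returns it (bytes, not text).
toy = pd.DataFrame({"STUDNTID": [b"0000001", b"0000002", b"0000003"]})

# Decode the bytes to text; "latin-1" is an assumed encoding, not taken from the codebook.
toy["STUDNTID"] = toy["STUDNTID"].str.decode("latin-1").astype("string")

print(toy["STUDNTID"].tolist())
```

Passing encoding= to pd.read_sas directly should have a similar effect on every character column at once.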

Summaries

The data should all be from 1999? Oh, good:

orig_yts.YEAR.value_counts(dropna=False)

YEAR
1999    15058
Name: count, dtype: Int64

Are the STUDNTID numbers all unique? Oh, good.

orig_yts.STUDNTID.value_counts().max()

np.int64(1)
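Another way to ask the same question, sketched on made-up IDs (the values below are hypothetical, with one deliberate repeat), is to use is_unique and duplicated, which also make it easy to pull out the offending rows if there are any:

```python
import pandas as pd

# Hypothetical toy IDs; the last one repeats the first.
ids = pd.Series([b"0000001", b"0000002", b"0000001"])

print(ids.is_unique)           # False: at least one ID repeats
print(ids.duplicated().sum())  # number of rows that repeat an earlier ID

# keep=False marks every row involved in a repeat, not just the later copies.
dupes = ids[ids.duplicated(keep=False)]
print(dupes)
```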

What is WT? It’s not weight-in-pounds:

orig_yts.WT.describe()

count     15058.0
mean          1.0
std      0.554389
min      0.037618
25%      0.647862
50%      0.889969
75%      1.253308
max       5.51623
Name: WT, dtype: Float64
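The mean of exactly 1.0 is what one would expect from a normalized sampling weight, but the codebook is the authority on what WT actually is. Assuming it is a sampling weight, here is a rough sketch, on made-up numbers, of how a weighted proportion differs from an unweighted one. The column name "smoked" and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up responses (1 = yes, 0 = no) and made-up weights that average 1.0.
toy = pd.DataFrame({
    "smoked": [1, 0, 1, 0],
    "WT": [0.5, 1.5, 1.0, 1.0],
})

# Unweighted: every respondent counts equally.
unweighted = toy["smoked"].mean()

# Weighted: each respondent counts in proportion to WT.
weighted = np.average(toy["smoked"], weights=toy["WT"])

print(unweighted, weighted)
```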

Question 1

Information about question 1 from the brochure: “How old are you?” with values from 1=“9 years” to 13=“21 years”.

orig_yts.QN1.value_counts(dropna=False)

QN1
5       2762
4       2538
6       1915
3       1831
8       1779
7       1779
9       1750
10       450
2        102
<NA>      62
11        36
13        36
1         14
12         4
Name: count, dtype: Int64

Let’s recode this

We’ll store all the new variables in a dictionary and make that into a DataFrame at the end (in that order, because pandas).

yts = {}
yts['age'] = orig_yts.QN1 + 8
yts['age'].value_counts(dropna=False)

QN1
13      2762
12      2538
14      1915
11      1831
16      1779
15      1779
17      1750
18       450
10       102
<NA>      62
19        36
21        36
9         14
20         4
Name: count, dtype: Int64

Question 3

Information about question 3 from the brochure: “What grade are you in?” with values from 1=6th to 7=12th, and 8=“ungraded or other grade”.

yts['grade'] = orig_yts.QN3 + 5
yts['grade'].value_counts(dropna=False)

QN3
7       2982
8       2617
6       2400
9       1801
12      1780
11      1696
10      1639
<NA>     123
13        20
Name: count, dtype: Int64
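Adding 5 turns the code 8 (“ungraded or other grade”) into a grade of 13, which is not a real grade. One alternative, sketched here on made-up codes, is to treat that answer as missing. Whether that is the right call depends on the analysis, so this is an illustration and not a recommendation.

```python
import pandas as pd

# Made-up QN3 codes: 6th grade, 12th grade, "ungraded or other", and a missing answer.
qn3 = pd.Series([1, 7, 8, pd.NA], dtype="Int64")

# True only where the answer was code 8; missing answers count as False here.
is_ungraded = (qn3 == 8).fillna(False)

# Shift codes to grade numbers, then blank out the "ungraded or other" rows.
grade = (qn3 + 5).mask(is_ungraded)

print(grade)
```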

Consistency checking

pd.crosstab(yts['grade'], yts['age'])

QN1	9	10	11	12	13	14	15	16	17	18	19	20	21
QN3
6	4	97	1719	506	56	5	1	0	0	1	0	0	8
7	2	2	94	1939	847	82	13	0	1	0	0	0	2
8	0	1	1	78	1810	646	69	6	0	0	1	0	3
9	1	0	2	0	36	1129	544	71	11	3	0	0	2
10	0	0	1	0	0	41	1079	458	53	6	0	0	0
11	0	0	0	0	0	1	59	1174	420	34	4	1	3
12	3	0	0	0	0	0	6	66	1256	405	31	3	7
13	3	1	0	0	0	0	1	2	3	0	0	0	10
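Most of the counts in the table sit where age is 5 or 6 more than grade, so the off-diagonal cells (a 21-year-old in 6th grade, say) are the ones worth a second look. Here is a minimal sketch of flagging such rows, on a made-up frame. The cutoffs of 3 and 8 are an arbitrary assumption chosen for illustration, not something from the survey documentation.

```python
import pandas as pd

# Made-up respondents: two plausible, two implausible, one with missing age.
toy = pd.DataFrame({
    "age":   pd.array([11, 12, 21, 9, pd.NA], dtype="Int64"),
    "grade": pd.array([6, 7, 6, 12, 8], dtype="Int64"),
})

# Typical students have age - grade of about 5 or 6.
gap = toy["age"] - toy["grade"]

# Flag gaps outside an (assumed) plausible range; missing values are not flagged.
suspicious = ((gap < 3) | (gap > 8)).fillna(False)

print(toy[suspicious])
```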

Question 2

Information about question 2 from the brochure: “What is your sex?” with values 1=Female and 2=Male.

“Missing” can be a useful option for survey questions.

orig_yts.QN2.value_counts(dropna=False)

QN2
2       7490
1       7471
<NA>      97
Name: count, dtype: Int64

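import collections
# Any code not listed below (including a missing value) looks up as None,
# which the 'string' dtype then stores as <NA>.
d = collections.defaultdict(lambda: None, 
    { 1 : "F", 2 : "M" }
)
yts['sex'] = pd.Series([d[k] for k in orig_yts.QN2], dtype='string')
yts['sex'].value_counts(dropna=False)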

M       7490
F       7471
<NA>      97
Name: count, dtype: Int64
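The same recoding can probably be written with Series.map, which looks each value up in a dictionary and leaves anything not found as missing. This is a sketch on made-up codes, not a claim that it is better than the approach above. One practical difference is that map keeps the original index, while building a new Series from a list starts over with a default one.

```python
import pandas as pd

# Made-up QN2 codes, including one missing answer.
qn2 = pd.Series([1, 2, pd.NA, 2], dtype="Int64")

# Codes absent from the dictionary (here, the missing value) come out as missing.
sex = qn2.map({1: "F", 2: "M"}).astype("string")

print(sex.value_counts(dropna=False))
```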

Question 6

Information about question 6 from the brochure: “Have you ever tried cigarette smoking, even one or two puffs?”, with answers 1=yes and 2=no.

d = collections.defaultdict(lambda: None, 
    { 1 : True, 2 : False }
)
yts['ever_smoked'] = pd.Series([d[k] for k in orig_yts.QN6], dtype='boolean')
yts['ever_smoked'].value_counts(dropna=False)

False    7993
True     6943
<NA>      122
Name: count, dtype: Int64

Put it together

yts = pd.DataFrame(yts)
yts

	age	grade	sex	ever_smoked
0	16	9	F	False
1	16	10	M	True
2	15	10	M	True
3	15	10	M	True
4	16	10	F	True
...	...	...	...	...
15053	17	12	M	True
15054	17	12	M	False
15055	17	12	F	False
15056	17	12	F	False
15057	17	12	M	True

15058 rows × 4 columns

Summarize

We want to split-apply-combine. See this list of built-in aggregation methods.

yts.groupby(["grade","sex"]).aggregate(
    prop_smoked = ("ever_smoked", "mean"),
    n = ("ever_smoked", "size")
)

		prop_smoked	n
grade	sex
6	F	0.13108	1226
6	M	0.186632	1164
7	F	0.311315	1536
7	M	0.337772	1434
8	F	0.436179	1280
8	M	0.42445	1325
9	F	0.570439	869
9	M	0.551799	926
10	F	0.607843	816
10	M	0.632029	821
11	F	0.656827	813
11	M	0.704598	871
12	F	0.715438	870
12	M	0.716186	903
13	F	0.923077	13
13	M	0.666667	7

Missing Values

What about the missing values?

Question: what does pandas say the “mean” of [True, False, NA] is? (What should it say?)

pd.Series([True, False, pd.NA], dtype="boolean").mean()

np.float64(0.5)
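So by default the missing value is dropped before averaging. Here is a small sketch, on made-up data, of two ways to make that visible: asking mean not to skip missing values, and reporting the number of non-missing answers ("count") next to the number of rows ("size") in a grouped summary. The name n_answered is just a label chosen for this example.

```python
import pandas as pd

answers = pd.Series([True, False, pd.NA], dtype="boolean")
skipping = answers.mean()            # missing value dropped: mean of [True, False]
strict = answers.mean(skipna=False)  # missing value propagates to the result

# Made-up grades and answers; grade 6 has one missing answer.
toy = pd.DataFrame({
    "grade": [6, 6, 6, 7, 7],
    "ever_smoked": pd.array([True, False, pd.NA, True, True], dtype="boolean"),
})

summary = toy.groupby("grade").aggregate(
    prop_smoked=("ever_smoked", "mean"),  # computed over non-missing answers only
    n=("ever_smoked", "size"),            # rows in the group, missing or not
    n_answered=("ever_smoked", "count"),  # rows with a non-missing answer
)

print(skipping, strict)
print(summary)
```

When n and n_answered differ, the proportion is based on fewer respondents than the group size suggests.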