Exercise: Youth Tobacco Survey

Author

Peter Ralph

Published

January 9, 2026

Youth Tobacco Survey

The National Youth Tobacco Survey is a federally funded survey that assesses tobacco use among youth. Since 1999 it has been given to middle and high school students across the country, and it asks a rather large number of questions. Some of the questions have changed over the years, but they are generally stable. It is a stratified survey: there’s a lot going into the survey design that we won’t get into here.

The raw data from past years until 2019 are available (for now) at this link. The data are not harmonized: to some extent, different formats are available for different years. Your job will be to extract some summary statistics from a few particular years. You should download the SAS version of the datasets.

The questions we developed in class are:

How does the amount of disposable income correlate with cigarette versus vape use? To answer this, we want to ask about cigarette versus vape use among people who smoke regularly, so:
- subset the data to “people who smoke regularly”, and
- of these people, find out whether (or perhaps how much) they smoke each.
- Then, report the percentages of (cigarette, vape, both, neither).
When did students start smoking, and when did they start smoking regularly?
- Find the mean reported age at which students report having ever smoked (vape or cigarette), and if possible,
- the mean reported age at which students report starting to smoke regularly.
Is there a lower response rate for later questions?
How have social pressures changed over time: what proportion of students would use X if offered, where X={cigarette,vape}?
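
As a rough sketch of the last step of the first question (reporting percentages of cigarette, vape, both, neither), here is one way to build the four categories from two yes/no columns. The column names `smokes_cig` and `vapes` and the toy data are made up for illustration; they are not actual NYTS variables, and the real datasets will need recoding first.

```python
import pandas as pd

# Toy data with hypothetical column names (NOT actual NYTS variables).
toy = pd.DataFrame({
    "smokes_cig": pd.array([True, True, False, False, True, pd.NA], dtype="boolean"),
    "vapes":      pd.array([True, False, True, False, True, True], dtype="boolean"),
})

def use_category(cig, vape):
    """Label one respondent; a missing answer to either question stays missing."""
    if pd.isna(cig) or pd.isna(vape):
        return pd.NA
    if cig and vape:
        return "both"
    if cig:
        return "cigarette"
    if vape:
        return "vape"
    return "neither"

toy["use"] = pd.Series(
    [use_category(c, v) for c, v in zip(toy["smokes_cig"], toy["vapes"])],
    dtype="string",
    index=toy.index,
)

# Percentages among respondents who answered both questions
# (value_counts drops missing values by default).
use_pct = toy["use"].value_counts(normalize=True) * 100
print(use_pct)
```

Note that these are unweighted percentages of respondents; they ignore the survey design.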

Example code from class:

Here’s code for working with the 1999 dataset. However, you should not use the 1999 dataset, because the survey has no information about vaping until about 2014. Also, beware that the coding for sex changes at some point: check the PDF!

First, recode some questions:

import collections
import pandas as pd

sasfile = "data/nyts1999public.sas7bdat"
orig_yts = pd.read_sas(sasfile).convert_dtypes()

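# start a new data frame, indexed like the raw data, to hold the recoded columns
yts = pd.DataFrame(index=orig_yts.index)

# recode 'age' and 'grade': the offsets map response code 1 to age 9 and grade 6
yts['age'] = orig_yts.QN1 + 8
yts['grade'] = orig_yts.QN3 + 5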

# recode 'sex'
d = collections.defaultdict(lambda: None, 
    { 1 : "F", 2 : "M" }
)
yts['sex'] = pd.Series([d[k] for k in orig_yts.QN2], dtype='string')

# and Q6, "have you ever smoked"
d = collections.defaultdict(lambda: None, 
    { 1 : True, 2 : False }
)
yts['ever_smoked'] = pd.Series([d[k] for k in orig_yts.QN6], dtype='boolean')
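
An alternative to the `defaultdict` pattern above is `Series.map` with a plain dictionary, which leaves any code not in the dictionary (and any missing value) as missing. This is only a sketch on a toy column, not the class code; the values in `codes` are invented, with 3 standing in for a code that the lookup does not cover.

```python
import pandas as pd

# Toy stand-in for a survey column (invented values, not real data);
# 3 is a code that is absent from the lookup tables below.
codes = pd.Series([1, 2, pd.NA, 3, 1], dtype="Int64")

# Codes not found in the dictionary, and missing values, become <NA>.
sex = codes.map({1: "F", 2: "M"}).astype("string")
ever_smoked = codes.map({1: True, 2: False}).astype("boolean")

print(sex)
print(ever_smoked)
```

One advantage of this form is that the result keeps the index of the original column, so it lines up with the raw data even if the index is not the default 0, 1, 2, ....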

Next, some summaries: for instance, we can look at the age distribution:

yts['age'].value_counts()

We can also get the proportion who have “ever smoked”, by grade and sex:

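# rows with a missing grade or sex are dropped by groupby's default settings
yts.groupby(["grade","sex"]).aggregate(
    # "mean" of a boolean column skips missing answers
    prop_smoked = ("ever_smoked", "mean"),
    # "size" counts every row in the group, including missing answers
    n = ("ever_smoked", "size")
)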
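
The same summarizing approach gives a first look at the question about response rates for later questions. The sketch below uses toy data with `QN`-style column names like those in the 1999 file, and it assumes that a missing value means the student did not answer and that the columns appear in questionnaire order; both assumptions should be checked against the real data.

```python
import pandas as pd

# Toy answers (invented); None marks a question left unanswered.
toy = pd.DataFrame({
    "QN1": [1, 2, 3, 4],
    "QN2": [1, 2, None, 2],
    "QN3": [1, None, None, 2],
})

# Fraction of respondents with a non-missing answer, one value per question.
response_rate = toy.notna().mean()
print(response_rate)
```

Plotting `response_rate` against question position would then show whether the rate drops off toward the end of the survey.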