### Data and variables

We used public-access data files from the 1999–2012 National Health and Nutrition Examination Survey (NHANES), conducted by the National Center for Health Statistics (NCHS). The first NHANES was administered in 1971. Since 1999 the survey has been a continuous program, examining a nationally representative sample of about 5,000 persons each year.

The NHANES interview includes demographic, socioeconomic, dietary, and health-related questions. The examination component consists of medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel.

Body weight and height were measured by trained health technicians at medical examination sites. BMI was calculated as body weight in kilograms divided by the square of height in meters (kg/m^{2}). Following the National Institutes of Health Clinical Guidelines [15], we classified subjects as underweight (BMI < 18.5), normal weight (18.5 ≤ BMI < 25), overweight (25 ≤ BMI < 30) or obese (BMI ≥ 30).
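The BMI formula and the four cut-offs above can be sketched in a few lines of Python. This is only an illustration: the function names `bmi` and `bmi_category` are hypothetical and are not taken from the study's analysis code.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value to the four categories used in the text."""
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal weight"
    if value < 30:
        return "overweight"
    return "obese"


# Illustrative respondent (made-up values): 70 kg, 1.75 m.
example = bmi(70.0, 1.75)
print(round(example, 1), bmi_category(example))
```

Note that the lower bound of each category is inclusive, so a BMI of exactly 25 counts as overweight and exactly 30 as obese.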

In the NHANES questionnaires, each respondent was asked about smoking behavior. We defined subjects as never smokers (those who had smoked fewer than 100 cigarettes in their lifetime), current smokers (those who had smoked at least 100 cigarettes in their lifetime and smoked at the time of the survey), or former smokers (those who had smoked at least 100 cigarettes in their lifetime but did not smoke at the time of the survey). Current smokers were asked about their cigarette consumption, which we categorized by cigarettes smoked per day (1–14, 15–24 and 25+). Former smokers were asked about the length of time since smoking cessation, which we categorized by number of years since quitting (<1, 1–4, 5–9 and 10+).
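As a rough sketch, the smoking definitions above amount to the following classification rules. The function names and boolean inputs are hypothetical (they are not NHANES variable names), and the handling of fractional values at category boundaries is an assumption.

```python
def smoking_status(smoked_100_lifetime: bool, smokes_now: bool) -> str:
    """Never / current / former smoker, following the 100-cigarette definition."""
    if not smoked_100_lifetime:
        return "never"
    return "current" if smokes_now else "former"


def cigarettes_per_day_group(cigarettes_per_day: float) -> str:
    """Intensity group for current smokers: 1-14, 15-24 or 25+."""
    if cigarettes_per_day < 1:
        raise ValueError("current smokers are assumed to report at least 1 per day")
    if cigarettes_per_day < 15:
        return "1-14"
    if cigarettes_per_day < 25:
        return "15-24"
    return "25+"


def years_since_quitting_group(years: float) -> str:
    """Cessation group for former smokers: <1, 1-4, 5-9 or 10+ years."""
    if years < 0:
        raise ValueError("years since quitting cannot be negative")
    if years < 1:
        return "<1"
    if years < 5:
        return "1-4"
    if years < 10:
        return "5-9"
    return "10+"
```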

We included a set of demographic variables associated with BMI: age, height, race and ethnicity (white, black, Hispanic, and other, which included American Indian, Alaska Native, Pacific Islander, and Asian), foreign or U.S. born, highest educational attainment (less than high school, high school, more than high school), marital status (married and unmarried) and number of children (women only). In addition, we included a quadratic term in age (age squared) to capture a non-linear relationship between age and BMI. We also controlled for survey cycle (1999–2000, 2001–02, 2003–04, 2005–06, 2007–08, 2009–10, and 2011–12), and we constructed sample weights across survey cycles using a formula from NCHS guidelines [16].
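The text does not spell out the weighting formula, so the following is only a sketch of one common reading of the NCHS guidance for pooling continuous NHANES cycles: the 1999–2002 records use the four-year weight scaled by 2/7, and each later two-year cycle uses its two-year weight scaled by 1/7, where 7 is the number of pooled cycles. The choice of examination weights (`WTMEC4YR`, `WTMEC2YR`) rather than interview weights is an assumption, as is the function name.

```python
ALL_CYCLES = [
    "1999-2000", "2001-02", "2003-04", "2005-06",
    "2007-08", "2009-10", "2011-12",
]
# Cycles for which NCHS released a combined four-year weight.
FOUR_YEAR_CYCLES = {"1999-2000", "2001-02"}


def combined_weight(cycle: str, wt_4yr: float, wt_2yr: float) -> float:
    """Pooled 14-year weight for one respondent (assumed NCHS-style rescaling).

    wt_4yr stands in for WTMEC4YR and wt_2yr for WTMEC2YR; only the one that
    applies to the respondent's cycle is used.
    """
    if cycle not in ALL_CYCLES:
        raise ValueError(f"unknown survey cycle: {cycle}")
    n_cycles = len(ALL_CYCLES)
    if cycle in FOUR_YEAR_CYCLES:
        return (2 / n_cycles) * wt_4yr
    return (1 / n_cycles) * wt_2yr
```

Rescaling in this way keeps the pooled weights summing to roughly the size of the average annual population over the period instead of counting it seven times.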

Notably, participants who lived in a single-person household during part of the 1999–2000 cycle were not asked about their marital status [15], and NHANES subsequently imputed it for most of them. However, marital status remained undetermined for about 300 subjects, for whom we created an indicator with the value of 1 for respondents with missing marital status and 0 otherwise. We did the same for women who had no information on their pregnancy history. In this manner, subjects with missing information were not excluded and were retained in the full analyses.
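A minimal sketch of the missing-indicator approach follows. The text states only that an indicator was created; setting the substantive `married` dummy to 0 for respondents with missing status is an assumption of this sketch, as are the function name and the use of `None` for a missing value.

```python
from typing import Optional, Tuple


def marital_dummies(marital_status: Optional[str]) -> Tuple[int, int]:
    """Return (married, marital_missing) for one respondent.

    A missing status yields married = 0 (assumed) and marital_missing = 1,
    so the respondent stays in the estimation sample.
    """
    if marital_status is None:
        return 0, 1
    married = 1 if marital_status == "married" else 0
    return married, 0
```

Both columns then enter the regression, so the coefficient on `marital_missing` absorbs the average difference for respondents whose status is unknown.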

We restricted our sample to adults aged 25 to 64 years. We excluded respondents aged 18–24 years because of inconsistencies in the 100-cigarette lifetime question between surveys. We also excluded pregnant women and respondents aged 65+ years because both smoking and weight status may be influenced by chronic illnesses at older ages.

### Empirical models

In this study, we examined the associations between smoking status (i.e. never, current and former smokers) and both a continuous measure of BMI and conventional BMI categories. We employed quantile regression and an ordered probit model to study the association of smoking status and weight across the BMI distribution and across BMI categories.

First, we examined the association between smoking status and a continuous measure of BMI using ordinary least squares (OLS), with the following model:

$$ BMI={\beta}_0+{\beta}_1S+{\beta}_2X+{\beta}_3Z+\varepsilon \tag{1} $$

where S represents the smoking status categories (never, current and former smokers); X is a vector of individual characteristics; Z is a vector of survey cycle indicators; and ε is an error term.

The main coefficient of interest is *β*_{1}, which captures the relationship between smoking status and BMI. Positive or negative coefficients on current or former smokers indicate that, compared to never smokers, these groups have higher or lower BMI, respectively. Next, the vector X includes demographic variables such as age, race and ethnicity, US or foreign born, education, and marital status. The coefficients *β*_{2} capture the associations between these variables and BMI. Finally, the vector Z refers to survey cycles, which allows us to account for unobservable confounding variables that may vary across survey cycles.
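To make the role of the smoking coefficients concrete, Eq. (1) can be written out with never smokers as the reference group. The indicator names *Current* and *Former* and the subscripts *c* and *f* are notation introduced here for illustration only:

$$ BMI={\beta}_0+{\beta}_{1c}\,Current+{\beta}_{1f}\,Former+{\beta}_2X+{\beta}_3Z+\varepsilon $$

Under this coding, and holding X and Z fixed, the expected difference in BMI between a current smoker and a never smoker is

$$ E\left[BMI\mid Current=1,X,Z\right]-E\left[BMI\mid Never,X,Z\right]={\beta}_{1c} $$

and the corresponding difference for former smokers is \( {\beta}_{1f} \).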

While the OLS regression estimates the association of smoking with average BMI, this relationship may differ across the BMI distribution [17]. For example, compared to never smokers, being a current or former smoker may be associated with a different reduction or increase in BMI among those in the upper tail of the distribution than among those in the lower tail. Therefore, we also employed quantile regression (QR), which allows us to examine the entire conditional distribution of BMI and to determine whether the associations of smoking and demographic variables with BMI differed across that distribution. We estimated the following model:

$$ BMI={\beta}_0^p+{\beta}_1^pS+{\beta}_2^pX+{\beta}_3^pZ+{\varepsilon}^{\left(\mathrm{p}\right)} \tag{2} $$

where *p* denotes the quantile, that is, the proportion of the population with BMI below the *p*^{th} conditional quantile. In this study, we estimated the model at five quantiles: the 10th, 25th, 50th, 75th and 90th. \( {\beta}_1^p \) represents the association between smoking and BMI at the *p*^{th} conditional quantile.
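Quantile regression estimates are usually obtained by minimizing an asymmetrically weighted sum of absolute residuals, the "check" or pinball loss. The sketch below shows this for the simplest case, an intercept-only model, where the minimizer is a sample quantile. The BMI values are made up for illustration, the function names are hypothetical, and this is not the estimation routine used in the study.

```python
from typing import Sequence


def check_loss(residual: float, p: float) -> float:
    """Check function rho_p(u) = u * (p - 1[u < 0])."""
    return residual * (p - (1.0 if residual < 0 else 0.0))


def total_check_loss(values: Sequence[float], candidate: float, p: float) -> float:
    """Sum of check losses when every observation is predicted by `candidate`."""
    return sum(check_loss(v - candidate, p) for v in values)


def best_constant(values: Sequence[float], p: float) -> float:
    """Observed value that minimizes the total check loss (intercept-only fit)."""
    return min(sorted(values), key=lambda c: total_check_loss(values, c, p))


# Made-up BMI values, for illustration only.
bmi_values = [19.0, 22.0, 24.0, 27.0, 31.0, 36.0, 41.0]
for p in (0.10, 0.50, 0.90):
    print(p, best_constant(bmi_values, p))
```

With covariates, the same loss is minimized over the coefficient vector, which is why the estimates of \( {\beta}_1^p \) can differ from one quantile to the next.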

The relationship between smoking and BMI may be non-linear, so we also estimated the association between smoking status and BMI categories, CBMI (underweight = 1, normal weight = 2, overweight = 3, and obese = 4), using an ordered probit (OP) model. We estimated the following model:

$$ CBMI={\beta}_0+{\beta}_1S+{\beta}_2X+{\beta}_3Z+\varepsilon \tag{3} $$

However, the coefficients of the OP model cannot be interpreted directly in the way that coefficients of linear models can. We therefore calculated marginal effects, which measure the change in the probability of being underweight, normal weight, overweight or obese associated with a change from never smoker to current or former smoker, with all other independent variables held at their means.
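The calculation behind such marginal effects can be sketched as follows, under the standard ordered probit assumptions of a normally distributed error and ordered cut-points separating the four categories. All numeric values (linear index, coefficient, cut-points) are hypothetical and chosen only to show the mechanics; the function names are likewise illustrative.

```python
import math
from typing import List, Sequence


def norm_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def category_probabilities(index: float, cutpoints: Sequence[float]) -> List[float]:
    """P(CBMI = k) for k = 1..len(cutpoints) + 1, given the linear index x'b."""
    bounds = [0.0] + [norm_cdf(c - index) for c in cutpoints] + [1.0]
    return [bounds[k + 1] - bounds[k] for k in range(len(bounds) - 1)]


def smoking_marginal_effects(
    index_never_at_means: float, beta_smoker: float, cutpoints: Sequence[float]
) -> List[float]:
    """Change in each category probability when switching never -> smoker,
    with the other covariates fixed at their means."""
    base = category_probabilities(index_never_at_means, cutpoints)
    alt = category_probabilities(index_never_at_means + beta_smoker, cutpoints)
    return [a - b for a, b in zip(alt, base)]


# Hypothetical values: three cut-points define the four BMI categories.
effects = smoking_marginal_effects(0.3, -0.2, [-2.0, 0.0, 1.0])
print([round(e, 3) for e in effects])
```

Because the four probabilities always sum to one, the four marginal effects sum to zero: a negative coefficient shifts probability away from the obese category and toward the lower categories.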