Thursday, September 23, 2021

Census Household Pulse Survey - Tips for Analyzing Sexual Orientation and Gender Identity

 Hooray! The US Census has finally provided estimates of the sexual and gender minority populations in the United States!

I am in the process of learning how these numbers work, and am eager to pass along what I've learned to other researchers.

As part of the "Household Pulse Survey", a weekly survey of the entire US population designed to gather vital information on the COVID-19 pandemic and related topics, the Census included items on sexual orientation and gender identity starting in week 34. As of this writing, there are 3 weeks of data to work with - a bit over 200,000 respondents, already approaching the sample size of a full year of BRFSS data! Even more important, respondents are not preselected by which state they live in, or by whether they are answering an out-of-state cell phone (see my recent article in AJPM for more detail on that).


Comparability to BRFSS

Many of the questions are identical to those fielded in BRFSS, or can be easily transformed to a comparable format. The sexual orientation item is nearly identical, simply requiring a recode of 5="I don't know" to 7, and -99 to 9.
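
The sexual orientation recode described above can be sketched in a short data step. This is only a sketch under some assumptions: the input dataset name (hps), the output dataset name (hps_so), and the new variable name (SXORIENT) are placeholders of mine, and I am assuming the Pulse item is stored as SEXUAL_ORIENTATION, so check the data dictionary for your release before running it.

```sas
data hps_so;                     /* hps_so: hypothetical output dataset */
  set hps;                       /* hps: hypothetical input dataset */
  /* SEXUAL_ORIENTATION is the assumed name of the Pulse item */
  if SEXUAL_ORIENTATION = 5 then SXORIENT = 7;          /* "I don't know" */
  else if SEXUAL_ORIENTATION = -99 then SXORIENT = 9;   /* -99 recoded to 9 */
  else SXORIENT = SEXUAL_ORIENTATION;                   /* other codes carried over */
run;
```

All other codes are carried over unchanged, so any additional missing-value codes in the file would still need their own handling.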

The sex at birth and gender identity questions are not exactly comparable, however. There are some complications that require a bit of finesse before using the gender identity variables.

The raw data files can be downloaded from: www.census.gov/programs-surveys/household-pulse-survey/datasets.html#phase3.2 .


Tip #1: Restrict to AGENID_BIRTH=2.

For both sexual orientation and gender identity, any analysis should be restricted to cases where AGENID_BIRTH=2. AGENID_BIRTH is a variable indicating whether sex at birth was imputed (1) or not (2). Census used a "hot deck" imputation technique to impute missing values for several key variables, including sex at birth (EGENID_BIRTH) and current gender identity (GENID_DESCRIBE). When sex at birth or current gender identity is missing, Census replaces the missing value with a value from another respondent, in a (not quite) random fashion. As a result, about half of the respondents whose sex at birth was imputed as male end up with a current gender identity of female (and vice versa), which would indicate that they are transgender. Because sex at birth is imputed for about 3% of the total population, about 1.5% of people (half of that 3%) are unintentionally imputed to be transgender when they are in fact cisgender - a common enough occurrence that it overwhelms the population of people who are actually transgender.

Most researchers won't want to go to all the trouble of performing a full multiple imputation in which these variables strongly inform one another (as opposed to being treated as nearly independent, as this particular hot deck imputation technique appears to assume). They should just take the simple route of restricting the analysis to AGENID_BIRTH=2.
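
One way to see the problem for yourself, and then apply the restriction, is sketched below. The dataset names (hps, hps_reported) are placeholders of mine; the variable names are the ones discussed above. The crosstab is unweighted and is meant only as a look at how sex at birth and current gender identity line up within the imputed and non-imputed groups.

```sas
/* Crosstab of sex at birth by current gender identity,      */
/* shown separately for imputed (1) and reported (2) cases.  */
proc freq data=hps;              /* hps: hypothetical input dataset */
  tables AGENID_BIRTH * EGENID_BIRTH * GENID_DESCRIBE / missing nocol nopercent;
run;

/* Keep only respondents whose sex at birth was not imputed. */
data hps_reported;               /* hypothetical output dataset */
  set hps;
  where AGENID_BIRTH = 2;
run;
```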

By implication, anyone looking at sexual orientation should probably also make this restriction, especially when also looking at sex (which one should always do when looking at sexual orientation), otherwise you'll get gay men in your lesbian group, and so on. Not as large an error as for gender identity, but why use analytic groups you know are premixed in such a way as to minimize distinctions between the groups?


Tip #2: Use an expansive definition of transgender.

Don't be fooled by the simplicity of the "current gender identity" variable (GENID_DESCRIBE), which looks like it differentiates between people who are transgender and cisgender male or female (and another group "none of these" - I'm holding this group out separately because I haven't yet examined this group in detail).

But GENID_DESCRIBE is about the respondent's current gender identity, and many transgender people prefer to identify as "male" or "female" rather than as "transgender". Therefore, to identify transgender people in the Household Pulse Survey, one should also look for people whose sex at birth was male and whose current gender identity is female (and vice versa).

Here is SAS code to accomplish that recode. It puts the results into a format closer to BRFSS, except that there is no "gender non-conforming" option (BRFSS=3), and "none of these" is held out as a separate category (HPS=4, recoded to 5 for convenience).

if AGENID_BIRTH=2 then do;
  * Male to Female Transgender - male at birth, now female or transgender ;
  if EGENID_BIRTH=1 and GENID_DESCRIBE in(2,3) then TRNSGNDR=1;
  * Female to Male Transgender - female at birth, now male or transgender ;
  else if EGENID_BIRTH=2 and GENID_DESCRIBE in(1,3) then TRNSGNDR=2;
  * Cisgender (a gender identity of -99 is counted as cisgender) ;
  else if (EGENID_BIRTH=1 and GENID_DESCRIBE in(1,-99))
       or (EGENID_BIRTH=2 and GENID_DESCRIBE in(2,-99)) then TRNSGNDR=4;
  * None of these ;
  else if GENID_DESCRIBE=4 then TRNSGNDR=5;
end;

In many surveys, this sort of recoding is not recommended, because any slip-up in coding sex at birth or current gender identity is much too likely to result in falsely identifying cisgender respondents as transgender. However, in the Household Pulse Survey (and some other surveys), there is a follow-up question to confirm when people identify as one sex at birth and a different current gender, so this built-in data cleaning is probably sufficient protection against miscoding.
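
To check the recode, it can help to attach labels to TRNSGNDR and look at the unweighted counts. This is just a sketch: the format name (trnsfmt), the label wording, and the dataset name (hps_recoded) are my own placeholders, with hps_recoded standing for whatever dataset holds the TRNSGNDR variable created by the recode above.

```sas
proc format;
  value trnsfmt                  /* hypothetical format name */
    1 = 'Transgender, male to female'
    2 = 'Transgender, female to male'
    4 = 'Cisgender'
    5 = 'None of these'
    . = 'Not classified';        /* imputed sex at birth, or no rule matched */
run;

proc freq data=hps_recoded;      /* hps_recoded: hypothetical dataset with TRNSGNDR */
  tables TRNSGNDR / missing;
  format TRNSGNDR trnsfmt.;
run;
```

The "Not classified" row includes everyone with AGENID_BIRTH=1, since the recode leaves TRNSGNDR missing for them.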


Tip #3: Combine waves, but adjust the weights

While each weekly wave of the Household Pulse Survey is a large survey, breaking the numbers down into subpopulations (e.g. by age, state, health status, etc.) can result in some pretty unstable estimates. Combining multiple waves is a great way to combat this instability - but be warned, the weights need to be adjusted to account for the fact that each week's weights are intended to represent the whole US population. The quick and dirty way to do this is simply to divide the weights by the number of waves you are combining. For instance, I started with 3 waves, so my adjusted weights are simply generated as PWEIGHT/3.

Eventually, I'll probably do something a bit more sophisticated with adjusting the weights when combining waves, particularly if the sample size starts changing dramatically from one wave to another, or the balance between state-level sampling fractions is fiddled with. I may also want to multiply some sort of "recency bias" into the weights if the outcome is one where up-to-the-minute estimation is more conceptually important (i.e. making more recent observations weigh more than distantly past ones). But all that is in the future. For now, a simple division by the number of waves concatenated is sufficient.
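
The simple version of the weight adjustment might look like the sketch below. The dataset names (hps_wk34, hps_wk35, hps_wk36, hps_combined), the macro variable nwaves, and the new variable PWEIGHT_ADJ are placeholders of mine, and I am assuming each weekly file carries a WEEK variable identifying the wave.

```sas
%let nwaves = 3;                 /* number of waves being combined */

data hps_combined;               /* hypothetical pooled dataset */
  set hps_wk34 hps_wk35 hps_wk36;   /* hypothetical weekly datasets */
  PWEIGHT_ADJ = PWEIGHT / &nwaves;  /* each wave now counts for 1/3 of the total */
run;

/* Sum of original and adjusted weights, by wave and overall. */
proc means data=hps_combined sum;
  class WEEK;                    /* WEEK: assumed wave identifier */
  var PWEIGHT PWEIGHT_ADJ;
run;
```

If the adjustment worked, the overall sum of PWEIGHT_ADJ should be about the size of one week's weighted population rather than three times that.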

I have also included the "wave" identifier as a stratum in proc surveyfreq. No strong theoretical basis for doing so, but it seemed like a good idea. Very much open to suggestions from others about how to best utilize the stratum and psu specifications.
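
For anyone who wants to try the same setup, here is roughly what that specification looks like. It is a sketch, not a recommendation: hps_combined and PWEIGHT_ADJ are placeholder names for a pooled dataset and a weight already divided by the number of waves, WEEK is assumed to be the wave identifier, and no cluster (psu) statement is included.

```sas
proc surveyfreq data=hps_combined;   /* hypothetical pooled dataset */
  strata WEEK;                       /* wave identifier used as the stratum */
  weight PWEIGHT_ADJ;                /* hypothetical adjusted person weight */
  tables TRNSGNDR;                   /* weighted distribution of the recode */
run;
```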

more to come...