Sunday, April 28, 2013

Research Directions

    Hey there blogfriends, I'm super excited because I'm going to have a first-author paper coming out in a few days - about the racial distribution of trees and pavement across the US - and exploring a few reasons that may explain it, like segregation (yes) and poverty (no). It looks like there's going to be some press on it, so keep an eye out.
    And my next first-author paper is getting really close to submission - so it's probably six months to a year from publication. That one's about the influence of living in more segregated cities on the probability of experiencing racial discrimination. That one's pretty interesting - lots of studies within one particular city or another have found that experiences of racial discrimination tend to be less common among Blacks who live in predominantly Black neighborhoods, and more common among Blacks who live in predominantly White neighborhoods. As far as I can tell, ours is the first to look at the degree to which the overall segregated character of the city (and her suburbs) affects reporting of racial discrimination experiences. We're seeing pretty dramatic results in that more segregation results in more experiences of racial discrimination, for Blacks, Hispanics, Whites and Asians.

    But what I'm stymied with at the moment is where to go after my most recent first-author paper - showing that gay men are more likely to be in excellent health than straight men... I'd love to get another paper on TBLG health out there, relatively soon, but it's challenging, because I have to do the work on my own dime and my own time. So here's some ideas, and I'd love to hear your thoughts on what would be most helpful (helpful in any sense - informing policy, improving science, satisfying curiosity - whatever greases your gears).

ONE: Improving Identification of Same-Sex Couples in Large Probability Datasets
    I know. Boring title. But here's why this has been floating my boat lately. When I was working on gay men in excellent health, I looked at the biggest dataset I could lay my hands on, the BRFSS. There were a fair number of same-sex married couples, even before same-sex marriage was legal anywhere in the US, which struck me as odd. Another thing that was odd is that their demographics (how old they were, how many kids they have, whether they served in the military, etc.) were a lot like heterosexually married people. I figured that what was most likely happening was that a small number of heterosexually-married people were accidentally mis-coded - and ended up being counted as same-sex couples. So, I threw them out of the analysis.
    BRFSS is especially vulnerable to this kind of error, but the problem is ubiquitous in any of the large probability samples that get used for research on same-sex couples - and rarely acknowledged.
    So what this project would be about is systematically going through the major datasets and trying to estimate how many of the same-sex couples identified are really same-sex couples, and how many are mis-coded heterosexually-coupled people.
    The main reason that it's important to do this project is that there are a lot of publications out there claiming that same-sex married couples are "just like" heterosexually-married couples. That may be a comforting message, and there's probably something to it, but a likely explanation that is almost never discussed is that a lot of those same-sex married couples are in fact heterosexuals. If we want an accurate picture, we need actual same-sex couples.
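    The arithmetic behind that worry can be sketched in a few lines of Python. Every number here is made up for illustration: the couple counts and the 1-in-1,000 mis-coding rate are assumptions, not estimates from BRFSS or any other dataset, and the function name is hypothetical. The sketch also ignores errors in the other direction (real same-sex couples mis-coded as different-sex).

```python
def share_miscoded(n_hetero_couples, n_true_same_sex, miscode_rate):
    """Fraction of apparent same-sex couples that are really mis-coded
    different-sex couples, assuming each different-sex couple is
    independently mis-coded with probability `miscode_rate`."""
    false_same_sex = n_hetero_couples * miscode_rate
    apparent_same_sex = n_true_same_sex + false_same_sex
    return false_same_sex / apparent_same_sex

# Hypothetical survey: 100,000 different-sex married couples,
# 100 real same-sex couples, and a 0.1% coding error rate.
share = share_miscoded(100_000, 100, 0.001)
print(f"{share:.0%} of the apparent same-sex couples are mis-coded")
```

    Under those assumed numbers, a coding error that touches only one different-sex couple in a thousand is enough to make half of the apparent same-sex couples spurious, because the different-sex group is so much larger.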

TWO: BLG health in relation to voting on marriage restrictions
    OK, so my thesis (never was able to get it published) was about the occurrence of suicide in relation to heteronormativity - the more heteronormative an area is, the higher the suicide rate there - especially for young men. I measured heteronormativity in three ways: the legal status of employment discrimination; how people voted on restricting marriage; how many same-sex couples the Census counted in an area.
     Given that nobody seems to care about employment discrimination any more these days, I figure that I should focus on the voting thing. The way I see it, how people in an area vote on restricting marriage to "one man and one woman" is a pretty good heteronormativity thermometer. There are some complications in that the wording is different from State to State, and the change in public attitudes is so rapid that a 60% endorsement rate today probably corresponds to an 80% endorsement rate in 2004. But assuming I can figure out a way to handle that, the other part is finding a dataset that has good BLG health measures in it.
    For my thesis, I used the overall suicide rate, and I didn't particularly care whether the people who died of self-inflicted injuries were "gay" or not. In fact, I suspect that the highest suicide risk associated with being gay or bisexual is before one declares openly to anyone else, and even before having sex, so it would be kind of silly to try to figure out who's who after they're dead. But I think that's one of the reasons I had trouble getting anyone interested in publishing it - it seems like people want to know how BLG people are affected by homophobia. Well, I'm interested in how heterosexuals are affected also. I very much doubt that it's a zero-sum game where heterosexuals gain some advantage while BLG people pay the price. I suspect it's much more likely that heterosexuals, too, are harmed by heteronormativity. And since there are a lot more of them, it should be even easier to pin that down. But I digress.
    So, I need a dataset that A) is a probability (random) sample of the US, B) has a large sample size (ideally in the tens of millions, but I'll have to settle for less), C) identifies who is gay, lesbian, bisexual, and heterosexual, D) has a high degree of spatial resolution so I can figure out what the local homophobia "temperature" is, E) has decent temporal resolution so I can figure out when people were sampled relative to important dates, and F) has decent measures of health in it.
    There are some datasets that come close to fitting the bill, but it's a challenge.

THREE: Transgender health from large population datasets
    There's only one publication out there about transgender health based on a probability sample - from the Massachusetts BRFSS. But there's the potential to do so much more. There are seven States that have asked about transgender identity on BRFSS. I'd love to collect the data from all seven, compare the basic demographics of transgender-identified people across the different question wordings & hypothesize about which questions work best. And then get into the health outcomes, much like the Massachusetts study did, but with much more data. I suspect that all of the question wordings are going to have a significant problem much like the same-sex married people identified in large population datasets - that is, even a very small number of errors in the coding of cisgender people is going to be a major headache. There's really only one way to handle that that I can think of - call them back to verify it - but I really can't see that happening anytime soon.

FOUR: The Real Blood Donors of Gaytown, USA
    There are just so many things wrong with banning gay blood donors. It made sense in 1985 (and frankly, it would have made even more sense earlier). But it doesn't make sense now, and everyone knows it. Including lots of gay men who donate blood anyway, and increasing numbers of young straight people who won't donate because they don't feel right about the discrimination. I'd love to be part of qualitative research on gay men who give blood. Why do they do it? How does it make them feel? What 'rules' about donating have they made for themselves to decide when they should and should not donate?
    There's a lot of interesting policy angles to wrangle through on this issue, but I think getting to know these guys would be really interesting - and informative in coming up with better deferral guidelines.

FIVE: Wage Gap and Death
    Strangely enough, there are only a handful of studies out there measuring how sexism affects health at a population level. Most of them use some sort of complicated mash of different ideas into an "index", and I hate indices - you never know what's really going on in there. So I took a simpler approach, just looking at the wage gap between men and women. It varies a lot - there are some parts of the country where women make almost as much as men, and some parts where men make about twice as much as women. What I expected to see was that women's mortality would be higher in areas where men make more. But I saw something completely different: where men make more relative to women, they live longer, but women's mortality is unrelated to the wage gap. I basically put this project on ice because I can't figure out a narrative that makes sense. But I could go back to it if y'all have fresh ideas.

So let me know, what do you think I should work on? And if you're feeling especially generous, for only $62,000, you get to decide.

Saturday, March 2, 2013

Origins of the Health Disparities Narrative

I recently did a guest lecture at Berkeley where the students asked me two questions that left me scratching my head...
1) When did the 'health disparities narrative' become dominant in public health, and 2) What dominant narratives about the health of socially-marginalized racial groups preceded it?
I don't know. But those are intriguing questions that deserve answers, so I'll ask for your indulgence as I flail around with some possible answers.

Defining the 'Health Disparities Narrative'
In public health, we tend to think about minority health in terms of 'health disparities'.
When we see that the health of a minority group is worse than that of socially-dominant groups, that is expected based on our narrative of how minority groups fit into social structures, and how these social structures influence health.
When we encounter exceptions to that general rule (cases that don't fit the narrative of health disparity) we tend to doubt the data and dismiss the findings. In those cases where the data shouts out over our attempts to silence it, we call it a 'paradox'.
So that's what I mean by the 'health disparities narrative' - an overarching narrative structure that strongly influences what we intuitively believe or doubt about the health of socially-marginalized groups. Which stories are 'easy' to tell, and which leave us tongue-tied and confused?

From the 'Sign of the Gene'...
I was first introduced to epidemiology in the mid-1980's. My recollection is that the go-to explanation for health differences between racial groups was that 'race' described biological distinction - that the environments of the various continents had 'bred' races of humans with differential susceptibility to disease. This go-to explanation was so ingrained that it was rarely stated explicitly. Implicitly, one message was that if racial difference reflects biologic difference, then an observed health disparity reflects something 'natural'. A racial disparity could be considered a 'risk factor', and be the basis for 'raising awareness', but would have little application in primary prevention (one would not 'prevent' someone from being one race or another).
The classic example of this was sickle cell anemia, usually quickly followed by cystic fibrosis, to demonstrate that every race had its unfortunate susceptibilities.

In the early-mid 1990's, running up to the sequencing of 'the' human genome, news stories hit hot and heavy linking any and all manner of diseases and even personality traits to genes. Almost all of these reports were not confirmed in replication studies, but one thing became increasingly clear: the genes that were implicated in diseases were never the same genes that had different racial distributions. And in that handful of cases where there was some overlap, like in HLA markers, nothing panned out in further study in a way that explained racial disparities in health.

...and racism...
Despite the complete lack of evidence for the genetic basis of disparate health outcomes, genetic origins continued to be (and continue to be) the go-to explanation for many people in medicine and public health.

Fortunately, I was taught epidemiology by Sally Zierler, who countered the 'biologic distinction' interpretation of observed racial disparities, and offered instead the interpretation that racial disparities in health could be attributed to the relative social standing of those groups. Implicit in that interpretation was that a racial disparity should not be seen as 'natural': it should shock the conscience. It also leads to very different prevention strategies. Racism itself should be the focus, and Sally got a lot of heat for promoting that viewpoint.
A terrific example of this way of thinking is the ground-breaking analysis in 1997 by James Collins and Richard David, who pit the assumption of genetic origins head-to-head against an alternative hypothesis: that something about living as Black in America, especially during childhood, was the cause of high rates of premature deliveries seen among African-American mothers.

...to 'health disparities'...
The early 2000's is when I'd say, based only on my gut, that the way we think of 'health disparities' today really blossomed. I'm going to try to do a historical word count type of analysis to check that gut assumption, but in the meantime, I think it's safe to assert that the dominant interpretation of 'health disparities' as reflecting social structure is a recent phenomenon.
The prevention lessons we draw from the 'health disparity' narrative today are pretty varied - access to care, cultural competency, and also the 'fundamental cause' people like me - racism itself, the rest is downstream of that...

Epidemiologic Transition
The epidemiologic transition refers to the shift in patterns of causes of death, from chiefly infectious diseases striking all ages (and especially those under 5 years old) to chronic diseases that are more restricted to older populations. Epidemiology itself also had a few major transitions, but a little out of sync with the shift in mortality patterns. After 60 years or so of social epidemiology, infectious disease epidemiology rose in stature in the early 20th century. Infectious diseases dropped dramatically during both phases, but once the shift to chronic diseases as the major killers was largely complete in the 1950's and 60's, a new epidemiology arose, a chronic disease epidemiology which stressed multiple risk factors rather than single bugs. So, I suspect that the transition to chronic disease epidemiology was not linked to the development of the health disparities narrative, but it was a pre-condition.

Civil Rights Movement
I think the civil rights movement probably played a big role in the development of the health disparities narrative, but not as directly as one might at first think. Landmark legislation in the mid-1960's led to the involvement of the courts in race relations by the 1970's in a new way. Rather than being limited to assessing discrimination in individual cases, a new statistical reasoning made its way into legal wranglings and regulatory frameworks: between affirmative action and desegregation orders, the quantification of inequity became a paramount consideration. My hunch is that these routinely quantified comparisons of racial groups played a big role in the development of the health disparities narrative. If one conceives of racial groups as being separate biological groups subject to evolutionary forces, then comparing racial groups to one another is like comparing apples to oranges - or really Granny Smiths to Cortlands. There are circumstances where comparisons make sense, but there is an assumption of difference built-in from the beginning. So I think the quantification of racial difference that the civil rights era ushered in certainly played a role, but other factors had to come into play as well.

Office of Management and Budget Standards for Data on Race and Ethnicity
As various Federal agencies struggled to enact regulations and enforce them, the fact that there is no agreed-upon definition of racial categories became clear. Rather than acknowledge that race is a complex characteristic, composed of many dimensions including self-identity and perception by others, the Federal Government tried to create a standardized set of categories that would be shared for all administrative purposes, with the Office of Management and Budget Directive #15 in 1977.
That may seem like an obscure bureaucratic detail, but the reason I connect it to the rise of the health disparities narrative is that by requiring governmental agencies to use the same five categories to describe race, data from multiple sources became comparable in a way that they had not been before. Death records could be matched to Census data (or at least appeared to be comparable), so race-specific rates were easier to calculate.
The use of these standardized categories was diffused throughout the government, and in particular required for research grants, including medical and public health research grants. As a result, not only was it possible to use comparable racial groups for comparisons, but the requirement that racial breakdowns be reported back to granting agencies implied an importance attached to race that encouraged researchers to analyze their results using racial categories as well. Steven Epstein has written a lot about that whole process.

Healthy People 2010
When I showed the charts below to Rachel, she made a great observation: that the rapid rise in the term 'health disparity' after 2000 is probably linked closely to the release of the Healthy People 2010 document, which had the secondary goal of 'eliminating health disparities'. Why did they use that phrase? When did they start using it? I've tried to find the exact date that this phrase entered the Healthy People documentation, but it'll take more research to nail that down.

In sum...
I suspect that the main shift in interpreting racial disparities in health has been from revealing inherent racial differences in biology to mirroring social structure. I'm not sure exactly when this happened, but my gut tells me that this shift happened in public health mostly in the late 1990's, early 2000's - certainly there were vanguards who foreshadowed this shift much earlier, and just as certainly there are laggards who have yet to embrace it. I'll be curious to see what text searching through publication databases reveals...

Addendum: here is a quick & dirty analysis - the proportion of articles indexed in WebOfScience.com with 'racial difference', 'racial disparity', or 'health disparity' in the topics field. 'Racial difference' (in purple) rises from 1991 to 2003, then plateaus or even drops in frequency. 'Racial disparity' (in green) was at low levels before 2001, then rises exponentially. 'Health disparity' (red) was virtually non-existent in articles published before 2000, then rises even more rapidly than 'racial disparity' as a topic term.


Addendum2: Another quick analysis of word counts in PubMed (which goes further back in time) shows identical patterns (unfortunately, I swapped the green and red in these two charts). It is interesting to note that there was a jump in articles using the phrase 'racial difference' in the mid-1970's, and potentially a second jump in the mid-1980's.

Thursday, February 21, 2013

Insight on Why Gay Marriage is Threatening to Some Christians

To me, one of the great mysteries is why so many people feel "threatened" by similar-gender marriage.
I think I get why some people are skeeved by (male) homosexuality - frankly for the same reason I was before I tried it - the "ick" factor of imagining sexual acts themselves in the abstract before you have any idea what they actually feel like.
So there's the "ick" factor, and its close kin, rank homophobia. And that's probably 80-90% of it right there.

I listen to certain religious right commentators every day - one might say religiously. In particular Bryan Fischer and Tony Perkins. In part because I want to know what they're talking about - they drive so much of the political and social opposition to me just having a normal day - so I want to know what's coming next from them. But I also listen to them because I'm curious, and I really do struggle to understand how they see the world.

I start from the supposition that people usually try to tell the truth (to the degree it is apparent to them), and that people at their base nature are good-hearted. I want to believe that these rabidly anti-gay commentators are honestly representing their perspective. Very often people like Tony and Bryan get written off as being cynical, dishonest, hypocritical. But I don't think that's the case. I actually think that they are giving a full-throated defense of their deeply-held beliefs. We've certainly seen plenty of cases (Larry Craig, Ted Haggard, Eddie Long, George Rekers...) of vehemently anti-gay men who turned out to be turning tricks.
But I think there are also plenty of people, like Tony and Bryan, who aren't hypocritical - they're just critical.

So, that's the crux (so to speak) of the mystery for me - how could a man who is heterosexual to the core, and who does not hate homosexuals, still feel so threatened by homosexuality? So threatened that he can't let a week go by without railing against it on a nationally syndicated radio program.

I finally had an insight about that. Unlike my assumption that people are basically good, and want to tell the truth, a key belief for many Christians is that we are born with a sinful nature, that without the restraints of morality, without the constraints of vows and pledges, we would naturally sin in any and potentially every way. In other words, without their faith and adherence to religious principles and practices, they would be unable to help themselves, and it would only be a matter of time before they finally got around to sinning in a homosexual fashion.

I know that sounds simple, and I can't believe it took me so long to figure it out. I have vague memories of a ninth grade teacher trying to explain "original sin" to me. It sounded like the weirdest work-around. In a lot of ways, listening to these guys is like being in a dream where you understand all the words someone is saying, but the meaning is absent. Except that I think I understand what they mean, but what they are really saying escapes me.

I'll keep listening - so you don't have to.

Friday, January 25, 2013

Are You Prepared for a Nuclear Detonation?

I'm not. And I doubt more than a handful of people in the US are. And I'm not sure whether that's a good thing or not. On the one hand, I don't want to be alarmist. On the other, if we as a nation value our nuclear weapons so highly that we won't get rid of them, can we reasonably expect that nobody will ever try to use them against us?

When the debate about the Seabrook nuclear power plant was happening, I was of an impressionable age. I lived in Amherst, New Hampshire, and Groton, Massachusetts, just across the border from one another. The gym at my school was labelled as a "fallout shelter" and looked like a bunker. I read a few books about the Manhattan Project and its terrible debut in Hiroshima. I was fascinated by the science and the scientists, and also by the majestic horror that they unleashed. I had dreams about running to that shelter and what might happen as we waited days and weeks for help to arrive.

Unlike people a few years older than me, I had never undergone a 'duck and cover' drill. By that point, they were mocked as silly - inducing unnecessary nightmares in our youth - not to mention useless in the face of a bomb capable of incinerating an entire city. I took some comfort in the fact that Fort Devens was in the next town over. In the event of an all-out exchange, I prayed we wouldn't have a chance of survival.

And yet, we as a nation still maintain an enormous arsenal of nuclear weapons, and devote considerable resources to them. Why? I'll leave that to others for the moment. A great discussion can be found on KQED's Forum here.

Although it has gone out of fashion to speculate about nuclear weapons use, I would argue that it is foolish to be as ill-prepared for it as we are. Unlike the nightmares of my youth, the prospect of an all-out exchange between the US and the Soviet Union has passed. The prospect of an all-out exchange between the US and any current nation is exceedingly remote. But with all the crazy people in this world, the more likely scenario is that someone, somewhere, will be able to put late-1930's technology together with evil intent to deliver a rudimentary nuclear weapon to our doorstep.

A Dirty Little Secret
Nuclear weapons are more survivable than we imagine. The 'duck and cover' drills of the 1950's and 1960's may have indeed been silly. But I suspect that the real reason they went out of fashion isn't because they were futile, or terrifying, but because continuing them made the idea of using nuclear weapons seem plausible. By preparing for nuclear war, we were countenancing the possibility that it might happen, and nobody wanted that. In that light, I want to be clear that claiming nuclear weapons are more survivable than we imagine should in no way make them easier to use.
In college, I read about an epidemiologic follow-up of survivors of the atomic bombings of Hiroshima and Nagasaki, to trace the longer term effects of nuclear weapons exposure (and from a more cynical perspective, to help set regulatory guidelines as to acceptable levels of x-ray exposures in the US). When I taught my epi classes at SFSU, I incorporated this article into teaching about cohort study design.
One thing that was shocking about this study was how many survivors there were. I don't in any way want to minimize the number of deaths, but it was shocking to me to learn that some people who were within 1 kilometer of the blast center survived, and relatively few people 10 kilometers away from the centers of the blasts were killed. Not only that, but the study treated these people beyond 10 kilometers as the 'unexposed cohort' - that the radiation dose one received from the initial blast at a distance of 10 kilometers was not much higher than background. Or in other words, if a Hiroshima-sized bomb went off over the Transamerica building in downtown San Francisco, we in our classroom at SFSU would be considered 'unexposed' in that study design.

Back during the debate about opening the Seabrook plant, there was a lot of scare-mongering using mushroom clouds to illustrate the risk of a nuclear meltdown. But a nuclear power plant can't explode, so I think that hyperbolic representation may have really undercut the cause. What nuclear weapons do share with nuclear power plants in terms of risk is fallout. That is, a nuclear power plant is not going to blow up and cause the explosive damage of a bomb, but both a bomb and a plant have the potential to release a lot of fine dust particles carrying radioactive elements over a relatively wide area. That's the scary part.

The hopeful part is that with a bit of preparation, you can offer yourself a great deal of protection from the fallout. First, close your windows and seal them up with plastic. I know, it sounds silly, but the biggest danger fallout presents is if you breathe it in or swallow it. Your skin is pretty good at dealing with two types of radiation (alpha and beta), but your lungs and stomach are very susceptible. That's because we are constantly bombarded with radiation from the sun, so we've evolved pretty good external defenses. The third type of radiation, gamma radiation, gets less harmful the farther you are away from it. As an analogy, if you hold a light bulb up to your face, it is blinding.  In the ceiling, it provides a nice glow, but the light from that bulb is practically useless if you are out in the driveway hunting for a dropped wallet.
So, making your home as air tight as possible makes it harder to breathe in or swallow fallout particles, and by keeping them outside the home, it keeps you farther from the gamma radiation.
Similarly, you want to stay in the middle of the house, away from ground level (where the fallout settles), but also not too near the roof (because it falls there too). And, if you can surround yourself with stuff, even better, because stuff absorbs radiation. Water is a great radiation-absorber, but books, even blankets, will help a little bit. You got a water bed? Awesome place to crash.
You may think that hunkering down in the center of your plastic-wrapped house, surrounded by buckets of water may not be the ideal way to spend the rest of your life, and you'd be right. But there is a saving grace - called "half-life". Radioactive atoms can release radiation at any moment, but on average, half of them will "go off" within a set amount of time. And because a lot of radioactive fallout elements have a short half-life, it is estimated that the danger of fallout is reduced about 90% within 3 days, and well over 99% within 3 weeks. So, even after only three days of hunkering down, the risks from fallout are considerably lower.
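As a rough illustration of what "half-life" means, here is a minimal Python sketch of decay for a single isotope. It is not a fallout model: real fallout is a mix of many isotopes with very different half-lives, so this does not reproduce the 3-day and 3-week figures above. The function name is hypothetical, and iodine-131 (half-life of roughly 8 days) is used only as a familiar example.

```python
def fraction_remaining(days_elapsed, half_life_days):
    """Fraction of a single radioactive isotope left after `days_elapsed`."""
    return 0.5 ** (days_elapsed / half_life_days)

IODINE_131_HALF_LIFE_DAYS = 8.0  # approximate value, used for illustration

for days in (8, 16, 24, 80):
    left = fraction_remaining(days, IODINE_131_HALF_LIFE_DAYS)
    print(f"after {days:>2} days: {left:.2%} of the iodine-131 remains")
```

Each half-life that passes cuts what is left in half, so after ten half-lives less than a tenth of a percent of the original isotope remains.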

The other thing you'll want to do is pray for rain. Rain is really effective at pulling any remaining fallout out of the air (so it will be harder to breathe it in), and also does a decent job at washing the radioactive dust off your roof, off the sidewalk, and either into the sewer, or down into the ground a bit. And fallout in the ground is a lot less dangerous than fallout on the ground (imagine taking an x-ray with even a quarter-inch of soil between you and the camera). At that point, the major concern would be from radioactive elements (like iodine-131) that get absorbed from the ground into crops that you (or your cows) eat.

Hope that helps you sleep better tonight....

Monday, November 12, 2012

Minnesota Precinct-Level Marriage Vote Map

On November 6, the voters of Minnesota rejected a proposed amendment to their state Constitution:
"Only a union of one man and one woman shall be valid or recognized as a marriage in Minnesota." It got 48% support, but that support is not at all evenly spread across the state.

Red is in favor of the amendment, green opposed.

The overall trend is that the lowest levels of support were in Minneapolis/Saint Paul, with growing support further from the capital. It also looks like support for the amendment tended to be a bit lower near the lakes than in land-locked rural areas.

And yeah, it was a lot of work to put this together.

Sunday, November 4, 2012

41.421356%

So this morning while I was walking the dogs, I was thinking about exposure categorization. When your exposure is continuous (i.e. could be a little higher, a little lower, a lot higher, or anywhere in-between), and you prefer categorical analysis (as I do), then it is always arbitrary where you cut the exposure into different levels. You may have a good rationale for choosing a specific method, but it is always a decision you need to make, explicitly.

At any rate, one of the things I like to do is break my exposure up into three or four categories, to get a sense of the consistency of whether there is a dose-response happening (i.e. more exposure->more disease). And when there's no good reason to pick any particular cut-offs, one of the standard things we do is to cut the exposure into thirds - that is, one third of the sample becomes the lowest exposure (reference group), one third becomes the middle exposure group, and one third becomes the higher exposed group. And then you compare the middle group to the reference group and the higher exposure group to the reference group.
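For readers who want to see the standard cut-into-thirds step concretely, here is a minimal Python sketch. The exposure values are made up, the function name is hypothetical, and the cut-points come from the standard library's `statistics.quantiles` with its default method; other software may place the cut-points slightly differently.

```python
import statistics

def tertile_groups(exposures):
    """Label each exposure 0 (lowest third, the reference group),
    1 (middle third), or 2 (highest third)."""
    low_cut, high_cut = statistics.quantiles(exposures, n=3)
    return [0 if x <= low_cut else 1 if x <= high_cut else 2 for x in exposures]

exposures = [2.1, 0.4, 5.6, 3.3, 1.8, 4.9, 0.9, 6.2, 2.7]  # made-up values
groups = tertile_groups(exposures)
print(groups)
print([groups.count(level) for level in (0, 1, 2)])  # size of each group
```

The middle group (1) and the higher group (2) would then each be compared against the reference group (0).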

But as I was walking, it occurred to me that that's not the most efficient possible way to break things into three pieces, statistically speaking. And that's because the reference group is in two comparisons, while the middle and higher exposure groups are each in only one comparison. So, if one could have a slightly larger reference group, then you would get more statistical power, even if there were fewer people in the other two groups.

Off the cuff, I guessed that if you chose 40% to be the reference group, and 30% each for the middle and higher exposure groups, that would probably be a bit more efficient.

So, when I got home, I tried out some ideas. The main thing I was looking for was to get the confidence limits around the two comparisons as small as possible. In order to test that out in a particular (purely theoretical) example, I assumed that I was trying to estimate the difference between proportions, so the standard errors would be simple to calculate, and then I made another assumption, that the "event rate" was identical in all three groups (that is, there is no dose-response whatsoever). That's not really the assumption I want to make, but it's a simple starting point to work from.
Then, I calculated the standard errors using equal thirds as the cut-points, and then again using 40% for the reference group and 30% for each of the other two, and voila, the 40%:30%:30% split did have smaller standard errors (red line below) than the 33%:33%:33% one did (blue line below). It doesn't look much different, but when you're trying to squeeze the maximum statistical power out of the data you've got, this would be a cheap & simple way to do it.
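Here is a minimal sketch of that comparison in Python, under the same simplifying assumptions (difference between proportions, identical event rate in every group). The total sample size of 3000 and the event rate of 0.2 are made-up illustration values, and `se_diff` is a hypothetical helper name, not something from the original calculation.

```python
import math

def se_diff(ref_frac, cmp_frac, n=3000, rate=0.2):
    """Standard error of the difference in proportions between the
    reference group and one comparison group, assuming the same event
    rate in both groups (n and rate are illustrative assumptions)."""
    n_ref = ref_frac * n   # people in the reference group
    n_cmp = cmp_frac * n   # people in one comparison group
    return math.sqrt(rate * (1 - rate) * (1 / n_ref + 1 / n_cmp))

se_thirds = se_diff(1 / 3, 1 / 3)   # 33:33:33 split
se_40_30 = se_diff(0.40, 0.30)      # 40:30:30 split

print(f"33:33:33 -> SE = {se_thirds:.5f}")
print(f"40:30:30 -> SE = {se_40_30:.5f}")
print(f"ratio of SEs   = {se_40_30 / se_thirds:.4f}")
```

With these assumed inputs the 40:30:30 standard error comes out a little over 1% smaller than the equal-thirds one. Both comparisons (middle vs. reference, higher vs. reference) have the same standard error here because the two comparison groups are the same size.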
And then I got to thinking, if 40:30:30 is better than 33:33:33, then what is the optimum size for the referent group, in this example? After a bit of futzing around, I figured out that it is about 41.421356%, leaving 29.289322% for each of the comparison groups. That's the green line below - imperceptibly more efficient than 40:30:30.
For a four-group categorization, the optimal size for the reference group is 36.60254%, with 21.132487% in each of the three comparison groups.
For a five-group categorization, the optimal reference group size is exactly 1/3, with 1/6 in each of the other four groups, and for an eight-category breakdown (I can't say I recommend splitting so finely), the optimal reference group would be 27.429189%, with 10.36725871% in each of the other 7 groups.
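The post doesn't say how these optima were found, but one way to get them is in closed form. Under the equal-event-rate assumption, the variance of each comparison is proportional to 1/p + 1/q, where p is the reference fraction and q = (1 - p)/(k - 1) is the fraction in each of the other k - 1 groups. Setting the derivative with respect to p to zero gives p = 1/(1 + sqrt(k - 1)). The sketch below just evaluates that formula; `optimal_split` is a hypothetical name used for illustration.

```python
import math

def optimal_split(k):
    """Return (reference fraction, fraction per comparison group) that
    minimizes 1/p + 1/q with q = (1 - p) / (k - 1), for k groups.
    Assumes the same event rate in every group (no dose-response)."""
    if k < 2:
        raise ValueError("need at least two groups")
    p = 1 / (1 + math.sqrt(k - 1))   # reference group fraction
    q = (1 - p) / (k - 1)            # each comparison group's fraction
    return p, q

for k in (3, 4, 5, 8):
    p, q = optimal_split(k)
    print(f"{k} groups: reference {100 * p:.6f}%, each other group {100 * q:.6f}%")
```

For three groups the formula reduces to sqrt(2) - 1, and for five groups it gives exactly 1/3 and 1/6. None of this carries over if there is a dose-response, since the event rates (and so the variances) would then differ by group.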
I probably won't pursue this any further because I see stats as the means to the end, and not super interesting in themselves.
If there is a dose-response, these calculations get a bit more complex, and depend on how much of a dose response, the distribution of the exposure, and so on. My guess is that in that case, the optimal size for the reference category would be a bit larger, and there might even be a bit of efficiency gain by making the middle exposure group a tiny bit larger than the higher exposure group.
But enough with the navel-gazing. Time to get back to my paper on how segregation affects how likely one is to experience racially discriminatory events...

Thursday, November 1, 2012

Torture and Truth: Metaphors of Data Analysis

Francis Bacon is credited, rightly or wrongly, with a major turning point in the scientific method - an insistence on empirical, observable evidence, as opposed to reasoning from first principles. He also served in very prominent positions in English politics, and my down-the-hall neighbor, Carolyn Merchant, has done some terrific work tying together his politics and his science, through a lens of how the man thought of women, including the ultimate in feminine mystique - Nature herself. I'm at great risk of mischaracterizing her work, but I'll do my best.
In his writings (in Latin), Bacon frequently used the verb 'vexare' to describe the methods by which Truth could be extracted from Nature. And Carolyn's work shows that how one translates 'vexare' has quite profound implications.
Most modern translations render 'vexare' as "to vex", which sounds direct, but according to Carolyn, his meaning was probably closer to another interpretation, "to torture", and several of his early translators in fact rendered 'vexare' as "torture". At the risk of ridiculous oversimplification, did Francis Bacon see the way to provoke the Truth from Nature by vexing her, or by torturing her? Did he imagine Nature giving up her secrets because he, the scientist, had devised a method of constraining her wild unpredictability into a stress position that required her to give up the answer?
Because in Bacon's day, and in Bacon's own mind, torture was seen as a valid method used to get the Truth. We now know that torture does nothing of the kind - it causes the tortured to say whatever they think the torturer wants to hear.

Well, the reason I bring all this up is that at the APHA conference, I saw some results that looked as though Data herself had been tortured more than interrogated by the analytic methods applied to her. In contrast, I also saw lots of evidence of researchers who had sat down with Data, asked her some questions, and got some answers they didn't expect. Rather than ignoring her, or turning the screws to get her to change her tune, they listened carefully to what Data had to say. The mark of a superb scientist, I think, is knowing the line between interrogation and torture - and figuring out when the answer to a research question should be believed, when it should be ignored, and when one needs to change one's own understanding of the world, especially when the answers contradict what we had hoped to hear.
There is an opposing problem as well - very often we get an answer that is so in-line with our pre-conceptions that we run off to publish without taking the time to check and re-check whether that answer is valid. In other words, our interrogation techniques need not be harsh, but we do need due diligence.

I wish I could say that I never torture Data, that when she speaks, I listen. But the reality is that I often have a very strong pre-conception of what Data should say, and when I don't get the answer I want to hear, my first reaction is to wonder - did I hear her correctly? (i.e. was there mis-coding, or a programming error that substituted the unexpected answer for her true response?) My second reaction is that maybe she misunderstood what I meant to ask, so I ask the question again using different phrasing (use a linear rather than a logistic model; re-classify the exposure cut-points or the outcome characterization; include a different set of control variables, etc.). These methods are usually not torture - they are reasonable reactions to past experiences where I have made programming errors, where classification matters a great deal, where omission of a key control variable does result in misleading results. But crossing the line to torture at this second stage is far too easy to justify, especially when I have a lot invested (in reputation, world-view, justifying how grant money was spent, etc.) in getting the answers I want to hear. It is easy at this stage to try out a variety of techniques to transform the answer I don't want to hear into one that I do.

My third reaction to pesky Data is to say she's wack. Maybe Data got high before being dragged into the interrogation room and is just giving weird answers to satisfy her own impenetrable sense of humor. Or in other words, are there sampling errors, and/or systematic biases in the data that generate unreliable results?
It is only after many attempts, in many ways, to discount results I don't want to hear, that I take seriously the idea that I may have the wrong idea, that there is a completely different narrative that Data wants to tell. I will have wondered from the start what might explain contrary findings, but I won't replace my pre-conceptions until I'm utterly convinced that I've gotten it wrong. And I think that's the right approach - usually I do ask the wrong question, or if I ask the right question, I might well ask a dataset that is not well-equipped to give the right answer. But every once in a while, I listen carefully, and I hear a story that's much more interesting than the one I had in my head from the beginning. And those stories I don't want to hear - turns out they have happy endings too.

OK, I'll admit it, that last line is pure schmaltz. You got a better way to wrap this ramble up with a tidy bow?