At any rate, one of the things I like to do is break my exposure up into three or four categories, to get a sense of whether there is a consistent dose-response (i.e. more exposure -> more disease). And when there's no good reason to pick any particular cut-offs, one of the standard things we do is to cut the exposure into thirds - that is, one third of the sample becomes the lowest exposure (reference group), one third becomes the middle exposure group, and one third becomes the higher exposed group. And then you compare the middle group to the reference group and the higher exposure group to the reference group.
But as I was walking, it occurred to me that that's not the most efficient possible way to break things into three pieces, statistically speaking. And that's because the reference group appears in two comparisons, while the middle and higher exposure groups each appear in only one. So, if one could have a slightly larger reference group, then you would get more statistical power, even if there were fewer people in the other two groups.
Off the cuff, I guessed that if you chose 40% to be the reference group, and 30% each for the middle and higher exposure groups, that would probably be a bit more efficient.
So, when I got home, I tried out some ideas. The main thing I was looking for was to get the confidence limits around the two comparisons as small as possible. In order to test that out in a particular (purely theoretical) example, I assumed that I was trying to estimate the difference between proportions, so the standard errors would be simple to calculate, and then I made another assumption, that the "event rate" was identical in all three groups (that is, there is no dose-response whatsoever). That's not really the assumption I want to make, but it's a simple starting point to work from.
Then, I calculated the standard errors using the one-third/one-third/one-third cut-points, and then again using 40% for the reference group and 30% for each of the other two, and voila, the 40%:30%:30% splits did have smaller standard errors (red line below) than the 33%:33%:33% ones did (blue line below). It doesn't look much different, but when you're trying to squeeze the maximum statistical power out of the data you've got, this would be a cheap & simple way to do something.
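The calculation above is easy to reproduce. Here's a minimal sketch in Python, using a hypothetical sample of 1,000 people and an assumed common event rate of 20% in every group (the numbers are just for illustration; any choices show the same pattern):

```python
import math

def se_diff(p, n_ref, n_comp):
    """Standard error of the difference in proportions between one
    comparison group and the reference group, assuming the same
    event rate p in both (the no-dose-response assumption)."""
    return math.sqrt(p * (1 - p) / n_ref + p * (1 - p) / n_comp)

# Hypothetical example: 1,000 subjects, 20% event rate in all groups.
n, p = 1000, 0.20

equal = se_diff(p, n_ref=n / 3, n_comp=n / 3)          # 33%:33%:33% split
unequal = se_diff(p, n_ref=0.40 * n, n_comp=0.30 * n)  # 40%:30%:30% split

print(f"equal thirds: {equal:.5f}, 40/30/30: {unequal:.5f}")
```

Each reference-vs-comparison contrast comes out with a slightly smaller standard error under the 40%:30%:30% split, which is the whole effect: a modest, free gain in precision from how you draw the cut-points.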
For a four-group categorization, the optimal size for the reference group is 36.60254%, with 21.1324867% in each of the three comparison groups.
For a five-group categorization, the optimal reference group size is exactly 1/3, with 1/6 in each of the other four groups, and for an eight-category breakdown (I can't say I recommend splitting so finely), the optimal reference group would be 27.429189%, with 10.36725871% in each of the other 7 groups.
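All of those numbers follow one closed form: with k comparison groups, minimizing the summed variance of the k contrasts (under the same equal-event-rate assumption) gives a reference fraction of 1 / (1 + sqrt(k)), with the rest split evenly. A small sketch that reproduces the figures above:

```python
import math

def optimal_split(k):
    """Optimal group fractions when comparing k groups against one
    reference, minimizing the total variance of all k contrasts
    (assumes a common event rate across groups)."""
    ref = 1 / (1 + math.sqrt(k))   # reference-group fraction
    comp = (1 - ref) / k           # fraction for each comparison group
    return ref, comp

for k in (2, 3, 4, 7):
    ref, comp = optimal_split(k)
    print(f"{k + 1} groups: reference {ref:.5%}, each comparison {comp:.5%}")
```

For k = 2 this gives roughly a 41.4%:29.3%:29.3% split, so my off-the-cuff 40/30/30 guess was close to, but not exactly, the optimum.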
I probably won't pursue this any further because I see stats as a means to an end, and not super interesting in themselves.
If there is a dose-response, these calculations get a bit more complex, and depend on how strong the dose-response is, the distribution of the exposure, and so on. My guess is that in that case, the optimal size for the reference category would be a bit larger, and there might even be a small efficiency gain from making the middle exposure group a tiny bit larger than the higher exposure group.
But enough with the navel-gazing. Time to get back to my paper on how segregation affects how likely one is to experience racially discriminatory events...