Coding Schemes for Categorical Variables
The numerical coding of categorical variables plays a major role in their interpretation. Yet the existence of different coding schemes is rarely discussed in introductory courses, much less the effect the choice of coding has on the analysis. In the brain and behavioural sciences, this may be an artefact of the ANOVA tradition, but with the recent move towards mixed-effects models and other explicit forms of regression, it makes sense to consider the role of coding schemes. In the following, we will consider a few of the most common coding schemes via a carefully worked example. Other discussions of the numerical and practical aspects of coding are available online (e.g. the meaning and mechanics of doing it in R, or Dale Barr’s discussion of the difference between simple and main effects).
An Example
In the following, we will consider a set of hypothetical categorical manipulations in a language experiment: animacy and morphology. We can think of animacy as having two levels: “animate” and “inanimate”. We will consider “inanimate” the baseline (control or reference) case and “animate” the deviating (treatment or marked) case. We can think of morphology as having three levels: “explicitly marked nominative”, “ambiguous”, “explicitly marked accusative”. For convenience and brevity, we will leave out the “explicitly marked” in the following text. We will consider “ambiguous” the baseline (control or reference) case and “nominative” and “accusative” the deviating (treatment or marked) cases.
In general, a categorical variable with \(n\) levels requires \(n-1\) numerical variables to properly encode it. So for our animacy manipulation, our linear model would look something like:
\[Y = \beta_{0} + \beta_{1}X_1 + \varepsilon\]
where \(Y\) is our response variable (mean EEG, response time, ratings, etc.), \(\beta_0\) is the intercept, \(\beta_1\) is the slope associated with \(X_1\), which encodes animate/inanimate, and \(\varepsilon\) is the residual error term. (For simplicity, we focus on fixed-effects regression, but the same remarks hold in mixed-effects regression; there are simply additional per-grouping “error” terms.)
Similarly, for the morphology manipulation, we have
\[Y = \beta_{0} + \beta_{1}X_1 + \beta_{2}X_2 + \varepsilon\]
where the three levels of morphology are encoded jointly by \(X_1\) and \(X_2\). To simplify things a bit, we will omit the error term \(\varepsilon\) in the following.
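If you work with R, you can see the \(n-1\) rule directly in the design matrix that model.matrix() constructs. Here is a minimal sketch (the variable names are our own invention):

```r
## A minimal sketch in R; the variable names are invented for illustration.
animacy    <- factor(c("inanimate", "animate"),
                     levels = c("inanimate", "animate"))
morphology <- factor(c("ambiguous", "nominative", "accusative"),
                     levels = c("ambiguous", "nominative", "accusative"))

## n levels require n - 1 coding columns (plus the intercept column):
model.matrix(~ animacy)     # intercept + 1 coding column
model.matrix(~ morphology)  # intercept + 2 coding columns
```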
Simple Regression
Dummy coding
In dummy coding, the coding variables represent binary true/false values for the deviation cases. The reference level is encoded as not being any of the other levels, i.e. as all zeros. For animacy, we would thus have \(X_1 = 1\) for “animate” and \(X_1 = 0\) for “inanimate”, which is typically presented as a contrast matrix thus:
level | \(X_1\) |
---|---|
animate | 1 |
inanimate | 0 |
For morphology, we would have
level | \(X_1\) | \(X_2\) |
---|---|---|
nominative | 1 | 0 |
accusative | 0 | 1 |
ambiguous | 0 | 0 |
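R’s built-in contr.treatment() generates exactly these matrices, modulo row order – R puts the reference (all-zero) level in the first row. Continuing the sketch from above:

```r
## Dummy (treatment) coding is R's default for unordered factors.
contr.treatment(2)     # the animacy matrix; reference level in the first row
contr.treatment(3)     # the morphology matrix, reference level first
contrasts(morphology)  # "ambiguous" (the first level) is the reference
```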
Filling in the contrast levels, we have the following equation for the animate condition: \[ Y_\text{animate} = \beta_0 + \beta_1 (1)\] and for the inanimate condition: \[ Y_\text{inanimate} = \beta_0 + \beta_1 (0) = \beta_0\]
Thus, \(\beta_0\) encodes the expected response for the reference level, while \(\beta_1\) encodes the difference between the reference level and the other level. For this reason, dummy coding is also called “treatment coding”, because it can be easily used to model and calculate the difference between a control and a treatment group.
For the morphology manipulation with three levels, things are a little bit more complex, but the same general pattern holds. For nominative, we have: \[ Y_\text{nominative} = \beta_0 + \beta_1 (1) + \beta_2 (0) = \beta_0 + \beta_1\] For accusative we have: \[ Y_\text{accusative} = \beta_0 + \beta_1 (0) + \beta_2 (1) = \beta_0 + \beta_2\] And for ambiguous, we have: \[ Y_\text{ambiguous} = \beta_0 + \beta_1 (0) + \beta_2 (0) = \beta_0\] Again, the intercept encodes the response for the reference level, and each of the other terms encodes an individual contrast (\(X_1\) = nominative > ambiguous, and \(X_2\) = accusative > ambiguous).
In brief: the intercept is the reference level, and the Wald \(t\)-test for this coefficient tests the hypothesis that the reference level is significantly different from zero. The other coefficients are the differences from the reference level, and the Wald \(t\)-tests test whether these levels significantly differ from the reference level.
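A quick simulation makes this concrete. This is only a sketch – the condition means (0, 2, 3) are invented for illustration:

```r
## Hypothetical morphology data with invented condition means 0, 2, 3.
set.seed(1)
morph <- factor(rep(c("ambiguous", "nominative", "accusative"), each = 50),
                levels = c("ambiguous", "nominative", "accusative"))
means <- c(ambiguous = 0, nominative = 2, accusative = 3)
y_morph <- means[as.character(morph)] + rnorm(150)

fit_dummy <- lm(y_morph ~ morph)  # treatment coding is the default
coef(fit_dummy)
## (Intercept)     ~ mean of "ambiguous" (the reference level), about 0
## morphnominative ~ nominative minus ambiguous, about 2
## morphaccusative ~ accusative minus ambiguous, about 3
```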
Sum coding
Sum coding follows the same general pattern as treatment coding, but the reference level receives all -1’s instead of all zeros. In the case of animacy, this yields the following contrast matrix:
level | \(X_1\) |
---|---|
animate | 1 |
inanimate | -1 |
For morphology, we have the following contrast matrix:
level | \(X_1\) | \(X_2\) |
---|---|---|
nominative | 1 | 0 |
accusative | 0 | 1 |
ambiguous | -1 | -1 |
Although this is quite similar to dummy coding on the surface, this small difference has a big impact.
Filling in the contrast levels, we have the following equation for the animate condition: \[ Y_\text{animate} = \beta_0 + \beta_1 (1)\] and for the inanimate condition: \[ Y_\text{inanimate} = \beta_0 + \beta_1 (-1) = \beta_0 - \beta_1\]
It’s not immediately obvious what the intercept \(\beta_0\) encodes. But a little bit of algebra reveals the truth (quote me on that!). If we average the response for animate and inanimate, we have:
\[ \frac{Y_\text{animate} + Y_\text{inanimate}}{2} = \frac{\left(\beta_0 + \beta_1\right) + \left(\beta_0 - \beta_1\right)}{2} = \frac{2\beta_0}{2} = \beta_0\]
And thus, the intercept represents the mean of the conditions. The slope \(\beta_1\) represents the deviation of each condition from that mean. (Sum coding is also sometimes called deviation coding for this reason, but “deviation coding” can also refer to a particular variant; see below.) Because only two levels are encoded in the variable \(X_1\), the mean falls exactly at the midpoint between the two conditions. In other words, \(\beta_1\) is half the difference between conditions, so to compute the difference between conditions, we need to double \(\beta_1\). We can also see this more directly: \[ Y_\text{animate} - Y_\text{inanimate} = \left(\beta_0 + \beta_1\right) - \left( \beta_0 - \beta_1 \right) = \beta_0 + \beta_1 - \beta_0 + \beta_1 = 2\beta_1\]
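We can verify the doubling numerically with a small simulation (again only a sketch, with an invented true difference of 1):

```r
## Hypothetical animacy data: true animate - inanimate difference of 1.
set.seed(2)
anim <- factor(rep(c("animate", "inanimate"), each = 50),
               levels = c("animate", "inanimate"))
contrasts(anim) <- contr.sum(2)  # animate = 1, inanimate = -1
y_anim <- (anim == "animate") + rnorm(100)

fit_anim <- lm(y_anim ~ anim)
coef(fit_anim)[1]      # intercept: the midpoint, about 0.5
2 * coef(fit_anim)[2]  # doubled slope: the difference, about 1
```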
For the three-level case, things work similarly, but there are a few extra complexities. For nominative, we have: \[ Y_\text{nominative} = \beta_0 + \beta_1 (1) + \beta_2 (0) = \beta_0 + \beta_1 \] For accusative we have: \[ Y_\text{accusative} = \beta_0 + \beta_1 (0) + \beta_2 (1) = \beta_0 + \beta_2 \] And for ambiguous, we have: \[ Y_\text{ambiguous} = \beta_0 + \beta_1 (-1) + \beta_2 (-1) = \beta_0 - \beta_1 - \beta_2 \]
Again, the intercept \(\beta_0\) is the mean:
\[ \frac{Y_\text{nominative} + Y_\text{accusative} + Y_\text{ambiguous}}{3} = \frac{\left(\beta_0 + \beta_1\right) + \left(\beta_0 + \beta_2\right) + \left(\beta_0 -\beta_1 - \beta_2\right)}{3} = \frac{3\beta_0}{3} = \beta_0\]
And the other coefficients represent deviations from the mean: \(\beta_1\) = nominative > mean, \(\beta_2\) = accusative > mean. For the reference level, the formula is slightly more complicated; the deviation from the mean is \(-(\beta_1 + \beta_2)\). We can also calculate formulae for direct pairwise comparisons, but generally it is better to use other methods (such as least-squares means), which take into account the associated error.
In brief: the intercept is the mean, and the Wald \(t\)-test for this coefficient tests the hypothesis that the mean is significantly different from zero. The other coefficients are the differences from the mean, and the Wald \(t\)-tests test whether these levels significantly differ from the mean.
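In R, the same pattern looks like this, continuing the morphology simulation from above. The commented line assumes the emmeans package for the pairwise comparisons mentioned earlier:

```r
## Reorder so "ambiguous" is the last level: contr.sum assigns the -1's
## to the last level, matching the contrast matrix above.
morph_sum <- factor(morph, levels = c("nominative", "accusative", "ambiguous"))
contrasts(morph_sum) <- contr.sum(3)
fit_sum <- lm(y_morph ~ morph_sum)
coef(fit_sum)
## (Intercept) ~ mean of the condition means, about (0 + 2 + 3) / 3
## morph_sum1  ~ nominative minus the mean
## morph_sum2  ~ accusative minus the mean

## Pairwise comparisons with proper error estimates (assuming emmeans):
# emmeans::emmeans(fit_sum, pairwise ~ morph_sum)
```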
Deviation Coding
As mentioned previously, sum coding is sometimes referred to as deviation coding, but deviation coding is also sometimes used in a narrower sense to refer to a variant of sum coding, where ±0.5 is used instead of ±1. So for animacy, we would have:
level | \(X_1\) |
---|---|
animate | 0.5 |
inanimate | -0.5 |
and for morphology, we would have:
level | \(X_1\) | \(X_2\) |
---|---|---|
nominative | 0.5 | 0 |
accusative | 0 | 0.5 |
ambiguous | -0.5 | -0.5 |
The advantage of this scheme is that the distance between the two levels is 1 instead of 2, so the slope directly encodes the difference between conditions. More concretely, for the contrast animate > inanimate, we have:
\[ Y_\text{animate} - Y_\text{inanimate} = \left(\beta_0 + 0.5 \beta_1\right) - \left( \beta_0 - 0.5 \beta_1 \right) = \beta_0 + 0.5 \beta_1 - \beta_0 + 0.5 \beta_1 = \beta_1\]
For three levels, this variation does not really provide any additional convenience.
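In R, this variant is simply the sum-coding matrix divided by two. Continuing the animacy simulation from above:

```r
## Deviation (±0.5) coding: sum coding scaled by a half.
contrasts(anim) <- contr.sum(2) / 2  # animate = 0.5, inanimate = -0.5
fit_dev <- lm(y_anim ~ anim)
coef(fit_dev)[2]  # now directly the animate - inanimate difference, about 1
```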
Multiple Regression and Interactions
For multiple regression, the choice of coding scheme introduces an additional subtlety: it changes the meaning of all but the highest-order interaction. For now, we will focus on a 2 x 2 interaction, e.g. the animacy of the syntactic subject and the expected animacy (“biologicalness”) of the verb. For higher-order interactions and interactions between factors with more levels, things get even more complicated, and it is perhaps best to plot the data and get a feel for them that way, rather than examine individual coefficients. For significance tests, a test of linear hypotheses using Wald \(\chi^2\) or Wald \(F\) tests (preferably Type-II!) offers the most straightforward possibility.
For the subject-animacy by verbal-animacy manipulation, we have the following four conditions:
noun | verb | interaction |
---|---|---|
animate | animate | match |
inanimate | animate | mismatch |
animate | inanimate | mismatch |
inanimate | inanimate | match |
For four conditions, we need \(4-1=3\) numerical variables for the coding, and our linear model looks like this:
\[Y = \beta_{0} + \beta_{1}X_1 + \beta_{2}X_2 + \beta_{1,2}X_{1}X_{2} + \varepsilon\]
As above, we will omit the error term in our subsequent considerations.
For this particular case, we rename our coefficients more intuitively:
\[Y = \beta_{\text{int}} + \beta_{\text{noun}}X_{\text{noun}} + \beta_{\text{verb}}X_{\text{verb}} + \beta_{\text{noun},\text{verb}}X_{\text{noun}}X_{\text{verb}} \]
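To make the following algebra concrete, here is a small simulated 2 x 2 dataset that we will refit under both coding schemes. It is only a sketch; the effect sizes are invented:

```r
## Hypothetical 2x2 data with invented effects: +1 for an animate noun,
## +0.5 for an animate verb, and +2 whenever noun and verb match.
set.seed(3)
d <- expand.grid(noun = c("inanimate", "animate"),
                 verb = c("inanimate", "animate"),
                 rep  = 1:25)
d$noun <- factor(d$noun, levels = c("inanimate", "animate"))
d$verb <- factor(d$verb, levels = c("inanimate", "animate"))
d$y <- (d$noun == "animate") + 0.5 * (d$verb == "animate") +
  2 * (d$noun == d$verb) + rnorm(nrow(d))
```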
Dummy coding
Using dummy coding, we again have the following contrast matrix for both the noun and the verb:
level | \(X\) |
---|---|
animate | 1 |
inanimate | 0 |
Filling in the contrast levels, we have the following equation for the animate-animate match condition: \[ Y_\text{animate,animate} = \beta_{\text{int}} + \beta_{\text{noun}}(1) + \beta_{\text{verb}}(1) + \beta_{\text{noun},\text{verb}}(1)(1) = \beta_{\text{int}} + \beta_{\text{noun}} + \beta_{\text{verb}}+ \beta_{\text{noun},\text{verb}} \]
for the inanimate-animate mismatch condition: \[ Y_\text{inanimate,animate} = \beta_{\text{int}} + \beta_{\text{noun}}(0) + \beta_{\text{verb}}(1) + \beta_{\text{noun},\text{verb}}(0)(1) = \beta_{\text{int}} + \beta_{\text{verb}} \]
for the animate-inanimate mismatch condition: \[ Y_\text{animate,inanimate} = \beta_{\text{int}} + \beta_{\text{noun}}(1) + \beta_{\text{verb}}(0) + \beta_{\text{noun},\text{verb}}(1)(0) = \beta_{\text{int}} + \beta_{\text{noun}} \]
and for the inanimate-inanimate match condition: \[ Y_\text{inanimate,inanimate} = \beta_{\text{int}} + \beta_{\text{noun}}(0) + \beta_{\text{verb}}(0) + \beta_{\text{noun},\text{verb}}(0)(0) = \beta_{\text{int}} \]
Again, the intercept encodes the reference level, but “at the interaction level”. In other words, the reference level for the model is the full condition “inanimate-inanimate”, and the intercept encodes this. The Wald \(t\)-test for this coefficient thus tests whether this combined condition is significantly different from zero. Similarly, we can see via the mismatch conditions that \(\beta_\text{noun}\) and \(\beta_\text{verb}\) each encode the difference in one factor from the reference level, while holding the other factor constant at its reference level. This is a so-called simple effect: it tests whether changing the level of one factor, within a fixed level of the other factor, makes a significant difference. Simple effects are not the same as the main effects seen in traditional ANOVA analyses, which are marginal[1] tests, i.e. they test whether the levels of one factor differ significantly when averaged across all levels of the other factors.
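Fitting the simulated data from above under the default dummy coding shows these simple effects directly (the coefficient names are how R labels them):

```r
## With the default dummy coding, inanimate-inanimate is the reference cell.
fit_dummy2 <- lm(y ~ noun * verb, data = d)
coef(fit_dummy2)
## (Intercept)             ~ the inanimate-inanimate cell mean
## nounanimate             ~ simple effect of noun within inanimate verbs
## verbanimate             ~ simple effect of verb within inanimate nouns
## nounanimate:verbanimate ~ the interaction term
```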
For tests of main effects, we need sum coding.
Sum coding
Using sum coding, we again have the following contrast matrix for both the noun and the verb:
level | \(X\) |
---|---|
animate | 1 |
inanimate | -1 |
Filling in the contrast levels, we have the following equation for the animate-animate match condition: \[ Y_\text{animate,animate} = \beta_{\text{int}} + \beta_{\text{noun}}(1) + \beta_{\text{verb}}(1) + \beta_{\text{noun},\text{verb}}(1)(1) = \beta_{\text{int}} + \beta_{\text{noun}} + \beta_{\text{verb}} + \beta_{\text{noun},\text{verb}} \]
for the inanimate-animate mismatch condition: \[ Y_\text{inanimate,animate} = \beta_{\text{int}} + \beta_{\text{noun}}(-1) + \beta_{\text{verb}}(1) + \beta_{\text{noun},\text{verb}}(-1)(1) = \beta_{\text{int}} - \beta_{\text{noun}} + \beta_{\text{verb}} - \beta_{\text{noun},\text{verb}} \]
for the animate-inanimate mismatch condition: \[ Y_\text{animate,inanimate} = \beta_{\text{int}} + \beta_{\text{noun}}(1) + \beta_{\text{verb}}(-1) + \beta_{\text{noun},\text{verb}}(1)(-1) = \beta_{\text{int}} + \beta_{\text{noun}} - \beta_{\text{verb}} - \beta_{\text{noun},\text{verb}} \]
and for the inanimate-inanimate match condition: \[ Y_\text{inanimate,inanimate} = \beta_{\text{int}} + \beta_{\text{noun}}(-1) + \beta_{\text{verb}}(-1) + \beta_{\text{noun},\text{verb}}(-1)(-1) = \beta_{\text{int}} -\beta_{\text{noun}} - \beta_{\text{verb}} + \beta_{\text{noun},\text{verb}} \]
Again, the intercept encodes the mean across all conditions, and the other coefficients encode deviations from the mean. Now, however, the individual coefficients encode marginal effects, which you can see intuitively from the mismatch conditions. For the animate-inanimate condition, the effect of the noun is added in, and the effect of the verb and the interaction term are subtracted out. Similarly, for the inanimate-animate condition, the effect of the verb is added in, and the effect of the noun and the interaction term are subtracted out. For the animate-animate condition, the effect of the noun, the effect of the verb, and the interaction are all added in. For the inanimate-inanimate condition, i.e. the interaction-level “reference” condition, the effect of the verb and the effect of the noun are subtracted out, but the interaction term remains. Intuitively, the interaction term remains because whatever levels the individual factors take, they can still interact (constructively or destructively) with each other. The usual explanation of an interaction is that the effect of one variable depends on the value of another variable, and it is thus dangerous and misleading to interpret the (marginal) effect of one variable as if it were constant – you shouldn’t interpret main effects in the presence of interactions. The animate-animate and inanimate-inanimate conditions show mathematically why this is problematic – the interaction term sticks around!
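Refitting the same simulated data under sum coding shows the marginal interpretation. The commented line assumes the car package for the Type-II tests mentioned above:

```r
## Refit under sum coding; reorder so "animate" is coded +1.
d2 <- d
d2$noun <- factor(d2$noun, levels = c("animate", "inanimate"))
d2$verb <- factor(d2$verb, levels = c("animate", "inanimate"))
contrasts(d2$noun) <- contr.sum(2)
contrasts(d2$verb) <- contr.sum(2)
fit_sum2 <- lm(y ~ noun * verb, data = d2)
coef(fit_sum2)
## (Intercept) ~ the grand mean of the four cell means
## noun1       ~ marginal (main) effect of noun, as a deviation from the mean
## verb1       ~ marginal (main) effect of verb
## noun1:verb1 ~ the interaction term

## Omnibus Type-II Wald tests (assuming the car package is installed):
# car::Anova(fit_sum2, type = "II")
```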
Going in the other direction, there is also the principle of marginality, which states that a model including a higher-order term (such as an interaction) should also include its lower-order constituents (such as the main effects). If we remove the main effects from the above model, then our four mathematically distinct conditions collapse into two conditions, “match” and “mismatch”. While there may be cases where this is interesting, it fails to capture the variation between the levels of each factor and cannot distinguish inanimate-animate from animate-inanimate, nor animate-animate from inanimate-inanimate.
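As a quick sketch of that collapse, using the simulated data from above:

```r
## Collapsing to match/mismatch keeps only the interaction-like contrast
## and loses the ability to distinguish the individual cells.
d$match <- factor(ifelse(d$noun == d$verb, "match", "mismatch"))
fit_match <- lm(y ~ match, data = d)
coef(fit_match)  # two coefficients: cannot separate A-A from I-I, etc.
```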
A comparison of the coding schemes for a 2x2x2 design
We can consider a slightly more complicated 2x2x2 design, such as the one arising when we extend the previous 2x2 design to transitive verbs and additionally manipulate the animacy of the direct object. The following table compares the end results of the two coding schemes (“I” = inanimate, “A” = animate); it should be readily apparent that the individual expressions are quite different, and thus the interpretations of the coefficients as well. The only exception to this rule is the expression for the highest-level interaction, but even then the individual coefficients are interpreted differently because of their differing distribution across the other conditions.
subject | verb | object | dummy | sum |
---|---|---|---|---|
I | I | I | \[\beta_0\] | \[\beta_0 - \beta_1 - \beta_2 - \beta_3 + \beta_{1,2} + \beta_{1,3} + \beta_{2,3} - \beta_{1,2,3} \] |
I | I | A | \[\beta_0 + \beta_3 \] | \[\beta_0 - \beta_1 - \beta_2 + \beta_3 + \beta_{1,2} - \beta_{1,3} - \beta_{2,3} + \beta_{1,2,3} \] |
I | A | I | \[\beta_0 + \beta_2 \] | \[\beta_0 - \beta_1 + \beta_2 - \beta_3 - \beta_{1,2} + \beta_{1,3} - \beta_{2,3} + \beta_{1,2,3} \] |
A | I | I | \[\beta_0 + \beta_1 \] | \[\beta_0 + \beta_1 - \beta_2 - \beta_3 - \beta_{1,2} - \beta_{1,3} + \beta_{2,3} + \beta_{1,2,3} \] |
I | A | A | \[\beta_0 + \beta_2 + \beta_3 + \beta_{2,3} \] | \[\beta_0 - \beta_1 + \beta_2 + \beta_3 - \beta_{1,2} - \beta_{1,3} + \beta_{2,3} - \beta_{1,2,3} \] |
A | A | I | \[\beta_0 + \beta_1 + \beta_2 + \beta_{1,2} \] | \[\beta_0 + \beta_1 + \beta_2 - \beta_3 + \beta_{1,2} - \beta_{1,3} - \beta_{2,3} - \beta_{1,2,3} \] |
A | I | A | \[\beta_0 + \beta_1 + \beta_3 + \beta_{1,3} \] | \[\beta_0 + \beta_1 - \beta_2 + \beta_3 - \beta_{1,2} + \beta_{1,3} - \beta_{2,3} - \beta_{1,2,3} \] |
A | A | A | \[\beta_0 + \beta_1 + \beta_2 + \beta_3 + \beta_{1,2} + \beta_{1,3} + \beta_{2,3} + \beta_{1,2,3} \] | \[\beta_0 + \beta_1 + \beta_2 + \beta_3 + \beta_{1,2} + \beta_{1,3} + \beta_{2,3} + \beta_{1,2,3} \] |
The general form for dummy coding is “baseline + influences of interest”, while the general form for sum coding is “mean - ignored influences + influences of interest”. For the interaction terms, we see that two-way interactions have positive sign when both factors are in the same direction and negative sign when they are not – i.e. two factors can work either constructively to boost an effect or destructively to diminish it. (For factors with more than two levels, there is another possibility: a certain combination of levels is irrelevant compared to the mean and the contrast term cancels out. This happens when at least one of the factors is at a level coded as zero.) For three-way interactions, the meaning of the sign is a bit more complicated, but it roughly indicates whether the majority of factors are working against the “baseline” level (i.e. the level coded as -1), in which case the sign is negative. (You can remember this because “negative” can in casual conversation mean “against you”, and these factors are working against the baseline.)
For variables with more than two levels or still higher-order interactions, the complexity explodes. The number of coefficients in the model is exactly equal to the number of cells in the design, i.e. the product of the numbers of levels (note that for the 2x2x2 manipulation, we had exactly 2x2x2 = 8 different coefficients).
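You can verify the count for the 2x2x2 case directly; here is a sketch using numeric ±1 predictors to mimic sum coding:

```r
## A full-factorial 2x2x2 model has exactly 2 * 2 * 2 = 8 coefficients.
g <- expand.grid(subj = c(-1, 1), verb = c(-1, 1), obj = c(-1, 1))
model.matrix(~ subj * verb * obj, data = g)  # 8 rows x 8 columns
```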
If you got all the way to here, then you’ve earned the privilege of looking at my slides on a lazy approach to doing contrasts right.
Notes
1. The term “marginal” in statistics has a slightly different meaning than in everyday use. “Marginal” in statistics means roughly “unconditional”, “not conditional on another variable”, or “with all possibilities considered”, and comes from the traditional way of displaying probabilities in tables: the “marginal” probabilities were written at the outer edges, i.e. in the margins. In everyday life, things in the margins or at the edges are typically less important, and that’s where the common usage comes from. Statistics seems to make a habit of taking a common term – marginal, significant, etc. – and giving it a specific, technical meaning only vaguely related to the non-technical meaning.↩︎