12 💻 How dummy vars work
Categorical variables (also known as factor or qualitative variables) are variables that classify observations into groups. They have a limited number of different values, called levels. For example, the gender of individuals is a categorical variable that can take two levels: Male or Female.
Regression analysis requires numerical variables. So, when a researcher wishes to include a categorical variable in a regression model, supplementary steps are required to make the results interpretable.
12.1 Example data: Salaries from the car pkg
We’ll use the Salaries data set from the car pkg, which contains the 2008/2009 nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S.
| | rank | discipline | yrs.since.phd | yrs.service | sex | salary |
|---|---|---|---|---|---|---|
| 1 | Prof | B | 19 | 18 | Male | 139750 |
| 2 | Prof | B | 20 | 16 | Male | 173200 |
| 3 | AsstProf | B | 4 | 3 | Male | 79750 |
| 4 | Prof | B | 45 | 39 | Male | 115000 |
| 5 | Prof | B | 40 | 41 | Male | 141500 |
| 6..396 | … | … | … | … | … | … |
| 397 | AsstProf | A | 8 | 4 | Male | 81035 |
12.2 Categorical variables with two levels
So far we have pretty much only seen regression with numeric predictors, but most of the time you are going to deal with categorical predictors, like we did for regression on the iris dataset.
Recall that the regression equation, for predicting an outcome variable y on the basis of a predictor variable x, can be simply written as y = b0 + b1*x.
Suppose that, we wish to investigate differences in salaries between males and females.
Based on the gender variable, we can create a new dummy variable that takes the value:
- 1 if a person is male
- 0 if a person is female
and use this variable as a predictor in the regression equation, leading to the following model:
- salary = b0 + b1 if the person is male
- salary = b0 if the person is female
The coefficients can be interpreted as follows:
- b0 is the average salary among females,
- b0 + b1 is the average salary among males,
- and b1 is the average difference in salary between males and females.
For demonstration purposes, the following example models the salary difference between males and females by computing a simple linear regression on the Salaries data set.
R creates dummy variables automatically:
model <- lm(salary ~ sex, data = Salaries)
summary(model)$coef
#> Estimate Std. Error t value
#> (Intercept) 101002.41 4809.386 21.001103
#> sexMale 14088.01 5064.579 2.781674
#> Pr(>|t|)
#> (Intercept) 0.000000000000000000000000000000000000000000000000000000000000000002683482
#> sexMale 0.005667106519338906828187063524637778755277395248413085937500000000000000
From the output above, the average salary for females is estimated to be 101002, whereas for males it is estimated to be 101002 + 14088 = 115090.
The p-value for the dummy variable sexMale is highly significant, suggesting that there is statistical evidence of a difference in average salary between the genders.
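As a quick sanity check, you can compute the group means directly; with a single binary predictor they should match the intercept (females) and intercept plus coefficient (males) from the model above. This sketch assumes the Salaries data from the car package is loaded:

```r
library(car)  # provides the Salaries data set

# Group means of salary by sex; these equal b0 and b0 + b1 from the regression
tapply(Salaries$salary, Salaries$sex, mean)
```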
What happened is that R created a sexMale dummy variable that takes on a value of 1 if the sex is Male, and 0 otherwise (Female). The decision to code males as 1 and females as 0 (the baseline) is arbitrary and has no effect on the regression computation, but it does alter the interpretation of the coefficients.
If you change the factor's level order, you can expect to find a negative coefficient for sexFemale instead.
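To see this, relevel() changes the reference level of a factor; the sketch below (assuming the Salaries data from the car package is loaded) makes Male the baseline, so the reported coefficient is for sexFemale and flips sign:

```r
library(car)  # provides the Salaries data set

# Make "Male" the reference level instead of "Female"
Salaries$sex <- relevel(Salaries$sex, ref = "Male")

# The dummy is now sexFemale, with a coefficient of about -14088
coef(lm(salary ~ sex, data = Salaries))
```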
12.3 Categorical variables with more than two levels
Generally, a categorical variable with n levels will be transformed into n - 1 dummy variables, each with two levels. These new variables contain the same information as the single original variable.
Let's take for example rank in the Salaries data, which has three levels: AsstProf, AssocProf and Prof. This variable could be dummy coded into two variables, one called AssocProf and one called Prof (i.e. n - 1 = 2).
That is to say:
- If rank = AssocProf, then the column AssocProf would be coded with a 1 and Prof with a 0.
- If rank = Prof, then the column AssocProf would be coded with a 0 and Prof would be coded with a 1.
- If rank = AsstProf, then both columns AssocProf and Prof would be coded with a 0.
This dummy coding is performed automatically by R. For demonstration purposes, you can use the function model.matrix() to create the contrast matrix for a factor variable; this is how it looks (remember, R handles this for you):
res <- model.matrix(~rank, data = Salaries)
head(res[, -1])
#> rankAssocProf rankProf
#> 1 0 1
#> 2 0 1
#> 3 0 0
#> 4 0 1
#> 5 0 1
#> 6 1 0
When building a linear model, there are different ways to encode categorical variables, known as contrast coding systems. The default option in R is to use the first level of the factor as a reference and interpret the remaining levels relative to this level.
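You can inspect these default "treatment" contrasts with contrasts(), and change the reference level with relevel(). A minimal sketch, assuming the Salaries data from the car package is loaded:

```r
library(car)  # provides the Salaries data set

# Default treatment contrasts for rank: AsstProf is the reference row of 0s
contrasts(Salaries$rank)
#>           AssocProf Prof
#> AsstProf          0    0
#> AssocProf         1    0
#> Prof              0    1

# To interpret the other ranks relative to full professors instead:
Salaries$rank <- relevel(Salaries$rank, ref = "Prof")
```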
Now let’s fit the model and see results:
library(car)
model2 <- lm(salary ~ yrs.service + rank + discipline + sex,
data = Salaries)
summary(model2)
#>
#> Call:
#> lm(formula = salary ~ yrs.service + rank + discipline + sex,
#> data = Salaries)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -64202 -14255 -1533 10571 99163
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 68351.67 4482.20 15.250
#> yrs.service -88.78 111.64 -0.795
#> rankAssocProf 14560.40 4098.32 3.553
#> rankProf 49159.64 3834.49 12.820
#> disciplineB 13473.38 2315.50 5.819
#> sexMale 4771.25 3878.00 1.230
#> Pr(>|t|)
#> (Intercept) < 0.0000000000000002 ***
#> yrs.service 0.426958
#> rankAssocProf 0.000428 ***
#> rankProf < 0.0000000000000002 ***
#> disciplineB 0.0000000124 ***
#> sexMale 0.219311
#> ---
#> Signif. codes:
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 22650 on 391 degrees of freedom
#> Multiple R-squared: 0.4478, Adjusted R-squared: 0.4407
#> F-statistic: 63.41 on 5 and 391 DF, p-value: < 0.00000000000000022
For example, it can be seen that being from discipline B (applied departments) is significantly associated with an average increase of 13473.38 in salary compared to discipline A (theoretical departments).
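As a sketch of how such a fitted model is used, predict() returns the expected salary for a new observation; the values below are made up for illustration, and the factor levels must match those in the Salaries data:

```r
# Hypothetical new professor (illustrative values only)
newprof <- data.frame(yrs.service = 10, rank = "Prof",
                      discipline = "B", sex = "Female")

# Expected salary under model2 fitted above
predict(model2, newdata = newprof)
```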
12.4 oh there’s another way to say this.. one-hot encoding
One-hot encoding is the process of converting a categorical variable with multiple categories into multiple variables, each with a value of 1 or 0. There are many packages that do this, even though R handles it for you.
For the methods outlined below, the following simple data frame will be required:
set.seed(28)
data <- data.frame(
Outcome = seq(1,100,by=1),
Variable = sample(c("Red","Green","Blue"), 100, replace = TRUE)
)
Now, starting from this data frame, you can one-hot encode, i.e. convert the factor to 0s and 1s, as follows:
library(caret)
#> Loading required package: ggplot2
#> Loading required package: lattice
dummy <- dummyVars(" ~ .", data=data)
newdata <- data.frame(predict(dummy, newdata = data))
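caret's dummyVars() is convenient, but base R can produce the same 0/1 columns with model.matrix(); a minimal sketch (the `+ 0` drops the intercept so every level gets its own column rather than leaving one out as a baseline):

```r
# One-hot encode Variable with base R; "+ 0" keeps a column for every level
onehot <- model.matrix(~ Variable + 0, data = data)
newdata2 <- cbind(data["Outcome"], as.data.frame(onehot))
head(newdata2)
```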
12.5 Handling factors in R
Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.
Factors are stored as integers, and have labels associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
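A quick illustration of this integer-under-the-hood behavior, and why it matters when converting factors back to numbers:

```r
f <- factor(c("10", "5", "20"))
levels(f)                     # "10" "20" "5" -- sorted alphabetically
as.numeric(f)                 # 1 3 2 -- the internal level codes, not the values
as.numeric(as.character(f))   # 10 5 20 -- convert via character to recover values
```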
Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:
sex <- factor(c("male", "female", "female", "male"))
nlevels(sex)
#> [1] 2
now with 3