12 💻 How dummy vars work
Categorical variables (also known as factor or qualitative variables) are variables that classify observations into groups. They have a limited number of different values, called levels. For example, the gender of individuals is a categorical variable that can take two levels: Male or Female.
Regression analysis requires numerical variables. So, when a researcher wishes to include a categorical variable in a regression model, supplementary steps are required to make the results interpretable.
12.1 Example data: Salaries from the car pkg
We’ll use the Salaries data set from the car pkg, which contains the 2008/2009 nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S.
| | rank | discipline | yrs.since.phd | yrs.service | sex | salary |
|---|---|---|---|---|---|---|
| 1 | Prof | B | 19 | 18 | Male | 139750 |
| 2 | Prof | B | 20 | 16 | Male | 173200 |
| 3 | AsstProf | B | 4 | 3 | Male | 79750 |
| 4 | Prof | B | 45 | 39 | Male | 115000 |
| 5 | Prof | B | 40 | 41 | Male | 141500 |
| 6..396 | … | … | … | … | … | … |
| 397 | AsstProf | A | 8 | 4 | Male | 81035 |
12.2 Categorical variables with two levels
So far we have pretty much only seen regression with numeric predictors, but most of the time you are going to deal with categorical predictors, like we did for regression on the iris dataset.
Recall that the regression equation, for predicting an outcome variable y on the basis of a predictor variable x, can be simply written as y = b0 + b1*x.
Suppose that, we wish to investigate differences in salaries between males and females.
Based on the gender variable, we can create a new dummy variable that takes the value:
- 1 if a person is male
- 0 if a person is female
and use this variable as a predictor in the regression equation, leading to the following model:
- salary = b0 + b1 if the person is male
- salary = b0 if the person is female
The coefficients can be interpreted as follows:
- b0 is the average salary among females,
- b0 + b1 is the average salary among males,
- and b1 is the average difference in salary between males and females.
For demonstration purposes, the following example models the salary difference between males and females by computing a simple linear regression on the Salaries data set.
R creates dummy variables automatically:
model <- lm(salary ~ sex, data = Salaries)
summary(model)$coef
#> Estimate Std. Error t value
#> (Intercept) 101002.41 4809.386 21.001103
#> sexMale 14088.01 5064.579 2.781674
#> Pr(>|t|)
#> (Intercept) 0.000000000000000000000000000000000000000000000000000000000000000002683482
#> sexMale 0.005667106519338906828187063524637778755277395248413085937500000000000000
From the output above, the average salary for females is estimated to be 101002, whereas for males it is estimated to be 101002 + 14088 = 115090.
The p-value for the dummy variable sexMale is highly significant, suggesting that there is statistical evidence of a difference in average salary between the genders.
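As a quick sanity check, you can compute the group means directly; with a single binary predictor they should match the intercept (females) and intercept plus coefficient (males) from the model above. This sketch assumes the Salaries data from the car package is loaded:

```r
library(car)  # provides the Salaries data set

# Group means of salary by sex; these equal b0 and b0 + b1 from the regression
tapply(Salaries$salary, Salaries$sex, mean)
```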
What happened is that R created a sexMale dummy variable that takes on a value of 1 if the sex is Male, and 0 otherwise (Female). The decision to code males as 1 and females as 0 (the baseline) is arbitrary and has no effect on the regression computation, but it does alter the interpretation of the coefficients.
If you change the factor's level order, you can expect to find a negative coefficient for sexFemale instead.
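To see this, relevel() changes the reference level of a factor; the sketch below (assuming the Salaries data from the car package is loaded) makes Male the baseline, so the reported coefficient is for sexFemale and flips sign:

```r
library(car)  # provides the Salaries data set

# Make "Male" the reference level instead of "Female"
Salaries$sex <- relevel(Salaries$sex, ref = "Male")

# The dummy is now sexFemale, with a coefficient of about -14088
coef(lm(salary ~ sex, data = Salaries))
```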
12.3 Categorical variables with more than two levels
Generally, a categorical variable with n levels will be transformed into n - 1 dummy variables, each with two levels. These new variables contain the same information as the single original variable.
Let's take for example rank in the Salaries data, which has three levels: AsstProf, AssocProf and Prof. This variable could be dummy coded into two variables, one called AssocProf and one called Prof (i.e. n - 1 = 2).
That is to say:
- If rank = AssocProf, then the column AssocProf would be coded with a 1 and Prof with a 0.
- If rank = Prof, then the column AssocProf would be coded with a 0 and Prof would be coded with a 1.
- If rank = AsstProf, then both columns AssocProf and Prof would be coded with a 0.
This dummy coding is performed automatically by R. For demonstration purposes, you can use the function model.matrix() to create the contrast matrix for a factor variable; this is how it looks (remember, R handles this for you):
res <- model.matrix(~rank, data = Salaries)
head(res[, -1])
#> rankAssocProf rankProf
#> 1 0 1
#> 2 0 1
#> 3 0 0
#> 4 0 1
#> 5 0 1
#> 6 1 0
When building a linear model, there are different ways to encode categorical variables, known as contrast coding systems. The default option in R is to use the first level of the factor as a reference and interpret the remaining levels relative to this level.
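You can inspect these default "treatment" contrasts with contrasts(), and change the reference level with relevel(). A minimal sketch, assuming the Salaries data from the car package is loaded:

```r
library(car)  # provides the Salaries data set

# Default treatment contrasts for rank: AsstProf is the reference row of 0s
contrasts(Salaries$rank)
#>           AssocProf Prof
#> AsstProf          0    0
#> AssocProf         1    0
#> Prof              0    1

# To interpret the other ranks relative to full professors instead:
Salaries$rank <- relevel(Salaries$rank, ref = "Prof")
```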
Now let’s fit the model and see results:
library(car)
model2 <- lm(salary ~ yrs.service + rank + discipline + sex,
data = Salaries)
summary(model2)
#>
#> Call:
#> lm(formula = salary ~ yrs.service + rank + discipline + sex,
#> data = Salaries)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -64202 -14255 -1533 10571 99163
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 68351.67 4482.20 15.250
#> yrs.service -88.78 111.64 -0.795
#> rankAssocProf 14560.40 4098.32 3.553
#> rankProf 49159.64 3834.49 12.820
#> disciplineB 13473.38 2315.50 5.819
#> sexMale 4771.25 3878.00 1.230
#> Pr(>|t|)
#> (Intercept) < 0.0000000000000002 ***
#> yrs.service 0.426958
#> rankAssocProf 0.000428 ***
#> rankProf < 0.0000000000000002 ***
#> disciplineB 0.0000000124 ***
#> sexMale 0.219311
#> ---
#> Signif. codes:
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 22650 on 391 degrees of freedom
#> Multiple R-squared: 0.4478, Adjusted R-squared: 0.4407
#> F-statistic: 63.41 on 5 and 391 DF, p-value: < 0.00000000000000022
For example, it can be seen that being from discipline B (applied departments) is significantly associated with an average increase of 13473.38 in salary compared to discipline A (theoretical departments).
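As a sketch of how such a fitted model is used, predict() returns the expected salary for a new observation; the values below are made up for illustration, and the factor levels must match those in the Salaries data:

```r
# Hypothetical new professor (illustrative values only)
newprof <- data.frame(yrs.service = 10, rank = "Prof",
                      discipline = "B", sex = "Female")

# Expected salary under model2 fitted above
predict(model2, newdata = newprof)
```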
12.4 oh there’s another way to say this.. one-hot encoding
One-hot encoding is the process of converting a categorical variable with multiple categories into multiple variables, each with a value of 1 or 0. There are many packages that do this, even though R handles it for you.
For the methods outlined below, the following simple data frame will be required:
set.seed(28)
data <- data.frame(
Outcome = seq(1,100,by=1),
Variable = sample(c("Red","Green","Blue"), 100, replace = TRUE)
)
Now, starting from this data frame, you can one-hot encode, i.e. convert the factor to 0s and 1s, as follows:
library(caret)
#> Loading required package: ggplot2
#> Loading required package: lattice
dummy <- dummyVars(" ~ .", data=data)
newdata <- data.frame(predict(dummy, newdata = data))
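caret's dummyVars() is convenient, but base R can produce the same 0/1 columns with model.matrix(); a minimal sketch (the `+ 0` drops the intercept so every level gets its own column rather than leaving one out as a baseline):

```r
# One-hot encode Variable with base R; "+ 0" keeps a column for every level
onehot <- model.matrix(~ Variable + 0, data = data)
newdata2 <- cbind(data["Outcome"], as.data.frame(onehot))
head(newdata2)
```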
12.5 Handling factors in R
Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.
Factors are stored as integers, and have labels associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
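A quick illustration of this integer-under-the-hood behavior, and why it matters when converting factors back to numbers:

```r
f <- factor(c("10", "5", "20"))
levels(f)                     # "10" "20" "5" -- sorted alphabetically
as.numeric(f)                 # 1 3 2 -- the internal level codes, not the values
as.numeric(as.character(f))   # 10 5 20 -- convert via character to recover values
```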
Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:
sex <- factor(c("male", "female", "female", "male"))
nlevels(sex)
#> [1] 2
now with 3