9 💻 Linear Regression in class
9.1 General Linear Regression Workflow & Caveats 🗒
Commit the following steps to memory:
Install the package if you do not already have it, then load it, i.e.
install.packages("<the package>")
library(<the package>)
Remember that install.packages() needs quotation marks around the package name, whereas library() does not. Also make sure you have an internet connection, otherwise you can't download anything.
Load the data into your environment with
data("<the dataset>")
If needed, perform further data manipulation steps:
- select columns
- filter rows
- encode to factor/numeric
Fit the model with
lm(formula = Y ~ X + X2 + X3, data = <the dataset>)
where formula is the model you are fitting (work it out from the question) and data is the actual dataset.
Look at the summary of the model.
Please don't skip parentheses or quotation marks (i.e. ""), and pay attention to uppercase and lowercase characters. They all matter! Before throwing your laptop out of the window, at least check whether your code has any of these issues.
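The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the built-in mtcars data stands in for "<the dataset>" (it ships with R, so no package needs installing), and the chosen columns, the filter threshold and the object names are hypothetical.

```r
# Assumption: mtcars (built into R) plays the role of "<the dataset>".

# 1. load the data into the environment
data("mtcars")

# 2. further data manipulation
cars_small = mtcars[, c("mpg", "wt", "hp", "am")]    # select columns
cars_small = cars_small[cars_small$hp < 300, ]       # filter rows
cars_small$am = factor(cars_small$am,                # encode to factor
                       levels = c(0, 1),
                       labels = c("automatic", "manual"))

# 3. fit the model
workflow_model = lm(formula = mpg ~ wt + hp + am, data = cars_small)

# 4. summary of the model
summary(workflow_model)
```

Notice that every step reuses the object created in the previous one, so a typo in a name (or a wrong uppercase letter) breaks everything that follows.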
9.2 Class exercises @ 2023-02-09 🏛
Exercise 9.1 Using the dataset baltimore
downloadable from the library spdep
Estimate the simple linear regression model which explains the price (PRICE)
as a function of the “age” of the house (AGE)
> Is the slope significant at 5%?
# if you don't have the package, install it first
# install.packages("spdep")
library(spdep)
data("baltimore")
# let's look at baltimore
# View(baltimore)
dim(baltimore)
str(baltimore)
colnames(baltimore)
## estimate regression
baltimore_regression = lm(PRICE ~ AGE, data = baltimore)
summary(baltimore_regression)
Answer to Exercise 9.1:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 55.08500 2.82833 19.48 < 0.0000000000000002
AGE -0.35802 0.07851 -4.56 0.0000087
---
The p-value of the AGE coefficient is about 0.0000087, far below 0.05, so the slope is significant at the 5% level (and at any other commonly used level).
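Instead of reading the p-value off the printed summary, you can also extract it from the coefficient table and compare it with 0.05 directly. The sketch below assumes the built-in cars data as a stand-in for baltimore, so that it runs without installing spdep; the object names are hypothetical.

```r
# Assumption: the built-in cars data replaces baltimore here.
data("cars")
cars_regression = lm(dist ~ speed, data = cars)

# coef(summary(...)) returns the coefficient table as a matrix
coef_table = coef(summary(cars_regression))

# p-value of the slope: row = name of the predictor, column = "Pr(>|t|)"
slope_p_value = coef_table["speed", "Pr(>|t|)"]

# TRUE means that the slope is significant at the 5% level
slope_p_value < 0.05
```

For the exercise model the same lookup would be coef(summary(baltimore_regression))["AGE", "Pr(>|t|)"].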
Exercise 9.2 The coleman data set in the robustbase library lists summary statistics for 20 different schools in the northeast US. The six variables measured on each school include demographic information (such as percent of white-collar fathers) and characteristics of each school (such as staff salaries per pupil).
> 1. Code a regression model explaining the variable Y, using all other variables.
> 2. Which variable is most significant?
# install.packages("robustbase") if you don't already have the package
library(robustbase)
data("coleman")
## explore data: how many rows does it have?
## View(coleman)
str(coleman)
model_coleman = lm(formula = Y ~ ., data = coleman)
summary(model_coleman)
Answer to Exercise 9.2:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.94857 13.62755 1.464 0.1653
salaryP -1.79333 1.23340 -1.454 0.1680
fatherWc 0.04360 0.05326 0.819 0.4267
sstatus 0.55576 0.09296 5.979 0.0000338 **
teacherSc 1.11017 0.43377 2.559 0.0227
motherLev -1.81092 2.02739 -0.893 0.3868
The most significant coefficient in the model we fitted is “sstatus”, with a p-value of about 0.0000338. “teacherSc” is also significant at the 5% level (p-value 0.0227), while the remaining variables are not.
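When a model has many predictors, the one with the smallest p-value can also be found programmatically. The following is a sketch, assuming the built-in swiss data (with Fertility as response) as a stand-in for coleman so that no package is needed; the object names are hypothetical.

```r
# Assumption: the built-in swiss data replaces coleman here.
data("swiss")

# "." in the formula means "all the other columns of the dataset"
swiss_model = lm(formula = Fertility ~ ., data = swiss)

# coefficient table as a matrix, without the intercept row
coef_table = coef(summary(swiss_model))
predictors_table = coef_table[rownames(coef_table) != "(Intercept)", , drop = FALSE]

# predictors sorted from the smallest to the largest p-value
predictors_table[order(predictors_table[, "Pr(>|t|)"]), ]

# name of the predictor with the smallest p-value
most_significant = rownames(predictors_table)[which.min(predictors_table[, "Pr(>|t|)"])]
most_significant
```

The intercept is dropped before the comparison because the question is about the variables, not the constant term.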
Exercise 9.3 Using the dataset Bikeshare of the library ISLR2, estimate the regression that explains casual users as a function of the registered users.
> What is the p-value of the estimate of the intercept?
Bikeshare data description: This data set contains the hourly and daily counts of rental bikes between the years 2011 and 2012 in the Capital Bikeshare system, along with weather and seasonal information.
# for those who cannot install the package, don't worry:
# let me know about your problem and I will make the dataset
# available to you in a different way!
# install.packages("ISLR2")
library(ISLR2)
data("Bikeshare")
## !! for those who can't install ISLR2: uncomment and execute the following instead !!
# Bikeshare = read.csv("https://raw.githubusercontent.com/NiccoloSalvini/sbd_22-23/main/data/bikeshare.csv?token=GHSAT0AAAAAABZG6GFQDNXQRWROYOGNIKCIY2X7Y4Q")
## explore data, focus on column types
str(Bikeshare)
## fit model
bikeshare_model = lm(formula = casual~registered, data = Bikeshare)
summary(bikeshare_model)
::: {.answer data-latex=""}
Answer to Exercise 9.3:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.393972 0.518483 14.26 <0.0000000000000002
registered 0.184095 0.003263 56.42 <0.0000000000000002
---
The p-value of the intercept estimate is below 0.0000000000000002, i.e. practically zero (the estimate itself is 7.393972).
:::
Exercise 9.4 Given the dataset “Duncan” in the library “carData”, estimate the regression model where the variable prestige is regressed on the variables income and education.
> Which variable is the most significant?
Duncan data description: The Duncan data frame has 45 rows and 4 columns. Data on the prestige and other characteristics of 45 U.S. occupations in 1950.
## install the package if you don't already have it
## install.packages("carData") please notice that the "D" is uppercase
library(carData)
data("Duncan")
##explore Duncan dataset
str(Duncan) ## 45 rows x 4 columns
## any prep needed
## fit model
duncan_model = lm(prestige ~ income + education, data = Duncan)
summary(duncan_model)
::: {.answer data-latex=""}
Answer to Exercise 9.4:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.06466 4.27194 -1.420 0.163
income 0.59873 0.11967 5.003 0.00001053
education 0.54583 0.09825 5.555 0.00000173
---
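Both income and education are highly significant, with p-values far below 0.001. The most significant one is education, which has the smaller p-value (0.00000173 vs 0.00001053) and the larger t value (5.555 vs 5.003). It is also worth noting that both coefficients are positive (about 0.60 and 0.55), which suggests that prestige is positively linked to income and education level.
:::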
9.3 Newer exercises on linear regression
They are a little more advanced than the ones we have faced in class. Don’t panic.
Exercise 9.5 Given the dataset state.x77, which should already be present in R, perform the following tasks:
1. Load the state datasets.
2. Convert the state.x77 dataset to a dataframe.
3. Rename the Life Exp variable to Life.Exp, and HS Grad to HS.Grad. (This avoids problems with referring to these variables when specifying a model.)
Some of the commands needed for these tasks have not been covered in class; however, try to solve them by yourself by searching on Google. The Google query might be something like “How to rename columns in dataframe in R”.
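If you get stuck, here is one possible way to carry out the three tasks of Exercise 9.5 with base R only. It is a sketch, not the only solution: the object name state_df is hypothetical, and renaming via colnames() is just one of several options.

```r
# 1. load the state datasets (they ship with R, no package to install)
data("state")

# 2. state.x77 is a matrix: convert it to a dataframe
state_df = as.data.frame(state.x77)

# 3. rename the two columns that contain a space
colnames(state_df)[colnames(state_df) == "Life Exp"] = "Life.Exp"
colnames(state_df)[colnames(state_df) == "HS Grad"] = "HS.Grad"

# check the result
colnames(state_df)
str(state_df)
```

Selecting the column by its old name, rather than by its position, keeps the code correct even if the column order changes.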
Exercise 9.6 Suppose we wanted to enter all the variables in a first-order linear regression model with Life Expectancy as the dependent variable.
1. Fit this model.
Exercise 9.7 Let’s assume that we have settled on a model that has HS.Grad and Murder as predictors.
- Fit this model.
- Add an interaction term to the previous model.
- Predict the Life Expectancy for a state where 55% of the population are High School graduates, and the murder rate is 8 per 100,000.
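A possible sketch of Exercise 9.7 follows. The assumptions are: the column names are cleaned with make.names() (which turns "Life Exp" into Life.Exp and "HS Grad" into HS.Grad), the object names are hypothetical, and, since the exercise does not say which model to use for the prediction, both are shown.

```r
# prepare the data: dataframe with syntactically valid column names
state_df = as.data.frame(state.x77)
colnames(state_df) = make.names(colnames(state_df))

# model with HS.Grad and Murder as predictors
additive_model = lm(Life.Exp ~ HS.Grad + Murder, data = state_df)
summary(additive_model)

# same model plus the interaction term:
# HS.Grad * Murder expands to HS.Grad + Murder + HS.Grad:Murder
interaction_model = lm(Life.Exp ~ HS.Grad * Murder, data = state_df)
summary(interaction_model)

# new state: 55% High School graduates, murder rate of 8 per 100,000
# (column names must match the predictors used in the formula)
new_state = data.frame(HS.Grad = 55, Murder = 8)

predict(additive_model, newdata = new_state)
predict(interaction_model, newdata = new_state)
```

predict() needs a dataframe whose column names are exactly those of the predictors, otherwise it cannot find the values to plug into the model.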