Last updated: 2020-05-29

These notes summarize key concepts and steps for performing linear regression. Most of the steps are taken from Duke University’s Linear Regression and Modeling course on Coursera.

Linear Regression

1. Correlation

  • correlation measures the strength of the linear association between two variables

  • correlation coefficients are sensitive to outliers

\(R = cor(x, y)\) and \(R^2 = (correlation)^2\)
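To illustrate how sensitive the correlation coefficient is to outliers, the sketch below uses a small made-up data set and adds one extreme point. The vectors x and y are hypothetical and are not the cricket data.

```r
# Hypothetical data: y is close to 2 * x, so the linear association is strong
x <- 1:10
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.2)
r_clean <- cor(x, y)

# Add a single outlier far from the trend and recompute the correlation
x_out <- c(x, 11)
y_out <- c(y, 0)
r_outlier <- cor(x_out, y_out)

round(c(clean = r_clean, with_outlier = r_outlier), 3)
```

One point out of eleven is enough to pull the correlation down sharply, which is why a scatterplot should be checked before trusting the coefficient.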

The code below computes the correlation for the cricket table (x = temp, y = sound).

library(dplyr)

cricket %>% 
  summarise(r=cor(sound, temp)) %>% 
  pull(r) #extract the correlation as a single number
[1] 0.8351438
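As a quick check of the relationship between R and R², squaring the printed correlation should reproduce the Multiple R-squared reported by the model summary in the Inference section. This sketch uses only the printed value, so the result is subject to rounding.

```r
r <- 0.8351438      # correlation between sound and temp, as printed above
r_squared <- r^2    # share of the variability in sound explained by temp
round(r_squared, 4)
```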

This scatterplot shows the points together with the fitted regression line.

ggplot(cricket, aes(x=temp, y=sound))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)

Version Author Date
8ee3b5d KaranSShakya 2020-05-27

2. Residuals

Residuals are the differences between the observed and predicted values. To inspect them, we use augment() from the broom package, which returns the fitted value and residual for every observation.

\(\text{Residual (error)} = \text{observed} - \text{predicted}\)

library(broom)

lm <- lm(sound~temp, data=cricket)
lm.table <- augment(lm) #one row per observation, including .fitted and .resid

#residuals against fitted values
ggplot(lm.table, aes(x=.fitted, y=.resid))+ geom_point(alpha=0.5)
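The definition of a residual can be verified by hand. The sketch below is in base R and uses the built-in cars data set as a stand-in, because the cricket table is not included in these notes; the object names fit and manual_resid are illustrative.

```r
# Stand-in data: built-in cars data set (speed, dist), not the cricket table
fit <- lm(dist ~ speed, data = cars)

# Residual = observed - predicted, computed by hand
manual_resid <- cars$dist - fitted(fit)

# Compare with the residuals stored in the model object
all.equal(unname(manual_resid), unname(resid(fit)))

# With an intercept in the model, the residuals sum to (numerically) zero
sum(resid(fit))
```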

3. Least Squares Line

The best-fitting regression line is the one that minimizes the sum of squared residuals.

\(Slope (b_1) = \frac{SD_y}{SD_x} \times R\)

slope <- lm.table %>% 
  summarize(sd_y=sd(sound), sd_x=sd(temp), cor=cor(sound, temp)) %>% 
  mutate(slope=(sd_y/sd_x*cor)) #Slope = 0.211

The slope estimated by the lm model is also 0.211 (0.21192 before rounding).
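The slope formula can be compared with the coefficients from lm() directly. This base R sketch again uses the built-in cars data set as a stand-in for the cricket table. The intercept formula (mean of y minus slope times mean of x) is the standard least squares result and is added here for completeness; it does not appear in the notes above.

```r
# Stand-in data: built-in cars data set (x = speed, y = dist)
x <- cars$speed
y <- cars$dist

# Slope and intercept computed from summary statistics
b1 <- sd(y) / sd(x) * cor(x, y)
b0 <- mean(y) - b1 * mean(x)

# Coefficients from lm() for comparison
fit <- lm(dist ~ speed, data = cars)
rbind(by_hand = c(b0, b1), from_lm = unname(coef(fit)))
```

Both rows should agree up to floating point error, since lm() minimizes the same sum of squared residuals.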


4. Conditions for Linear Regression

a. Linearity (scatterplot + residual plot: the residuals need to be randomly scattered around 0)

b. Nearly normal residuals (histogram of residuals or QQ residual plot)

c. Constant variability (residual plot)

Link for interactive regression diagnostic test.

library(gridExtra)

a <- ggplot(lm.table, aes(x=.fitted, y=.resid))+
  geom_point(alpha=0.5)+
  geom_hline(yintercept = 0, linetype="dashed", color="red")+
  labs(title="Residuals vs Fitted Values", x="Fitted Values", y="Residuals")
b <- ggplot(lm.table, aes(x=.resid))+
  geom_histogram(bins=10)+
  labs(title="Histogram of residuals", x="Residuals") #geom_density can also be added
c <- ggplot(lm.table, aes(sample=.resid))+
  stat_qq()+
  stat_qq_line() #normal QQ plot of the residuals
grid.arrange(a, b, c, ncol=3)


5. Inference

  • Hypothesis testing on the slope to identify if the explanatory variable is a significant predictor.

  • Null hypothesis \(H_0: \beta_1 = 0\) (no relationship). Alternative hypothesis \(H_A: \beta_1 \neq 0\) (a relationship exists). Here \(\beta_1\) is the true slope.

\(t = \frac{\text{point estimate} - \text{null value}}{SE}\)


summary(lm)

Call:
lm(formula = sound ~ temp, data = cricket)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.56009 -0.57930  0.03129  0.59020  1.53259 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.30914    3.10858  -0.099 0.922300    
temp         0.21192    0.03871   5.475 0.000107 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9715 on 13 degrees of freedom
Multiple R-squared:  0.6975,    Adjusted R-squared:  0.6742 
F-statistic: 29.97 on 1 and 13 DF,  p-value: 0.0001067

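The t value can be found by: (0.211 - 0) / 0.039 ≈ 5.4 (5.475 with the unrounded estimate and standard error).

For the 95% confidence interval (CI): 0.212 ± 2.16 × 0.0387 = (0.13, 0.30), where 2.16 is the critical t value for 13 degrees of freedom.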
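The same inference can be reproduced from the numbers in the summary output. In this sketch qt() looks up the critical value for 13 degrees of freedom and pt() gives the two-sided p-value; the variable names are illustrative, and the inputs are the rounded values printed above.

```r
est <- 0.21192   # slope estimate for temp, from the summary output
se  <- 0.03871   # its standard error
df  <- 13        # residual degrees of freedom

t_stat  <- (est - 0) / se                 # null value is 0
p_value <- 2 * pt(-abs(t_stat), df)       # two-sided p-value
t_crit  <- qt(0.975, df)                  # critical value for a 95% CI
ci      <- est + c(-1, 1) * t_crit * se   # lower and upper bounds

round(c(t = t_stat, p = p_value, t_crit = t_crit, lower = ci[1], upper = ci[2]), 4)
```

If the fitted model object is available, confint(lm) should return the same interval without the rounding in the inputs.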


6. Analysis of Variance

Analysis of Variance Table

Response: sound
          Df Sum Sq Mean Sq F value    Pr(>F)    
temp       1 28.287 28.2873   29.97 0.0001067 ***
Residuals 13 12.270  0.9438                      
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  • F-statistic = Mean Sq (temp) / Mean Sq (residuals) = 28.2873 / 0.9438 = 29.97

\(R^2 = SS(reg)/SS(total) = 28.287/(28.287 + 12.270) = 28.287/40.557 = 0.6975\)
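The F-statistic, its p-value, and R² can all be recomputed from the sums of squares in the ANOVA table. This is a sketch with illustrative variable names, using the rounded table values as inputs.

```r
ss_reg <- 28.287   # Sum Sq for temp
ss_res <- 12.270   # Sum Sq for residuals
df_reg <- 1
df_res <- 13

f_stat    <- (ss_reg / df_reg) / (ss_res / df_res)            # ratio of mean squares
p_value   <- pf(f_stat, df_reg, df_res, lower.tail = FALSE)   # upper tail of the F distribution
r_squared <- ss_reg / (ss_reg + ss_res)                       # SS(total) = SS(reg) + SS(residuals)

round(c(F = f_stat, p = p_value, R2 = r_squared), 4)
```

With a single explanatory variable, the F-statistic is the square of the slope's t value, which is why both tests report the same p-value.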

7. Multi-Variable Linear Regression

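  • R-squared never decreases when another explanatory variable is added, even if that variable has no real relationship with the response.

  • For models with multiple variables, adjusted R-squared is the better measure for comparison because it penalizes each additional predictor.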
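A small experiment shows both points. The sketch below adds a pure noise predictor to a model on the built-in cars data set (a stand-in, since the cricket table is not included here) and compares R-squared with adjusted R-squared. The last lines check the adjusted R-squared formula against the cricket summary; n = 15 is inferred from the 13 residual degrees of freedom plus two estimated coefficients.

```r
set.seed(1)
dat <- cars
dat$noise <- rnorm(nrow(dat))   # predictor unrelated to dist by construction

fit_small <- lm(dist ~ speed, data = dat)
fit_big   <- lm(dist ~ speed + noise, data = dat)

r2     <- c(small = summary(fit_small)$r.squared,
            big   = summary(fit_big)$r.squared)
adj_r2 <- c(small = summary(fit_small)$adj.r.squared,
            big   = summary(fit_big)$adj.r.squared)
rbind(r2, adj_r2)

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
# Cricket summary: R^2 = 0.6975, n = 15 (inferred), k = 1 predictor
adj_cricket <- 1 - (1 - 0.6975) * (15 - 1) / (15 - 1 - 1)
round(adj_cricket, 4)
```

R-squared for the larger model cannot be lower than for the smaller one, while adjusted R-squared is free to fall when the extra predictor adds little.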

