Chapter 10 Regression
df <- as_tibble(Spruce)
df %>% group_by(Competition, Fertilizer) %>% summarise(n())
## `summarise()` has grouped output by 'Competition'. You can override using the `.groups` argument.
## # A tibble: 4 x 3
## # Groups:   Competition [2]
##   Competition Fertilizer `n()`
##   <fct>       <fct>      <int>
## 1 C           F             18
## 2 C           NF            18
## 3 NC          F             18
## 4 NC          NF            18
df %>% mutate(diam_delta = Diameter5 - Diameter0) %>%
  ggplot(aes(y = diam_delta, x = Ht.change)) + geom_point()
df %>% mutate(diam_delta = Diameter5 - Diameter0) %>% summarise(rho = cor(diam_delta, Ht.change))
## # A tibble: 1 x 1
##     rho
##   <dbl>
## 1 0.902
df %>% mutate(diam_delta = Diameter5 - Diameter0) %>% do(tidy(lm(diam_delta ~ Ht.change, .)))
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -0.519   0.274       -1.89 6.23e- 2
## 2 Ht.change      0.146   0.00835     17.5  2.98e-27
Instead of using the regular lm function directly, I looked up a way to perform the regression within the tidyverse framework. Luckily I chanced upon the broom package. This was one section that I had missed going through, so it is a nice motivation for me to work through the broom package. I should go over Hadley Wickham's lecture on the broom package, where he mentions that using broom makes it super easy to test out various models.
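As a minimal sketch of that idea, one could put one-row model summaries side by side with broom::glance(); the objects df2, m1 and m2 below are names I am introducing here, not from the book, and the code assumes the Spruce tibble df created above.

library(broom)
df2 <- df %>% mutate(diam_delta = Diameter5 - Diameter0)
m1 <- lm(diam_delta ~ Ht.change, data = df2)
m2 <- lm(diam_delta ~ Ht.change + Fertilizer + Competition, data = df2)
# glance() returns one row of fit statistics (R^2, AIC, ...) per model,
# so bind_rows() stacks them into a single comparison table
bind_rows(simple = glance(m1), full = glance(m2), .id = "model")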
df %>% mutate(diam_delta = Diameter5 - Diameter0) %>%
  do(tidy(lm(diam_delta ~ Ht.change + Fertilizer + Competition, .)))
## # A tibble: 4 x 5
##   term          estimate std.error statistic  p.value
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)      1.05     0.466       2.25 2.78e- 2
## 2 Ht.change        0.104    0.0134      7.75 6.14e-11
## 3 FertilizerNF    -1.03     0.259      -3.97 1.76e- 4
## 4 CompetitionNC    0.490    0.218       2.24 2.82e- 2
model <- lm(Free ~ Short, Skating2010)
summary(model)
##
## Call:
## lm(formula = Free ~ Short, data = Skating2010)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -23.314   -6.780    0.710    6.407   21.205
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   7.9691    18.1175   0.440    0.664
## Short         1.7347     0.2424   7.157 3.56e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.36 on 22 degrees of freedom
## Multiple R-squared: 0.6995, Adjusted R-squared: 0.6859
## F-statistic: 51.22 on 1 and 22 DF, p-value: 3.562e-07
new <- data.frame(Short = 60)
predict(model, new, interval = "confidence")
##        fit      lwr      upr
## 1 112.0494 103.4712 120.6277
predict(model, new, interval = "prediction")
##        fit      lwr      upr
## 1 112.0494 86.98175 137.1171
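To make the difference between the two intervals concrete, here is a minimal sketch of the textbook formulas behind them, assuming the model object fitted above; x0, se_mean and se_pred are names I am introducing for illustration.

x  <- Skating2010$Short
n  <- length(x)
x0 <- 60
s  <- summary(model)$sigma   # residual standard error
se_mean <- s * sqrt(1/n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
se_pred <- s * sqrt(1 + 1/n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
yhat  <- predict(model, data.frame(Short = x0))
tcrit <- qt(0.975, df = n - 2)
yhat + c(-1, 1) * tcrit * se_mean   # confidence interval for the mean response
yhat + c(-1, 1) * tcrit * se_pred   # prediction interval for a new observation

The prediction interval carries the extra 1 inside the square root for the variability of a single new observation, which is why it is so much wider than the confidence interval for the mean response.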
10.1 Resampling CI
pred_vals <- replicate(10^4, {
  idx <- sample(1:24, 24, TRUE)
  fitted <- lm(Free ~ Short, Skating2010[idx, ])
  new <- data.frame(Short = 60)
  predict(fitted, new)
})
hist(pred_vals)
quantile(pred_vals, probs = c(0.025, 0.975))
##     2.5%    97.5%
## 102.7506 120.1658
10.2 Resampling Correlation
pred_vals <- replicate(10^4, {
  idx <- sample(1:24, 24, TRUE)
  rho_bs <- cor(Skating2010$Free, Skating2010$Short[idx])
  rho_bs
})
mean(pred_vals > cor(Skating2010$Free, Skating2010$Short))
## [1] 0
Permute one of the independent variables, compute the correlation for a large number of bootstrap samples, and check whether the realized \(p\)-value is statistically significant.
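A minimal sketch of that permutation idea, written as an explicit permutation test (sampling without replacement, unlike the bootstrap resampling above); obs_rho and perm_rhos are names I am introducing here.

obs_rho <- cor(Skating2010$Free, Skating2010$Short)
perm_rhos <- replicate(10^4, {
  # shuffle Short so that any association with Free is broken
  cor(Skating2010$Free, sample(Skating2010$Short))
})
mean(abs(perm_rhos) >= abs(obs_rho))   # two-sided permutation p-value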
glm(Alcohol ~ Age, data = Fatalities, family = binomial)
##
## Call: glm(formula = Alcohol ~ Age, family = binomial, data = Fatalities)
##
## Coefficients:
## (Intercept) Age
## -0.12262 -0.02898
##
## Degrees of Freedom: 99 Total (i.e. Null); 98 Residual
## Null Deviance: 105.4
## Residual Deviance: 100.2 AIC: 104.2
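The coefficients are reported on the log-odds scale. As a quick illustrative sketch (the object name fit_glm and the example ages are mine), exponentiating a coefficient gives an odds ratio and predict(..., type = "response") gives fitted probabilities:

fit_glm <- glm(Alcohol ~ Age, data = Fatalities, family = binomial)
exp(coef(fit_glm))   # odds ratio per one-year increase in Age
predict(fit_glm, data.frame(Age = c(20, 40, 60)), type = "response")   # fitted probabilities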
10.3 Takeaways
- A linear model means that the conditional mean of the response, given the predictors, is a linear function of the model parameters
- One must understand the difference between prediction intervals and confidence intervals
- One can use the bootstrap to obtain the sampling distribution of the coefficients in a multiple regression model (see the sketch after this list)
- Assumptions of the linear model:
  - \(x\) values are fixed
  - relationship between the variables is linear
  - residuals are independent
  - residuals have constant variance
  - residuals are normally distributed
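As a minimal sketch of the bootstrap takeaway above, reusing the Spruce data: resample rows with replacement, refit the multiple regression each time, and look at percentile intervals for the coefficients (df2 and boot_coefs are names introduced here).

df2 <- df %>% mutate(diam_delta = Diameter5 - Diameter0)
boot_coefs <- replicate(10^4, {
  idx <- sample(nrow(df2), nrow(df2), TRUE)
  coef(lm(diam_delta ~ Ht.change + Fertilizer + Competition, data = df2[idx, ]))
})
# replicate() returns a 4 x 10^4 matrix (one row per coefficient);
# take percentile intervals across the bootstrap replicates
apply(boot_coefs, 1, quantile, probs = c(0.025, 0.975))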