Chapter 10 Regression
df <- as_tibble(Spruce)
df %>% group_by(Competition, Fertilizer) %>% summarise(n())
## `summarise()` has grouped output by 'Competition'. You can override using the `.groups` argument.
## # A tibble: 4 x 3
## # Groups:   Competition [2]
##   Competition Fertilizer `n()`
##   <fct>       <fct>      <int>
## 1 C           F             18
## 2 C           NF            18
## 3 NC          F             18
## 4 NC          NF            18
df %>% mutate(diam_delta = Diameter5 - Diameter0) %>%
  ggplot(aes(y = diam_delta, x = Ht.change)) + geom_point()
df %>% mutate(diam_delta = Diameter5 - Diameter0) %>% summarise(rho = cor(diam_delta, Ht.change))
## # A tibble: 1 x 1
##     rho
##   <dbl>
## 1 0.902
df %>% mutate(diam_delta = Diameter5 - Diameter0) %>% do(tidy(lm(diam_delta ~ Ht.change, .)))
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -0.519   0.274       -1.89 6.23e- 2
## 2 Ht.change      0.146   0.00835     17.5  2.98e-27
Instead of using the regular lm function directly, I looked up a way to perform the regression within the tidyverse framework. Luckily I chanced upon the broom package. This was one section that I had missed going through, so it is a nice motivation for me to work through the broom package. I should go over Hadley Wickham's lecture on the broom package, where he mentions that using broom makes it super easy to test out various models.
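As a minimal sketch of that idea, one could put one-row model summaries side by side with broom::glance(); the objects df2, m1 and m2 below are names I am introducing here, not from the book, and the code assumes the Spruce tibble df created above.

library(broom)
df2 <- df %>% mutate(diam_delta = Diameter5 - Diameter0)
m1 <- lm(diam_delta ~ Ht.change, data = df2)
m2 <- lm(diam_delta ~ Ht.change + Fertilizer + Competition, data = df2)
# glance() returns one row of fit statistics (R^2, AIC, ...) per model,
# so bind_rows() stacks them into a single comparison table
bind_rows(simple = glance(m1), full = glance(m2), .id = "model")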
df %>% mutate(diam_delta = Diameter5 - Diameter0) %>%
  do(tidy(lm(diam_delta ~ Ht.change + Fertilizer + Competition, .)))
## # A tibble: 4 x 5
##   term          estimate std.error statistic  p.value
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)      1.05     0.466       2.25 2.78e- 2
## 2 Ht.change        0.104    0.0134      7.75 6.14e-11
## 3 FertilizerNF    -1.03     0.259      -3.97 1.76e- 4
## 4 CompetitionNC    0.490    0.218       2.24 2.82e- 2
model <- lm(Free ~ Short, Skating2010)
summary(model)
##
## Call:
## lm(formula = Free ~ Short, data = Skating2010)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -23.314   -6.780    0.710    6.407   21.205
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   7.9691    18.1175   0.440    0.664
## Short         1.7347     0.2424   7.157 3.56e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.36 on 22 degrees of freedom
## Multiple R-squared: 0.6995, Adjusted R-squared: 0.6859
## F-statistic: 51.22 on 1 and 22 DF, p-value: 3.562e-07
new <- data.frame(Short = 60)
predict(model, new, interval = "confidence")
##        fit      lwr      upr
## 1 112.0494 103.4712 120.6277
predict(model, new, interval = "prediction")
##        fit      lwr      upr
## 1 112.0494 86.98175 137.1171
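To make the difference between the two intervals concrete, here is a minimal sketch of the textbook formulas behind them, assuming the model object fitted above; x0, se_mean and se_pred are names I am introducing for illustration.

x  <- Skating2010$Short
n  <- length(x)
x0 <- 60
s  <- summary(model)$sigma   # residual standard error
se_mean <- s * sqrt(1/n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
se_pred <- s * sqrt(1 + 1/n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
yhat  <- predict(model, data.frame(Short = x0))
tcrit <- qt(0.975, df = n - 2)
yhat + c(-1, 1) * tcrit * se_mean   # confidence interval for the mean response
yhat + c(-1, 1) * tcrit * se_pred   # prediction interval for a new observation

The prediction interval carries the extra 1 inside the square root for the variability of a single new observation, which is why it is so much wider than the confidence interval for the mean response.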
10.1 Resampling CI
pred_vals <- replicate(10^4, {
  idx <- sample(1:24, 24, TRUE)
  fitted <- lm(Free ~ Short, Skating2010[idx, ])
  new <- data.frame(Short = 60)
  predict(fitted, new)
})
hist(pred_vals)
quantile(pred_vals, probs = c(0.025, 0.975))
##     2.5%    97.5%
## 102.7506 120.1658
10.2 Resampling Correlation
pred_vals <- replicate(10^4, {
  idx <- sample(1:24, 24, TRUE)
  rho_bs <- cor(Skating2010$Free, Skating2010$Short[idx])
  rho_bs
})
mean(pred_vals > cor(Skating2010$Free, Skating2010$Short))
## [1] 0
Permute one of the independent variables, compute the correlation for a large number of bootstrap samples, and check whether the realized \(p\)-value is statistically significant.
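A minimal sketch of that permutation idea, written as an explicit permutation test (sampling without replacement, unlike the bootstrap resampling above); obs_rho and perm_rhos are names I am introducing here.

obs_rho <- cor(Skating2010$Free, Skating2010$Short)
perm_rhos <- replicate(10^4, {
  # shuffle Short so that any association with Free is broken
  cor(Skating2010$Free, sample(Skating2010$Short))
})
mean(abs(perm_rhos) >= abs(obs_rho))   # two-sided permutation p-value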
glm(Alcohol ~ Age, data = Fatalities, family = binomial)
##
## Call: glm(formula = Alcohol ~ Age, family = binomial, data = Fatalities)
##
## Coefficients:
## (Intercept) Age
## -0.12262 -0.02898
##
## Degrees of Freedom: 99 Total (i.e. Null); 98 Residual
## Null Deviance: 105.4
## Residual Deviance: 100.2 AIC: 104.2
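The coefficients are reported on the log-odds scale. As a quick illustrative sketch (the object name fit_glm and the example ages are mine), exponentiating a coefficient gives an odds ratio and predict(..., type = "response") gives fitted probabilities:

fit_glm <- glm(Alcohol ~ Age, data = Fatalities, family = binomial)
exp(coef(fit_glm))   # odds ratio per one-year increase in Age
predict(fit_glm, data.frame(Age = c(20, 40, 60)), type = "response")   # fitted probabilities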
10.3 Takeaways
- A linear model means that the conditional mean of the response, given the predictors, is a linear function of the model parameters
- One must understand the difference between prediction intervals and confidence intervals
- One can use the bootstrap to obtain the sampling distribution of the coefficients in a multiple regression model (see the sketch after this list)
- Assumptions of the linear model:
  - \(x\) values are fixed
  - relationship between the variables is linear
  - residuals are independent
  - residuals have constant variance
  - residuals are normally distributed
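As a minimal sketch of the bootstrap takeaway above, reusing the Spruce data: resample rows with replacement, refit the multiple regression each time, and look at percentile intervals for the coefficients (df2 and boot_coefs are names introduced here).

df2 <- df %>% mutate(diam_delta = Diameter5 - Diameter0)
boot_coefs <- replicate(10^4, {
  idx <- sample(nrow(df2), nrow(df2), TRUE)
  coef(lm(diam_delta ~ Ht.change + Fertilizer + Competition, data = df2[idx, ]))
})
# replicate() returns a 4 x 10^4 matrix (one row per coefficient);
# take percentile intervals across the bootstrap replicates
apply(boot_coefs, 1, quantile, probs = c(0.025, 0.975))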