UA MATH571A Regression Analysis: Concepts and R Code Summary
- Simple Linear Regression
- Multiple Linear Regression
Part 0 Basic R code
Tip 1: Read in data
read.csv("D:/Stat PhD/taking course/summer1/ref/regression/Salary1.csv", header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "")
read.csv( file.choose() )
Tip 2: R code plot()
plot(x,y,…) or plot(y~x,…) to select variables
type = , "p" (points), "l" (lines), "b" (both), "c" (the lines of "b" alone), "h" (histogram-like vertical lines)
pch = (an integer 0-25, selecting one of 26 point characters)
lty = (an integer 1-6, selecting one of 6 line types)
lwd = (a number, line width)
col = "" (a color name such as "red", or a palette index)
main = "", xlab = "", ylab = "" (title and axis labels; the title() function can also add them after plotting)
xlim = c(,), ylim = c(,)
Use par( new = TRUE ) to overlay two plots in one window (keep the same xlim and ylim).
Tip 3: R code lm()
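A minimal sketch of lm(), assuming a data frame mydata with columns Y and X (the names are illustrative):
fit <- lm(Y ~ X, data = mydata)   # simple linear regression of Y on X
summary(fit)                      # coefficient table, R^2, overall F test
confint(fit)                      # confidence intervals for beta_0 and beta_1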
Tip 4: R code predict()
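A sketch of predict() with confidence and prediction intervals, continuing the hypothetical fit above:
newX <- data.frame(X = c(30, 50))                           # hypothetical new X values
predict(fit, newX, interval = "confidence", level = 0.95)   # CI for the mean response
predict(fit, newX, interval = "prediction", level = 0.95)   # PI for a new observation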
Simple Linear Regression
Part 1 Inference and Prediction
Tip 1: Model Assumption $Y_i \overset{ind.}{\sim} N(\beta_0+\beta_1X_i,\ \sigma^2)$, equivalently $\epsilon_i \overset{iid}{\sim} N(0,\sigma^2)$.
Tip 2: Define $k_i = \frac{X_i-\bar{X}}{\sum_{i=1}^n (X_i-\bar{X})^2}$. Estimators of the coefficients are
$$\hat{\beta}_1 = \sum_{i=1}^{n} k_i Y_i = \beta_1 + \sum_{i=1}^{n} k_i \epsilon_i \sim N\left(\beta_1,\ \frac{\sigma^2}{\sum_{i=1}^n(X_i-\bar{X})^2}\right)$$
$$\hat{\beta}_0 = \sum_{i=1}^{n} \left( \frac{1}{n}- k_i \bar{X}\right) Y_i = \beta_0 + \sum_{i=1}^{n} \left( \frac{1}{n}- k_i \bar{X}\right) \epsilon_i \sim N\left(\beta_0,\ \sigma^2 \left(\frac{1}{n}+\frac{\bar{X}^2}{\sum_{i=1}^n (X_i-\bar{X})^2}\right)\right)$$
Tip 3: The fitted value at $X_h$ and the residuals satisfy
$$\hat{Y}_h = \hat{\beta}_0 + \hat{\beta}_1 X_h \sim N\left(E\{Y_h\},\ \sigma^2 \left(\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)\right)$$
$$e_j = Y_j - \hat Y_j \sim N\left(0,\ \sigma^2 \left(1-\frac{1}{n} - \frac{(X_j - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)\right)$$
Tip 4: The prediction interval for a new observation at $X_h$ uses
$$Var(\hat{Y}_h-Y_h)=\sigma^2 \left(1+\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)$$
$$\hat{Y}_h - t(1-\tfrac{\alpha}{2},\,n-2)\,se\{\hat{Y}_h-Y_h\} < Y_h < \hat{Y}_h + t(1-\tfrac{\alpha}{2},\,n-2)\,se\{\hat{Y}_h-Y_h\}$$
Tip 5: ANOVA

| Source | SS | df | MS | F |
| --- | --- | --- | --- | --- |
| Regression | $SSR=\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2$ | 1 | $MSR = \frac{SSR}{df_R}$ | $F = \frac{MSR}{MSE}$ |
| Residuals | $SSE=\sum_{i=1}^n (Y_i - \hat{Y}_i)^2$ | n-2 | $MSE = \frac{SSE}{df_E}$ | |
| Total | $SST=\sum_{i=1}^n (Y_i - \bar{Y})^2$ | n-1 | $MST = \frac{SST}{df_T}$ | |
$$\frac{SSE}{\sigma^2} \sim \chi^2_{n-2}, \qquad F = \frac{MSR}{MSE} \sim F(1,\,n-2) \text{ under } H_0:\beta_1 = 0 \ \text{ vs } \ H_a:\beta_1 \ne 0$$
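In R, both the ANOVA table and the F test come straight from the fitted object (reusing the hypothetical fit from Part 0):
anova(fit)     # SSR, SSE, df, MSR, MSE and the F statistic
summary(fit)   # the same F test appears in the last line of the output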
Part 2 Residual Plots and Diagnostics
Tip 1: Residuals against sample index: check for serial correlation.
Tip 2: Residuals against the independent variable: check for a missing higher-order term.
Tip 3: Residuals against fitted values: check for heteroscedasticity.
Tip 4: Residuals against a potential independent variable: check for an important omitted variable.
Tip 5: Normal Probability Plot: check whether normality holds.
Use R code qqnorm() and qqline() to create a QQ plot.
Tip 6: Shapiro-Wilk test
Use R code shapiro.test() to run the Shapiro-Wilk test; fail to reject normality for a large p-value.
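For example, applied to the residuals of the hypothetical fit from Part 0:
shapiro.test( resid(fit) )   # H0: the residuals are normal; do not reject when p is large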
Tip 7: Brown-Forsythe test
Use R code leveneTest() from package car to run the Brown-Forsythe test (the default center = median) for unequal variance across groups; fail to reject homogeneity for a large p-value. Example,
ei = resid(Ex3.lm)
G <- (X<80)[order(X)]                      # split the sorted observations into two groups by X
group <- as.factor(G)
BF.htest = leveneTest(ei[order(X)], group)
Tip 8: Lack-of-fit F test
Fit the reduced and the full model, then use R code anova(Reduced, Full) to run the lack-of-fit F test; fail to reject the reduced model for a large p-value, as in the sketch below.
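A minimal sketch, assuming X has replicates at some values so the saturated means model is estimable (names are illustrative):
reduced <- lm( Y ~ X )           # straight-line model
full    <- lm( Y ~ factor(X) )   # full (cell-means) model; requires repeated X values
anova( reduced, full )           # lack-of-fit F test; large p favors the reduced model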
Part 3 Remedial Methods: Box-Cox, Family-wise Adjustment and Nonparametrics
Tip 1: Box-Cox transformation
If the assumptions fail (non-linearity, non-normality, or unstable variance), try a Box-Cox transformation of Y. Example
require( MASS )
Ex3.bc = boxcox( Ex3.lm,lambda=seq(-1, 1, 0.1), interp=F )
cbind( Ex3.bc$x, Ex3.bc$y )
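The estimated lambda is the grid value maximizing the profile log-likelihood (a small addition to the example above):
lambda.hat <- Ex3.bc$x[ which.max(Ex3.bc$y) ]   # lambda with the largest log-likelihood
lambda.hat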
Tip 2: Bonferroni Adjustment
Familywise alpha is at most m (the number of estimators) times the pointwise alpha, so run each interval at pointwise level α/m.
For statistics $T_i$, $i = 1,\dots,m$, the familywise CIs are
$$B = t(1-\tfrac{\alpha}{2m},\,n-2), \qquad [\,T_i - B\,se(T_i),\ T_i + B\,se(T_i)\,]$$
Tip 3: Working-Hotelling-Scheffe Bound
For statistics $T_i$, the familywise CIs are
$$W = \sqrt{m\,F(1-\alpha,\,m,\,n-2)}, \qquad [\,T_i - W\,se(T_i),\ T_i + W\,se(T_i)\,]$$
Here $\alpha$ enters without a Bonferroni-style division by m; the F quantile supplies the familywise coverage. For small m, Bonferroni usually gives the shorter interval; otherwise use WHS.
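A sketch comparing the two multipliers (m, n, alpha are hypothetical); whichever is smaller gives the tighter simultaneous intervals:
m <- 4; n <- 30; alpha <- 0.05            # hypothetical values
B <- qt( 1 - alpha/(2*m), n - 2 )         # Bonferroni multiplier
W <- sqrt( m * qf(1 - alpha, m, n - 2) )  # Working-Hotelling-Scheffe multiplier
min(B, W)                                 # use the smaller multiplier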
Tip 4: Heteroscedasticity and WLS: if homogeneity of variance is rejected, use weighted least squares (WLS), as sketched below.
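A common WLS sketch (variable names are illustrative), assuming the error SD varies roughly linearly with the mean response: estimate the SD by regressing |residuals| on fitted values, then refit with weights equal to the reciprocal of the estimated variance.
ols <- lm( Y ~ X )                               # ordinary least squares fit
sd.fit <- lm( abs(resid(ols)) ~ fitted(ols) )    # rough model of the error SD
w <- 1 / fitted(sd.fit)^2                        # weights = 1 / estimated variance
wls <- lm( Y ~ X, weights = w )
summary(wls)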
Tip 5: Nonparametric Regression: if normality is rejected, nonparametric regression is an alternative. Use R code loess() to fit and predict() to evaluate the nonparametric curve. Example
baseballSLR.lm <- lm(Y ~ X)
absresid = abs(resid(baseballSLR.lm))
Yhat = fitted(baseballSLR.lm)
baseball.lo = loess( absresid ~ Yhat, span = 0.5, degree = 2, family = 'symmetric' )
Ysmooth = predict( baseball.lo, data.frame(Yhat = seq(min(Yhat), max(Yhat), .001)) )
plot( absresid ~ Yhat, xlim = c(.25,.29), ylim = c(0,.11) )
par( new = TRUE )
plot( Ysmooth ~ seq(min(Yhat), max(Yhat), .001), type = 'l', lwd = 2,
      xaxt = 'n', yaxt = 'n', xlab = '', ylab = '', xlim = c(.25,.29), ylim = c(0,.11) )
Part 4 Correlation Analysis
Tip 1: PPMCC (Pearson product-moment correlation coefficient). Use R code cor.test() with method = "pearson".
Tip 2: Spearman's Rank test. Use R code cor.test() with method = "spearman". Both are sketched below.
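A sketch of both tests, assuming numeric vectors X and Y:
cor.test( X, Y, method = 'pearson' )    # PPMCC with t test of H0: rho = 0
cor.test( X, Y, method = 'spearman' )   # Spearman's rank correlation test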
Part 5 Inverse Prediction and Bootstraps
Tip 1: Inverse prediction
$$\hat{X}_h = \frac{Y_h - \hat{\beta}_0}{\hat{\beta}_1}, \qquad \frac{\hat{X}_h - X_h}{se\{predX\}} \sim t(n-2)$$
$$s^2\{predX\} = \frac{MSE}{\hat{\beta}_1^2} \left[1+\frac{1}{n}+\frac{(\hat{X}_h - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\right]$$
Tip 2: Bootstrapped confidence interval: above, $\hat\beta_1$ is treated as a constant, which is not really true. A better way to get a CI is the bootstrap. Example
theta_hat <- (50-finalpr2a.lm$coefficients[1])/finalpr2a.lm$coefficients[2]
ei <- resid(finalpr2a.lm)
Yhat <- fitted(finalpr2a.lm)
b1origin <- finalpr2a.lm$coefficients[2]
b0origin <- finalpr2a.lm$coefficients[1]
n <- length(Y)
B <- 5000
b0 <- numeric(B)
b1 <- numeric(B)
set.seed(2019)
for(b in 1:B){
  esti <- sample(ei, n, replace=T)   # resample residuals with replacement
  Yest <- Yhat + esti                # rebuild a bootstrap response vector
  b0[b] <- coef(lm(Yest~X1))[1]
  b1[b] <- coef(lm(Yest~X1))[2]
}
theta <- (50-b0)/b1
summary(theta)
theta <- sort(theta)
thetaL <- theta[126]      # approximately the 2.5% empirical quantile (B = 5000)
thetaU <- theta[4875]     # approximately the 97.5% empirical quantile
c(thetaL, thetaU)
hist(theta, probability = T)
abline(v=thetaL, lty=2, lwd=2)
abline(v=thetaU, lty=2, lwd=2)
Multiple Linear Regression
Part 1 Inference and Prediction
Tip 1: Model Assumption $Y \sim N(X\beta,\ \sigma^2 I_n)$
Tip 2: Estimators of the coefficients are $\hat{\beta} = (X^TX)^{-1}X^TY$
Tip 3: With the hat matrix $H = X(X^TX)^{-1}X^T$, the fitted values and residuals are $\hat{Y} = X\hat{\beta} = HY$ and $e = Y - \hat{Y} = (I_n - H)Y \sim N(0,\ \sigma^2(I_n - H))$.
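A sketch verifying these matrix formulas against lm() (X1, X2, Y are hypothetical numeric vectors):
Xmat <- cbind(1, X1, X2)                              # design matrix with intercept column
betahat <- solve( t(Xmat) %*% Xmat ) %*% t(Xmat) %*% Y
H <- Xmat %*% solve( t(Xmat) %*% Xmat ) %*% t(Xmat)   # hat matrix
e <- Y - H %*% Y                                      # residuals
cbind( betahat, coef( lm(Y ~ X1 + X2) ) )             # the two columns should agree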
Tip 4: ANOVA

| Source | SS | df | MS |
| --- | --- | --- | --- |
| Regression | $SSR=\hat{\beta}^TX^TY - \frac{1}{N}Y^TJY$ | p-1 | $MSR = \frac{SSR}{df_R}$ |
| Residuals | $SSE=Y^TY - \hat{\beta}^TX^TY$ | N-p | $MSE = \frac{SSE}{df_E}$ |
| Total | $SST=Y^T(I_N-\frac{J}{N})Y$ | N-1 | $MST = \frac{SST}{df_T}$ |
Tip 5: Sequential ANOVA. Applied to a single lm object, anova() reports sequential (Type I) sums of squares in the order the terms were entered; given several nested models, anova() tests them against one another in the order specified, as sketched below.
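A sketch of both uses (predictors are hypothetical):
anova( lm(Y ~ X1 + X2 + X3) )               # sequential SS: X1, then X2 | X1, then X3 | X1, X2
anova( lm(Y ~ X1), lm(Y ~ X1 + X2 + X3) )   # extra-sum-of-squares F test between nested models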
Part 2 Variable Selection
Tip 1: Adjusted coefficient of determination $R^2_{a,p}$. Use R code leaps(x, y, method = "adjr2")
library(leaps)
CH09PR11.r2 <- leaps( x=cbind(X1,X2,X3,X4), y=Y, method='adjr2')
p = seq( min(CH09PR11.r2$size), max(CH09PR11.r2$size) )
plot( CH09PR11.r2$adjr2 ~ CH09PR11.r2$size , ylab=expression(R[a]^2), xlab='p' )
Rp2 = by( data=CH09PR11.r2$adjr2, INDICES=factor(CH09PR11.r2$size), FUN=max )
lines( Rp2 ~ p )
CH09PR11.r2[["adjr2"]]
CH09PR11.r2[["which"]]
Tip 2: Mallow's $C_p$. Use R code leaps(x, y, method = "Cp")
library(leaps)
CH09PR11.cp <- leaps( x=cbind(X1,X2,X3,X4), y=Y, method='Cp')
p = seq( min(CH09PR11.cp$size), max(CH09PR11.cp$size) )
plot( CH09PR11.cp$Cp ~ CH09PR11.cp$size , ylab=expression(C[p]), xlab='p' )
Cpmin = by( data=CH09PR11.cp$Cp, INDICES=factor(CH09PR11.cp$size), FUN=min )
lines( Cpmin ~ p )
CH09PR11.cp[["Cp"]]
CH09PR11.cp[["which"]]
Tip 3: AIC and BIC. Use R code step() on an lm object; the default penalty k = 2 gives AIC, while k = log(n) gives BIC.
CH09PR10.lm <- lm( Y ~ X1+X2+X3+X4 )
CH09PR10.step <- step(CH09PR10.lm)   # AIC by default
Tip 4: $PRESS_p$. Use PRESS() from R package MPV to compute the PRESS statistic.
library(MPV)
library(leaps)
CH09PR23.r2 <- leaps( x=cbind(X1,X2,X3), y=Y, method='adjr2')
p = seq( min(CH09PR23.r2$size), max(CH09PR23.r2$size) )
CH09PR23.r2[["which"]]
PRESSp = numeric( length(CH09PR23.r2$size) )
SSEp = numeric( length(CH09PR23.r2$size) )
PRESSp[1] = PRESS( lm( Y ~ X1 ) )
PRESSp[2] = PRESS( lm( Y ~ X2 ) )
PRESSp[3] = PRESS( lm( Y ~ X3 ) )
PRESSp[4] = PRESS( lm( Y ~ X1+X2 ) )
PRESSp[5] = PRESS( lm( Y ~ X1+X3 ) )
PRESSp[6] = PRESS( lm( Y ~ X2+X3 ) )
PRESSp[7] = PRESS( lm( Y ~ X1+X2+X3 ) )
PRESSp[8] = PRESS( lm( Y ~ X1+X2+X1sq+X2sq ) )
PRESSp[9] = PRESS( lm( Y ~ X1+X2+X1X2 ) )
PRESSp[10] = PRESS( lm( Y ~ X1+X2+X1sq+X2X3 ) )
SSEp[1] = anova(lm(Y~X1))[2,2]
SSEp[2] = anova(lm(Y~X2))[2,2]
SSEp[3] = anova(lm(Y~X3))[2,2]
SSEp[4] = anova(lm(Y~X1+X2))[3,2]
SSEp[5] = anova(lm(Y~X1+X3))[3,2]
SSEp[6] = anova(lm(Y~X2+X3))[3,2]
SSEp[7] = anova(lm(Y~X1+X2+X3))[4,2]
SSEp[8] = anova(lm(Y~X1+X2+X1sq+X2sq))[5,2]
SSEp[9] = anova(lm(Y~X1+X2+X1X2))[4,2]
SSEp[10] = anova(lm(Y~X1+X2+X1sq+X2X3))[5,2]
plot( PRESSp[1:7] ~ CH09PR23.r2$size , ylab=expression(PRESS[p]), xlab='p' )
minPRESSp = by( data=PRESSp[1:7], INDICES=factor(CH09PR23.r2$size), FUN=min )
lines( minPRESSp ~ p )
Tip 5: Forward Stepwise Regression. Use step() (base R) starting from the intercept-only model with a scope formula.
step( lm(Y~1), scope=Y~X1+X2+X3+X4+X51999+X52000+X52001, direction='forward')
Tip 6: Backward Elimination. Use step() with direction='backward', or fastbw() from package rms. In step(), k is the penalty per degree of freedom: k = 2 for AIC and k = log(n) for BIC (n = length(Y)).
library(rms)
CH09PR20.lm <- lm(Y~X1+X2+X3+X4+X51999+X52000+X52001)
step( CH09PR20.lm, direction='backward', k=2 )
CH09PR20.ols <- ols( Y~X1+X2+X3+X4+X51999+X52000+X52001 )
fastbw( fit=CH09PR20.ols, rule='p', type='individual', sls=.10 )
Part 3 Outlier Detection
Tip 1: Studentized Deleted Residuals: internally studentized residuals (computed from the full fit) and externally studentized (deleted) residuals (leave-one-out). R code rstudent() gives the externally studentized residuals; rstandard() gives the internally studentized ones.
tcrit <- qt( 1-.5*(.05/52), 52-4-1 ) # Bonferroni adjustment needed
CH10PR10.lm <- lm( Y ~ X)
plot( rstudent(CH10PR10.lm) ~ fitted(CH10PR10.lm), ylim=c(-4,4) )
abline( h=tcrit , lty=2 )
abline( h=-tcrit, lty=2 )
abline( h=0 )
Tip 2: Hat Matrix Leverage Values. Use hatvalues() to get the leverages $h_{ii}$; the rule of thumb $h_{ii} > 2p/n$ can flag too many points when the number of parameters is large relative to n.
n <- length(Y)                  # sample size
p <- 3                          # number of regression parameters (including the intercept)
hii = hatvalues( CH10PR10.lm )
hii > 2*p/n                     # rule of thumb to flag high-leverage points
Part 4 Influence Analysis
Tip 1: DFFITS. The influence of the i-th observation on the i-th fitted value. Use R code dffits().
Tip 2: Cook's Distance. The influence of the i-th observation on all fitted values. Use R code cooks.distance().
Tip 3: DFBETAS. The influence of the i-th observation on the k-th coefficient. Use R code dfbetas(), or influence.measures() for all measures at once; the input is the model object.
DFF <- dffits(CH10PR10.lm)
abs(DFF) > 2*sqrt(p/n)                            # rule of thumb for influential DFFITS
DFB <- influence.measures( CH10PR10.lm )          # includes the DFBETAS columns
DFB[["infmat"]][c(16,22,43,48,10,32,38,40),]
plot( cooks.distance(CH10PR10.lm), type='o', pch=19 )
ei = resid( CH10PR10.lm )
yhat = fitted(CH10PR10.lm)
radius = sqrt( cooks.distance(CH10PR10.lm)/pi )   # circle area proportional to Cook's distance
plot( ei ~ yhat, pch='' )
abline( h=0 )
symbols( yhat, ei, circles=radius, inches=.15, bg='black', fg='white', add=T )
Part 5 Multicollinearity and Ridge Regression
Tip 1: Variance Inflation Factor. A VIF above 10 is commonly taken to signal serious multicollinearity.
require( car )
vif( lm(Y ~ X1+X2+X3) )
Tip 2: Ridge Regression. Use R package genridge.
lambda.grid = seq( from=.01, to=10, by=.01 )   # grid of ridge penalties
traceplot( ridge(U~Z1+Z2+Z3, lambda=lambda.grid) )