UA MATH571A Regression Analysis: Concepts and R Code Summary
- Simple Linear Regression
- Multiple Linear Regression
Part 0 Basic R code
Tip 1: Read in data
read.csv("D:/Stat PhD/taking course/summer1/ref/regression/Salary1.csv", header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "")
read.csv( file.choose() )
Tip 2: R code plot()
plot(x,y,…) or plot(y~x,…) to select variables
type = , "p" (points), "l" (lines), "b" (both), "c" (the lines of "b" alone), "h" (histogram-like vertical lines)
pch = (an integer 0-25, selecting one of 26 point characters)
lty = (an integer 1-6, selecting one of 6 line types)
lwd = (a number, line width)
col = "" (a color name such as "red", or a palette index)
main = "", xlab = "", ylab = "" (title and axis labels; the title() function can also add them after plotting)
xlim = c(,), ylim = c(,)
Use par( new = TRUE ) to overlay two plots in one window (keep the same xlim and ylim).
Tip 3: R code lm()
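A minimal sketch of lm(), assuming a data frame mydata with columns Y and X (the names are illustrative):
fit <- lm(Y ~ X, data = mydata)   # simple linear regression of Y on X
summary(fit)                      # coefficient table, R^2, overall F test
confint(fit)                      # confidence intervals for beta_0 and beta_1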
Tip 4: R code predict()
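A sketch of predict() with confidence and prediction intervals, continuing the hypothetical fit above:
newX <- data.frame(X = c(30, 50))                           # hypothetical new X values
predict(fit, newX, interval = "confidence", level = 0.95)   # CI for the mean response
predict(fit, newX, interval = "prediction", level = 0.95)   # PI for a new observation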
Simple Linear Regression
Part 1 Inference and Prediction
Tip 1: Model Assumption $Y_i \overset{ind.}{\sim} N(\beta_0+\beta_1X_i,\ \sigma^2)$, equivalently $\epsilon_i \overset{iid}{\sim} N(0,\sigma^2)$.
Tip 2: Define $k_i = \frac{X_i-\bar{X}}{\sum_{i=1}^n (X_i-\bar{X})^2}$. Estimators of the coefficients are
$$\hat{\beta}_1 = \sum_{i=1}^{n} k_i Y_i = \beta_1 + \sum_{i=1}^{n} k_i \epsilon_i \sim N\left(\beta_1,\ \frac{\sigma^2}{\sum_{i=1}^n(X_i-\bar{X})^2}\right)$$
$$\hat{\beta}_0 = \sum_{i=1}^{n} \left( \frac{1}{n}- k_i \bar{X}\right) Y_i = \beta_0 + \sum_{i=1}^{n} \left( \frac{1}{n}- k_i \bar{X}\right) \epsilon_i \sim N\left(\beta_0,\ \sigma^2 \left(\frac{1}{n}+\frac{\bar{X}^2}{\sum_{i=1}^n (X_i-\bar{X})^2}\right)\right)$$
Tip 3: The fitted value at $X_h$ and the residuals satisfy
$$\hat{Y}_h = \hat{\beta}_0 + \hat{\beta}_1 X_h \sim N\left(E\{Y_h\},\ \sigma^2 \left(\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)\right)$$
$$e_j = Y_j - \hat Y_j \sim N\left(0,\ \sigma^2 \left(1-\frac{1}{n} - \frac{(X_j - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)\right)$$
Tip 4: The prediction interval for a new observation at $X_h$ uses
$$Var(\hat{Y}_h-Y_h)=\sigma^2 \left(1+\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)$$
$$\hat{Y}_h - t(1-\tfrac{\alpha}{2},\,n-2)\,se\{\hat{Y}_h-Y_h\} < Y_h < \hat{Y}_h + t(1-\tfrac{\alpha}{2},\,n-2)\,se\{\hat{Y}_h-Y_h\}$$
Tip 5: ANOVA

| Source | SS | df | MS | F |
| --- | --- | --- | --- | --- |
| Regression | $SSR=\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2$ | 1 | $MSR = \frac{SSR}{df_R}$ | $F = \frac{MSR}{MSE}$ |
| Residuals | $SSE=\sum_{i=1}^n (Y_i - \hat{Y}_i)^2$ | n-2 | $MSE = \frac{SSE}{df_E}$ | |
| Total | $SST=\sum_{i=1}^n (Y_i - \bar{Y})^2$ | n-1 | $MST = \frac{SST}{df_T}$ | |
$$\frac{SSE}{\sigma^2} \sim \chi^2_{n-2}, \qquad F = \frac{MSR}{MSE} \sim F(1,\,n-2) \text{ under } H_0:\beta_1 = 0 \ \text{ vs } \ H_a:\beta_1 \ne 0$$
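In R, both the ANOVA table and the F test come straight from the fitted object (reusing the hypothetical fit from Part 0):
anova(fit)     # SSR, SSE, df, MSR, MSE and the F statistic
summary(fit)   # the same F test appears in the last line of the output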
Part 2 Residual Plots and Diagnostics
Tip 1: Residuals against sample index: check for serial correlation.
Tip 2: Residuals against the independent variable: check for a missing higher-order term.
Tip 3: Residuals against fitted values: check for heteroscedasticity.
Tip 4: Residuals against a potential independent variable: check for an important omitted variable.
Tip 5: Normal Probability Plot: check whether normality holds.
Use R code qqnorm() and qqline() to create a QQ plot.
Tip 6: Shapiro-Wilk test
Use R code shapiro.test() to run the Shapiro-Wilk test; fail to reject normality for a large p-value.
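For example, applied to the residuals of the hypothetical fit from Part 0:
shapiro.test( resid(fit) )   # H0: the residuals are normal; do not reject when p is large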
Tip 7: Brown-Forsythe test
Use R code leveneTest() from package car to run the Brown-Forsythe test (the default center = median) for unequal variance across groups; fail to reject homogeneity for a large p-value. Example,
ei = resid(Ex3.lm)
G <- (X<80)[order(X)]                      # split the sorted observations into two groups by X
group <- as.factor(G)
BF.htest = leveneTest(ei[order(X)], group)
Tip 8: Lack-of-fit F test
Fit the reduced and the full model, then use R code anova(Reduced, Full) to run the lack-of-fit F test; fail to reject the reduced model for a large p-value, as in the sketch below.
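A minimal sketch, assuming X has replicates at some values so the saturated means model is estimable (names are illustrative):
reduced <- lm( Y ~ X )           # straight-line model
full    <- lm( Y ~ factor(X) )   # full (cell-means) model; requires repeated X values
anova( reduced, full )           # lack-of-fit F test; large p favors the reduced model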
Part 3 Remedial Methods: Box-Cox, Family-wise Adjustment and Nonparametrics
Tip 1: Box-Cox transformation
If the assumptions fail (non-linearity, non-normality, or unstable variance), try a Box-Cox transformation of Y. Example
require( MASS )
Ex3.bc = boxcox( Ex3.lm,lambda=seq(-1, 1, 0.1), interp=F )
cbind( Ex3.bc$x, Ex3.bc$y )
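The estimated lambda is the grid value maximizing the profile log-likelihood (a small addition to the example above):
lambda.hat <- Ex3.bc$x[ which.max(Ex3.bc$y) ]   # lambda with the largest log-likelihood
lambda.hat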
Tip 2: Bonferroni Adjustment
Familywise alpha is at most m (the number of estimators) times the pointwise alpha, so run each interval at pointwise level α/m.
For statistics $T_i$, $i = 1,\dots,m$, the familywise CIs are
$$B = t(1-\tfrac{\alpha}{2m},\,n-2), \qquad [\,T_i - B\,se(T_i),\ T_i + B\,se(T_i)\,]$$
Tip 3: Working-Hotelling-Scheffe Bound
For statistics $T_i$, the familywise CIs are
$$W = \sqrt{m\,F(1-\alpha,\,m,\,n-2)}, \qquad [\,T_i - W\,se(T_i),\ T_i + W\,se(T_i)\,]$$
Here $\alpha$ enters without a Bonferroni-style division by m; the F quantile supplies the familywise coverage. For small m, Bonferroni usually gives the shorter interval; otherwise use WHS.
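A sketch comparing the two multipliers (m, n, alpha are hypothetical); whichever is smaller gives the tighter simultaneous intervals:
m <- 4; n <- 30; alpha <- 0.05            # hypothetical values
B <- qt( 1 - alpha/(2*m), n - 2 )         # Bonferroni multiplier
W <- sqrt( m * qf(1 - alpha, m, n - 2) )  # Working-Hotelling-Scheffe multiplier
min(B, W)                                 # use the smaller multiplier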
Tip 4: Heteroscedasticity and WLS: if homogeneity of variance is rejected, use weighted least squares (WLS), as sketched below.
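A common WLS sketch (variable names are illustrative), assuming the error SD varies roughly linearly with the mean response: estimate the SD by regressing |residuals| on fitted values, then refit with weights equal to the reciprocal of the estimated variance.
ols <- lm( Y ~ X )                               # ordinary least squares fit
sd.fit <- lm( abs(resid(ols)) ~ fitted(ols) )    # rough model of the error SD
w <- 1 / fitted(sd.fit)^2                        # weights = 1 / estimated variance
wls <- lm( Y ~ X, weights = w )
summary(wls)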
Tip 5: Nonparametric Regression: if normality is rejected, nonparametric regression is an alternative. Use R code loess() to fit and predict() to evaluate the nonparametric curve. Example
baseballSLR.lm <- lm(Y ~ X)
absresid = abs(resid(baseballSLR.lm))
Yhat = fitted(baseballSLR.lm)
baseball.lo = loess( absresid ~ Yhat, span = 0.5, degree = 2, family = 'symmetric' )
Ysmooth = predict( baseball.lo, data.frame(Yhat = seq(min(Yhat), max(Yhat), .001)) )
plot( absresid ~ Yhat, xlim = c(.25,.29), ylim = c(0,.11) )
par( new = TRUE )
plot( Ysmooth ~ seq(min(Yhat), max(Yhat), .001), type = 'l', lwd = 2,
      xaxt = 'n', yaxt = 'n', xlab = '', ylab = '', xlim = c(.25,.29), ylim = c(0,.11) )
Part 4 Correlation Analysis
Tip 1: PPMCC (Pearson product-moment correlation coefficient). Use R code cor.test() with method = "pearson".
Tip 2: Spearman's Rank test. Use R code cor.test() with method = "spearman". Both are sketched below.
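A sketch of both tests, assuming numeric vectors X and Y:
cor.test( X, Y, method = 'pearson' )    # PPMCC with t test of H0: rho = 0
cor.test( X, Y, method = 'spearman' )   # Spearman's rank correlation test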
Part 5 Inverse Prediction and Bootstraps
Tip 1: Inverse prediction
$$\hat{X}_h = \frac{Y_h - \hat{\beta}_0}{\hat{\beta}_1}, \qquad \frac{\hat{X}_h - X_h}{se\{predX\}} \sim t(n-2)$$
$$s^2\{predX\} = \frac{MSE}{\hat{\beta}_1^2} \left[1+\frac{1}{n}+\frac{(\hat{X}_h - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\right]$$
Tip 2: Bootstrapped confidence interval: above, $\hat\beta_1$ is treated as a constant, which is not really true. A better way to get a CI is the bootstrap. Example
theta_hat <- (50-finalpr2a.lm$coefficients[1])/finalpr2a.lm$coefficients[2]
ei <- resid(finalpr2a.lm)
Yhat <- fitted(finalpr2a.lm)
b1origin <- finalpr2a.lm$coefficients[2]
b0origin <- finalpr2a.lm$coefficients[1]
n <- length(Y)
B <- 5000
b0 <- numeric(B)
b1 <- numeric(B)
set.seed(2019)
for(b in 1:B){
  esti <- sample(ei, n, replace=T)   # resample residuals with replacement
  Yest <- Yhat + esti                # rebuild a bootstrap response vector
  b0[b] <- coef(lm(Yest~X1))[1]
  b1[b] <- coef(lm(Yest~X1))[2]
}
theta <- (50-b0)/b1
summary(theta)
theta <- sort(theta)
thetaL <- theta[126]      # approximately the 2.5% empirical quantile (B = 5000)
thetaU <- theta[4875]     # approximately the 97.5% empirical quantile
c(thetaL, thetaU)
hist(theta, probability = T)
abline(v=thetaL, lty=2, lwd=2)
abline(v=thetaU, lty=2, lwd=2)
Multiple Linear Regression
Part 1 Inference and Prediction
Tip 1: Model Assumption $Y \sim N(X\beta,\ \sigma^2 I_n)$
Tip 2: Estimators of the coefficients are $\hat{\beta} = (X^TX)^{-1}X^TY$
Tip 3: With the hat matrix $H = X(X^TX)^{-1}X^T$, the fitted values and residuals are $\hat{Y} = X\hat{\beta} = HY$ and $e = Y - \hat{Y} = (I_n - H)Y \sim N(0,\ \sigma^2(I_n - H))$.
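A sketch verifying these matrix formulas against lm() (X1, X2, Y are hypothetical numeric vectors):
Xmat <- cbind(1, X1, X2)                              # design matrix with intercept column
betahat <- solve( t(Xmat) %*% Xmat ) %*% t(Xmat) %*% Y
H <- Xmat %*% solve( t(Xmat) %*% Xmat ) %*% t(Xmat)   # hat matrix
e <- Y - H %*% Y                                      # residuals
cbind( betahat, coef( lm(Y ~ X1 + X2) ) )             # the two columns should agree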
Tip 4: ANOVA

| Source | SS | df | MS |
| --- | --- | --- | --- |
| Regression | $SSR=\hat{\beta}^TX^TY - \frac{1}{N}Y^TJY$ | p-1 | $MSR = \frac{SSR}{df_R}$ |
| Residuals | $SSE=Y^TY - \hat{\beta}^TX^TY$ | N-p | $MSE = \frac{SSE}{df_E}$ |
| Total | $SST=Y^T(I_N-\frac{J}{N})Y$ | N-1 | $MST = \frac{SST}{df_T}$ |
Tip 5: Sequential ANOVA. Applied to a single lm object, anova() reports sequential (Type I) sums of squares in the order the terms were entered; given several nested models, anova() tests them against one another in the order specified, as sketched below.
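A sketch of both uses (predictors are hypothetical):
anova( lm(Y ~ X1 + X2 + X3) )               # sequential SS: X1, then X2 | X1, then X3 | X1, X2
anova( lm(Y ~ X1), lm(Y ~ X1 + X2 + X3) )   # extra-sum-of-squares F test between nested models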
Part 2 Variable Selection
Tip 1: Adjusted coefficient of determination $R^2_{a,p}$. Use R code leaps(x, y, method = "adjr2")
library(leaps)
CH09PR11.r2 <- leaps( x=cbind(X1,X2,X3,X4), y=Y, method='adjr2')
p = seq( min(CH09PR11.r2$size), max(CH09PR11.r2$size) )
plot( CH09PR11.r2$adjr2 ~ CH09PR11.r2$size , ylab=expression(R[a]^2), xlab='p' )
Rp2 = by( data=CH09PR11.r2$adjr2, INDICES=factor(CH09PR11.r2$size), FUN=max )
lines( Rp2 ~ p )
CH09PR11.r2[["adjr2"]]
CH09PR11.r2[["which"]]
Tip 2: Mallow's $C_p$. Use R code leaps(x, y, method = "Cp")
library(leaps)
CH09PR11.cp <- leaps( x=cbind(X1,X2,X3,X4), y=Y, method='Cp')
p = seq( min(CH09PR11.cp$size), max(CH09PR11.cp$size) )
plot( CH09PR11.cp$Cp ~ CH09PR11.cp$size , ylab=expression(C[p]), xlab='p' )
Cpmin = by( data=CH09PR11.cp$Cp, INDICES=factor(CH09PR11.cp$size), FUN=min )
lines( Cpmin ~ p )
CH09PR11.cp[["Cp"]]
CH09PR11.cp[["which"]]
Tip 3: AIC and BIC. Use R code step() on an lm object; the default penalty k = 2 gives AIC, while k = log(n) gives BIC.
CH09PR10.lm <- lm( Y ~ X1+X2+X3+X4 )
CH09PR10.step <- step(CH09PR10.lm)   # AIC by default
Tip 4: $PRESS_p$. Use PRESS() from R package MPV to compute the PRESS statistic.
library(MPV)
library(leaps)
CH09PR23.r2 <- leaps( x=cbind(X1,X2,X3), y=Y, method='adjr2')
p = seq( min(CH09PR23.r2$size), max(CH09PR23.r2$size) )
CH09PR23.r2[["which"]]
PRESSp = numeric( length(CH09PR23.r2$size) )
SSEp = numeric( length(CH09PR23.r2$size) )
PRESSp[1] = PRESS( lm( Y ~ X1 ) )
PRESSp[2] = PRESS( lm( Y ~ X2 ) )
PRESSp[3] = PRESS( lm( Y ~ X3 ) )
PRESSp[4] = PRESS( lm( Y ~ X1+X2 ) )
PRESSp[5] = PRESS( lm( Y ~ X1+X3 ) )
PRESSp[6] = PRESS( lm( Y ~ X2+X3 ) )
PRESSp[7] = PRESS( lm( Y ~ X1+X2+X3 ) )
PRESSp[8] = PRESS( lm( Y ~ X1+X2+X1sq+X2sq ) )
PRESSp[9] = PRESS( lm( Y ~ X1+X2+X1X2 ) )
PRESSp[10] = PRESS( lm( Y ~ X1+X2+X1sq+X2X3 ) )
SSEp[1] = anova(lm(Y~X1))[2,2]
SSEp[2] = anova(lm(Y~X2))[2,2]
SSEp[3] = anova(lm(Y~X3))[2,2]
SSEp[4] = anova(lm(Y~X1+X2))[3,2]
SSEp[5] = anova(lm(Y~X1+X3))[3,2]
SSEp[6] = anova(lm(Y~X2+X3))[3,2]
SSEp[7] = anova(lm(Y~X1+X2+X3))[4,2]
SSEp[8] = anova(lm(Y~X1+X2+X1sq+X2sq))[5,2]
SSEp[9] = anova(lm(Y~X1+X2+X1X2))[4,2]
SSEp[10] = anova(lm(Y~X1+X2+X1sq+X2X3))[5,2]
plot( PRESSp[1:7] ~ CH09PR23.r2$size , ylab=expression(PRESS[p]), xlab='p' )
minPRESSp = by( data=PRESSp[1:7], INDICES=factor(CH09PR23.r2$size), FUN=min )
lines( minPRESSp ~ p )
Tip 5: Forward Stepwise Regression. Use step() (base R) starting from the intercept-only model with a scope formula.
step( lm(Y~1), scope=Y~X1+X2+X3+X4+X51999+X52000+X52001, direction='forward')
Tip 6: Backward Elimination. Use step() with direction='backward', or fastbw() from package rms. In step(), k is the penalty per degree of freedom: k = 2 for AIC and k = log(n) for BIC (n = length(Y)).
library(rms)
CH09PR20.lm <- lm(Y~X1+X2+X3+X4+X51999+X52000+X52001)
step( CH09PR20.lm, direction='backward', k=2 )
CH09PR20.ols <- ols( Y~X1+X2+X3+X4+X51999+X52000+X52001 )
fastbw( fit=CH09PR20.ols, rule='p', type='individual', sls=.10 )
Part 3 Outlier Detection
Tip 1: Studentized Deleted Residuals: internally studentized residuals (computed from the full fit) and externally studentized (deleted) residuals (leave-one-out). R code rstudent() gives the externally studentized residuals; rstandard() gives the internally studentized ones.
tcrit <- qt( 1-.5*(.05/52), 52-4-1 ) # Bonferroni adjustment needed
CH10PR10.lm <- lm( Y ~ X)
plot( rstudent(CH10PR10.lm) ~ fitted(CH10PR10.lm), ylim=c(-4,4) )
abline( h=tcrit , lty=2 )
abline( h=-tcrit, lty=2 )
abline( h=0 )
Tip 2: Hat Matrix Leverage Values. Use hatvalues() to get the leverages $h_{ii}$; the rule of thumb $h_{ii} > 2p/n$ can flag too many points when the number of parameters is large relative to n.
n <- length(Y)                  # sample size
p <- 3                          # number of regression parameters (including the intercept)
hii = hatvalues( CH10PR10.lm )
hii > 2*p/n                     # rule of thumb to flag high-leverage points
Part 4 Influence Analysis
Tip 1: DFFITS. The influence of the i-th observation on the i-th fitted value. Use R code dffits().
Tip 2: Cook's Distance. The influence of the i-th observation on all fitted values. Use R code cooks.distance().
Tip 3: DFBETAS. The influence of the i-th observation on the k-th coefficient. Use R code dfbetas(), or influence.measures() for all measures at once; the input is the model object.
DFF <- dffits(CH10PR10.lm)
abs(DFF) > 2*sqrt(p/n)                            # rule of thumb for influential DFFITS
DFB <- influence.measures( CH10PR10.lm )          # includes the DFBETAS columns
DFB[["infmat"]][c(16,22,43,48,10,32,38,40),]
plot( cooks.distance(CH10PR10.lm), type='o', pch=19 )
ei = resid( CH10PR10.lm )
yhat = fitted(CH10PR10.lm)
radius = sqrt( cooks.distance(CH10PR10.lm)/pi )   # circle area proportional to Cook's distance
plot( ei ~ yhat, pch='' )
abline( h=0 )
symbols( yhat, ei, circles=radius, inches=.15, bg='black', fg='white', add=T )
Part 5 Multicollinearity and Ridge Regression
Tip 1: Variance Inflation Factor. A VIF above 10 is commonly taken to signal serious multicollinearity.
require( car )
vif( lm(Y ~ X1+X2+X3) )
Tip 2: Ridge Regression. Use R package genridge.
lambda.grid = seq( from=.01, to=10, by=.01 )   # grid of ridge penalties
traceplot( ridge(U~Z1+Z2+Z3, lambda=lambda.grid) )