Causal Inference: A Survey Framework
A survey on causal inference
Fundamental theory
Theoretical framework
Terminology
- individual treatment effect: $\mathrm{ITE}_i = Y_{1i} - Y_{0i}$
- average treatment effect: $\mathrm{ATE} = E(Y_{1i} - Y_{0i})$
- conditional average treatment effect: $\mathrm{CATE} = E(Y_{1i} - Y_{0i} \mid X)$
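As a quick illustration, all three effects can be computed directly when both potential outcomes are known, which is only ever true on synthetic data (the data-generating process below is a made-up example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data where both potential outcomes are known -- never true in practice.
x = rng.integers(0, 2, size=n)        # a binary covariate
y0 = rng.normal(1.0, 1.0, size=n)     # potential outcome under control
y1 = y0 + 2.0 + 0.5 * x               # potential outcome under treatment

ite = y1 - y0                         # ITE_i = Y_1i - Y_0i
ate = ite.mean()                      # ATE = E[Y_1i - Y_0i]
cate_x1 = ite[x == 1].mean()          # CATE = E[Y_1i - Y_0i | X = 1]

print(round(ate, 2), round(cate_x1, 2))   # ATE ≈ 2.25, CATE(X=1) = 2.5
```

In real observational data only one of `y0`/`y1` is observed per unit, which is exactly the counterfactual challenge below.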
Two challenges
- counterfactuals: the counterfactual outcomes can never be observed
- confounder bias: treatment is not randomly assigned
1 Rubin Causal Model (RCM)
The potential outcome model, also known as the Rubin Causal Model (RCM), aims to estimate the potential outcomes of each unit (or their population average) and from them the treatment effect (e.g. ITE/ATE).
Accurately estimating the potential outcomes is therefore the key to this framework. Because of confounders, the observed data cannot be used directly to approximate the potential outcomes; further processing is needed.
Core idea: estimate the potential outcomes accurately by finding a comparable control group
- matching: find the best control group, e.g. by propensity score
- weighting/pairing: re-weight the samples
- subclassification/stratification: split the sample into strata and estimate the CATE within each
2 Pearl Causal Graph (SCM)
Causal relations between variables are obtained by computing conditional distributions over the causal graph. The directed graph tells us which conditional distributions to use to remove estimation bias; the core is again to estimate the relevant distributions and eliminate the bias introduced by other variables.
- Chain: common in front-door paths; any influence of A on C must pass through B (A → B → C)
- Fork: the middle node B is a common cause (confounder) of A and C. A confounder makes A and C statistically associated even when they have no direct relation. Classic example: "shoe size ← child's age → reading ability". Children who wear larger shoes tend to be older and therefore tend to read better, but once age is fixed, A and C are conditionally independent.
- Collider: A and B are dependent, and B and C are dependent, but A and C are marginally independent; conditioning on B makes A and C dependent.
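The collider behavior can be checked numerically (a minimal sketch; the linear data-generating process and the selection cutoff `b > 1` are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Collider structure A -> B <- C: A and C are independent causes of B.
a = rng.normal(size=n)
c = rng.normal(size=n)
b = a + c + 0.1 * rng.normal(size=n)

# Marginally, A and C are uncorrelated ...
corr_marginal = np.corrcoef(a, c)[0, 1]

# ... but conditioning on B (here: selecting units with high B) induces
# a clear negative association between A and C.
mask = b > 1.0
corr_conditional = np.corrcoef(a[mask], c[mask])[0, 1]
print(round(corr_marginal, 3), round(corr_conditional, 3))
```

Selecting on high `b` means a large `a` can "explain away" the need for a large `c`, which is why the conditional correlation turns negative.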
Three assumptions
1. Unconfoundedness
Also called the conditional independence assumption (CIA); it deals with the X → T path.
Given the background variable X, treatment assignment T is independent of the potential outcomes Y:
$(Y_1, Y_0) \perp W \mid X$
Under this assumption, treatment is as good as randomly assigned among units with the same X.
2. Positivity
For any value of X, treatment assignment is not deterministic
$P(W=w \mid X=x) > 0$
Every treatment must have experimental samples; the more treatments and confounders there are, the more samples are required.
3. Consistency
Also referred to as the Stable Unit Treatment Value Assumption (SUTVA).
The potential outcomes for any unit do not vary with the treatment assigned to other units, and, for each unit, there are no different forms or versions of each treatment level, which lead to different potential outcomes.
Confounders
Confounders are the variables that affect both the treatment assignment and the outcome.
Confounders mostly cause spurious effects and selection bias.
- For the spurious effect, take a weighted sum over the distribution of X:
$$\text{ATE} = \sum_x p(x)\,\mathbb{E}\left[Y^F \mid X=x, W=1\right] - \sum_x p(x)\,\mathbb{E}\left[Y^F \mid X=x, W=0\right]$$
- For selection bias, find a corresponding pseudo group for each group, e.g. sample re-weighting, matching, tree-based methods, confounder balancing, balanced representation learning methods, multi-task-based methods.
Modeling methods
1. re-weighting methods*
By assigning appropriate weight to each unit in the observational data, a pseudo-population can be created on which the distributions of the treated group and control group are similar.
By assigning a weight to each observation, the distributions of the treated and control groups are adjusted to be close to each other. The key is how to choose the balancing score; the propensity score is a special case:
$e(x) = \Pr(W=1 \mid X=x)$
The propensity score can be used to balance the covariates in the treatment and control groups and therefore reduce the bias through matching, stratification (subclassification), regression adjustment, or some combination of all three.
1. Propensity Score Based Sample Re-weighting
IPW: $r = \frac{W}{e(x)} + \frac{1-W}{1-e(x)}$, where r is the weight assigned to each sample.
$$\mathrm{ATE}_{IPW} = \frac{1}{n} \sum_{i=1}^n \frac{W_i Y_i^F}{\hat{e}(x_i)} - \frac{1}{n} \sum_{i=1}^n \frac{(1-W_i) Y_i^F}{1-\hat{e}(x_i)}$$
After normalization,
$$\mathrm{ATE}_{IPW} = \sum_{i=1}^n \frac{W_i Y_i^F}{\hat{e}(x_i)} \Big/ \sum_{i=1}^n \frac{W_i}{\hat{e}(x_i)} - \sum_{i=1}^n \frac{(1-W_i) Y_i^F}{1-\hat{e}(x_i)} \Big/ \sum_{i=1}^n \frac{1-W_i}{1-\hat{e}(x_i)}$$
Drawback: heavily dependent on the accuracy of the estimated e(X).
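The normalized IPW estimator can be sketched on synthetic data (a minimal sketch; the data-generating process and the sklearn logistic model for e(x) are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000

# One confounder x drives both treatment assignment and outcome.
x = rng.normal(size=n)
w = rng.binomial(1, 1 / (1 + np.exp(-x)))            # true e(x) is logistic in x
y = 2.0 * w + 3.0 * x + rng.normal(size=n)           # true ATE = 2

# The naive difference in means is biased upward by the confounder.
naive = y[w == 1].mean() - y[w == 0].mean()

# Estimate e(x), then apply the normalized IPW estimator.
X = x.reshape(-1, 1)
e_hat = LogisticRegression().fit(X, w).predict_proba(X)[:, 1]
ate_ipw = (np.sum(w * y / e_hat) / np.sum(w / e_hat)
           - np.sum((1 - w) * y / (1 - e_hat)) / np.sum((1 - w) / (1 - e_hat)))
print(round(naive, 2), round(ate_ipw, 2))   # naive is far above 2, IPW is close to 2
```

The re-weighting creates the pseudo-population in which treated and control covariate distributions match, which is why the IPW estimate recovers the true effect while the naive contrast does not.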
DR (doubly robust): addresses inaccurate propensity-score estimates.
$$\begin{aligned} \mathrm{ATE}_{DR} &= \frac{1}{n} \sum_{i=1}^n \left\{ \left[ \frac{W_i Y_i^F}{\hat{e}(x_i)} - \frac{W_i - \hat{e}(x_i)}{\hat{e}(x_i)}\,\hat{m}(1, x_i) \right] - \left[ \frac{(1-W_i) Y_i^F}{1-\hat{e}(x_i)} - \frac{W_i - \hat{e}(x_i)}{1-\hat{e}(x_i)}\,\hat{m}(0, x_i) \right] \right\} \\ &= \frac{1}{n} \sum_{i=1}^n \left\{ \hat{m}(1, x_i) + \frac{W_i\,(Y_i^F - \hat{m}(1, x_i))}{\hat{e}(x_i)} - \hat{m}(0, x_i) - \frac{(1-W_i)\,(Y_i^F - \hat{m}(0, x_i))}{1-\hat{e}(x_i)} \right\} \end{aligned}$$
where $\hat{m}(1, x_i)$ and $\hat{m}(0, x_i)$ are the outcome regression models for the treated and control groups, respectively.
The estimator is robust even when one of the propensity score or outcome regression is incorrect (but not both).
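A minimal sketch of the doubly robust (AIPW) estimator in its second form above (synthetic data; the linear outcome models and sklearn estimators are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(size=n)
w = rng.binomial(1, 1 / (1 + np.exp(-x)))            # confounded assignment
y = 2.0 * w + 3.0 * x + rng.normal(size=n)           # true ATE = 2
X = x.reshape(-1, 1)

# Nuisance models: propensity e(x) and outcome regressions m(1, x), m(0, x).
e_hat = LogisticRegression().fit(X, w).predict_proba(X)[:, 1]
m1 = LinearRegression().fit(X[w == 1], y[w == 1]).predict(X)
m0 = LinearRegression().fit(X[w == 0], y[w == 0]).predict(X)

# AIPW / doubly robust estimator: the outcome models act as a baseline and the
# inverse-propensity terms correct their residuals.
ate_dr = np.mean(m1 + w * (y - m1) / e_hat
                 - m0 - (1 - w) * (y - m0) / (1 - e_hat))
print(round(ate_dr, 2))   # ≈ 2
```

Here both nuisance models happen to be well-specified; the double-robustness property says the estimate would stay consistent if either one (but not both) were wrong.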
2. Confounder Balancing
D2VD: Data-Driven Variable Decomposition
Under a separation assumption, the variables are split into confounders, adjustment variables, and irrelevant variables.
$$\mathrm{ATE}_{\mathrm{D}^2\mathrm{VD}} = \mathbb{E}\left[\left(Y^F - \phi(z)\right) \frac{W - p(x)}{p(x)(1-p(x))}\right]$$
where z denotes the adjustment variables.
Suppose $\alpha$ and $\beta$ separate the adjustment variables and the confounders respectively, i.e. $Y^*_{\mathrm{D}^2\mathrm{VD}} = (Y^F - X\alpha) \odot R(\beta)$, and $\gamma$ captures the ATE over all variables; the problem can then be formulated as
$$\begin{aligned} \operatorname{minimize} \quad & \left\| (Y^F - X\alpha) \odot R(\beta) - X\gamma \right\|_2^2 \\ \text{s.t.} \quad & \sum_{i=1}^N \log\left(1 + \exp\left((1 - 2W_i) \cdot X_i \beta\right)\right) < \tau \\ & \|\alpha\|_1 \le \lambda,\quad \|\beta\|_1 \le \delta,\quad \|\gamma\|_1 \le \eta,\quad \|\alpha \odot \beta\|_2^2 = 0 \end{aligned}$$
The first constraint acts as the regularizer (a logistic loss on the treatment assignment), and the last constraint guarantees that the adjustment variables and the confounders are separated.
2. stratification methods
$$\mathrm{ATE}_{\text{strat}} = \hat{\tau}^{\text{strat}} = \sum_{j=1}^J q(j)\left[\bar{Y}_t(j) - \bar{Y}_c(j)\right]$$
where the sample is divided into J blocks and $q(j)$ is the proportion of units in the j-th block.
The key is how to form the blocks; a typical method is equal-frequency splitting, grouping similar samples by an appearance probability such as the propensity score.
However, this approach suffers from high variance due to the insufficient overlap between treated and control groups in the blocks whose propensity score is very high or low.
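A minimal sketch of propensity-score stratification with equal-frequency blocks (the data-generating process is assumed, and for brevity the true propensity is used in place of an estimated one):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))                             # propensity, assumed known
w = rng.binomial(1, e)
y = 2.0 * w + 3.0 * x + rng.normal(size=n)           # true ATE = 2

# Equal-frequency blocks on the propensity score.
J = 50
edges = np.quantile(e, np.linspace(0.0, 1.0, J + 1))
block = np.clip(np.searchsorted(edges, e, side="right") - 1, 0, J - 1)

ate_strat = 0.0
for j in range(J):
    idx = block == j
    q_j = idx.mean()                                 # q(j): share of units in block j
    diff = y[idx & (w == 1)].mean() - y[idx & (w == 0)].mean()
    ate_strat += q_j * diff
print(round(ate_strat, 2))   # ≈ 2
```

Within a narrow propensity block, treated and control units have nearly the same covariate distribution, so the within-block differences are close to unbiased; the high-variance problem quoted above appears when the extreme blocks contain almost no treated (or no control) units.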
3. matching methods*
4. tree-based methods*
This approach is different from conventional CART in two aspects. First, it focuses on estimating conditional average treatment effects instead of directly predicting outcomes as in the conventional CART. Second, different samples are used for constructing the partition and estimating the effects of each subpopulation, which is referred to as an honest estimation. However, in the conventional CART, the same samples are used for these two tasks.
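The honest-estimation idea can be sketched loosely as follows. This is not the causal-tree splitting criterion from the literature: as a stand-in, the partition is built with an ordinary CART fitted on a transformed outcome $2(2W-1)Y$, whose conditional expectation equals the CATE when treatment is randomized with probability 0.5; everything else (data-generating process, tree depth) is an illustrative assumption:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=(n, 1))
w = rng.binomial(1, 0.5, size=n)                     # randomized treatment
tau_true = 1.0 + (x[:, 0] > 0)                       # CATE: 1 below 0, 2 above
y = tau_true * w + x[:, 0] + rng.normal(size=n)

# Honest estimation: one sample builds the partition, the other estimates effects.
idx_a, idx_b = train_test_split(np.arange(n), test_size=0.5, random_state=0)

# Partition sample A with a shallow tree on the transformed-outcome proxy.
proxy = 2 * (2 * w[idx_a] - 1) * y[idx_a]
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=500, random_state=0)
tree.fit(x[idx_a], proxy)

# Estimate the effect in each leaf using sample B only.
leaf_b = tree.apply(x[idx_b])
taus = {}
for leaf in np.unique(leaf_b):
    m = idx_b[leaf_b == leaf]
    taus[int(leaf)] = y[m][w[m] == 1].mean() - y[m][w[m] == 0].mean()
print(taus)   # leaves left of x = 0 estimate ≈ 1, leaves right of 0 ≈ 2
```

Because sample B never influenced where the splits landed, the per-leaf estimates avoid the overfitting bias that using one sample for both tasks would introduce.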
5. representation based methods
1. Domain Adaptation Based on Representation Learning
Unlike randomized controlled trials, the treatment-assignment mechanism is not explicit in observational data, so the counterfactual distribution will generally differ from the factual distribution.
The key is to reduce the gap between the counterfactual and factual distributions, i.e. between a source domain and a target domain.
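Balanced representation learning methods typically penalize a distribution distance between the treated and control representations; one common choice is the (squared) maximum mean discrepancy. A minimal sketch of computing it (the RBF kernel, its bandwidth, and the synthetic "representations" are assumptions):

```python
import numpy as np

def rbf_mmd2(a, b, sigma=1.0):
    """Biased estimate of the squared MMD between samples a and b (RBF kernel)."""
    def k(x, y):
        d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

rng = np.random.default_rng(0)
treated = rng.normal(1.0, 1.0, size=(500, 2))   # treated representations, shifted
control = rng.normal(0.0, 1.0, size=(500, 2))   # control representations

m_shift = rbf_mmd2(treated, control)                               # distributions differ
m_same = rbf_mmd2(control, rng.normal(0.0, 1.0, size=(500, 2)))    # same distribution
print(round(m_shift, 3), round(m_same, 3))
```

In a balancing method this quantity would be added to the outcome loss, pushing the encoder toward representations on which treated and control groups look alike.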
6. multi-task methods
7. meta-learning methods*
1. S-learner
The S-learner treats the treatment as a feature and trains a single model on all the data.
- step 1: $\mu(T, X) = E[Y \mid T, X]$
- step 2: $\hat{\tau} = \frac{1}{n}\sum_i \left(\hat{\mu}(1, X_i) - \hat{\mu}(0, X_i)\right)$
This method does not model the uplift directly; when X is high-dimensional, the treatment feature's effect can be drowned out.
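The two steps above can be sketched with a linear model (synthetic data with a randomized treatment for simplicity; the true ATE is 2 by construction):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
w = rng.binomial(1, 0.5, size=n)                     # randomized for simplicity
y = 2.0 * w + 3.0 * x + rng.normal(size=n)           # true ATE = 2

# Step 1: a single model with the treatment indicator as one more feature.
mu = LinearRegression().fit(np.column_stack([w, x]), y)

# Step 2: average mu(1, X_i) - mu(0, X_i) over all units.
mu1 = mu.predict(np.column_stack([np.ones(n), x]))
mu0 = mu.predict(np.column_stack([np.zeros(n), x]))
tau_hat = (mu1 - mu0).mean()
print(round(tau_hat, 2))   # ≈ 2
```

With a linear model the contrast collapses to the coefficient on the treatment feature; with a flexible model (trees, networks) the same two steps yield heterogeneous per-unit estimates.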
2. T-learner
The T-learner fits separate models for the control and treatment groups.
- step 1: $\mu_1(X) = E[Y \mid T=1, X]$, $\quad \mu_0(X) = E[Y \mid T=0, X]$
- step 2: $\hat{\tau} = \frac{1}{n}\sum_i \left(\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\right)$
Each estimator uses only part of the data, so when the sample is small or the treatment and control groups differ greatly in size, the models have high variance (the data is used inefficiently). The two models' biases can also point in different directions and accumulate, so in practice their score distributions need some calibration; and when the two datasets differ too much (in size, sampling bias, etc.), accuracy suffers noticeably.
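The two steps can be sketched the same way as for the S-learner (synthetic data with a randomized treatment; the true ATE is 2 by construction):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
w = rng.binomial(1, 0.5, size=n)
y = 2.0 * w + 3.0 * x + rng.normal(size=n)           # true ATE = 2
X = x.reshape(-1, 1)

# Step 1: one outcome model per group.
mu1 = LinearRegression().fit(X[w == 1], y[w == 1])
mu0 = LinearRegression().fit(X[w == 0], y[w == 0])

# Step 2: average the difference of the two models' predictions over all units.
tau_hat = (mu1.predict(X) - mu0.predict(X)).mean()
print(round(tau_hat, 2))   # ≈ 2
```

Note that each of `mu1` and `mu0` sees only about half the data here, which is exactly the inefficiency discussed above when the groups are imbalanced.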
3. X-learner
Building on the T-learner, the X-learner uses all of the data for prediction and mainly addresses large size imbalances between the treatment groups.
- step 1: fit two models $\hat{\mu}_1$ and $\hat{\mu}_0$ on the treatment and control groups, respectively
$$D_0 = \hat{\mu}_1(X_0) - Y_0, \qquad D_1 = Y_1 - \hat{\mu}_0(X_1)$$
- step 2: fit two models $\hat{\tau}_1$ and $\hat{\tau}_0$ on the imputed effects $D_1$ and $D_0$ of the treatment and control groups
$$\hat{\tau}_0 = f(X_0, D_0), \qquad \hat{\tau}_1 = f(X_1, D_1)$$
- step 3: introduce a propensity-score model $e(x)$ to weight the results and obtain the effect
$$e(x) = P(W=1 \mid X=x), \qquad \hat{\tau}(x) = e(x)\,\hat{\tau}_0(x) + (1 - e(x))\,\hat{\tau}_1(x)$$
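The three steps can be sketched end-to-end (synthetic data with a constant true effect of 2; linear models and a logistic propensity model are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
w = rng.binomial(1, 1 / (1 + np.exp(-x)))            # confounded assignment
y = 2.0 * w + 3.0 * x + rng.normal(size=n)           # true effect is 2 everywhere
X = x.reshape(-1, 1)
X1, Y1 = X[w == 1], y[w == 1]
X0, Y0 = X[w == 0], y[w == 0]

# Step 1: group-wise outcome models, as in the T-learner.
mu1 = LinearRegression().fit(X1, Y1)
mu0 = LinearRegression().fit(X0, Y0)

# Step 2: imputed effects D1, D0, then regress them on X within each group.
D1 = Y1 - mu0.predict(X1)
D0 = mu1.predict(X0) - Y0
tau1 = LinearRegression().fit(X1, D1)
tau0 = LinearRegression().fit(X0, D0)

# Step 3: blend the two effect models with a propensity-score model e(x).
e_hat = LogisticRegression().fit(X, w).predict_proba(X)[:, 1]
tau_x = e_hat * tau0.predict(X) + (1 - e_hat) * tau1.predict(X)
print(round(tau_x.mean(), 2))   # ≈ 2
```

The weighting in step 3 leans on $\hat{\tau}_0$ where treatment is common and on $\hat{\tau}_1$ where it is rare, which is what lets the X-learner exploit the larger group to help the smaller one.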
4. R-learner
Summary