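Introduction to and Use of the NHANES Database (Part 2)
The previous post, "Introduction to and Use of the NHANES Database (Part 1)" (Christina's CSDN blog), covered NHANES survey weights and how to download the data. This post covers how to import the data into R, merge the files, and run the subsequent calculations.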
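Example 1:
This example follows a report based on the NHANES database (Brody DJ, Pratt LA, Hughes JP. Prevalence of depression among adults aged 20 and over: United States, 2013-2016. NCHS Data Brief, no 303. Hyattsville, MD: National Center for Health Statistics. 2018.)
1. Load the packages

library(dplyr)
library(survey)

2. Download the data
The files can be downloaded manually from the official website or directly from within R, as shown here.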

# Download & Read SAS Transport Files
# Demographic (DEMO)
download.file("https://wwwn.cdc.gov/nchs/nhanes/2013-2014/DEMO_H.XPT", tf <- tempfile(), mode="wb")
DEMO_H <- foreign::read.xport(tf)[,c("SEQN","RIAGENDR","RIDAGEYR","SDMVSTRA","SDMVPSU","WTMEC2YR")]
download.file("https://wwwn.cdc.gov/nchs/nhanes/2015-2016/DEMO_I.XPT", tf <- tempfile(), mode="wb")
DEMO_I <- foreign::read.xport(tf)[,c("SEQN","RIAGENDR","RIDAGEYR","SDMVSTRA","SDMVPSU","WTMEC2YR")]

# Mental Health - Depression Screener (DPQ)
download.file("http://wwwn.cdc.gov/nchs/nhanes/2013-2014/DPQ_H.XPT", tf <- tempfile(), mode="wb")
DPQ_H <- foreign::read.xport(tf)
download.file("http://wwwn.cdc.gov/nchs/nhanes/2015-2016/DPQ_I.XPT", tf <- tempfile(), mode="wb")
DPQ_I <- foreign::read.xport(tf)

3. Merge the data

# Append Files
DEMO <- bind_rows(DEMO_H, DEMO_I)
DPQ <- bind_rows(DPQ_H, DPQ_I)

# Merge DEMO and DPQ files and create derived variables
One <- left_join(DEMO, DPQ, by="SEQN") %>%
  # Set 7=Refused and 9=Don't Know To Missing for variables DPQ010 thru DPQ090
  mutate_at(vars(DPQ010:DPQ090), ~ifelse(. >=7, NA, .)) %>%
  mutate(. ,
    # Create indicator for overall summary
    one = 1,
    # Create depression score as sum of variables DPQ010 to DPQ090
    Depression.Score = rowSums(select(. , DPQ010:DPQ090)),
    # Create depression indicator as binary 0/100 variable (is missing if Depression.Score is missing)
    Depression = ifelse(Depression.Score >=10, 100, 0),
    # Create factor variables
    Gender = factor(RIAGENDR, labels=c("Men", "Women")),
    Age.Group = cut(RIDAGEYR, breaks=c(-Inf,19,39,59,Inf), labels=c("Under 20", "20-39","40-59","60 and over")),
    # Generate 4-year MEC weight (divide weight by 2 because we are appending 2 survey cycles)
    # Note: using the MEC Exam Weights (WTMEC2YR), per the analytic notes on the
    # Mental Health - Depression Screener (DPQ_H) documentation
    WTMEC4YR = WTMEC2YR/2,
    # Define indicator for analysis population of interest: adults aged 20 and over with a valid depression score
    inAnalysis = (RIDAGEYR >= 20 & !is.na(Depression.Score))
  ) %>%
  # Drop DPQ variables
  select(., -starts_with("DPQ"))

Because two 2-year survey cycles are combined, the weight has to be rescaled: WTMEC4YR = WTMEC2YR/2.
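The effect of this rescaling can be checked on a toy example. The sketch below uses made-up weights (not NHANES data) for two hypothetical cycles, each of which represents the same population of 1,000 people. Stacking the cycles doubles the weight total unless each 2-year weight is divided by the number of cycles.

```r
# Hypothetical 2-year weights for two survey cycles; each cycle's weights
# sum to the same population size (1000 people). Not real NHANES data.
cycle_h <- data.frame(SEQN = 1:4, WTMEC2YR = c(100, 200, 300, 400))
cycle_i <- data.frame(SEQN = 5:8, WTMEC2YR = c(250, 250, 250, 250))

combined <- rbind(cycle_h, cycle_i)
n_cycles <- 2

# Divide by the number of appended cycles so the combined weights
# again sum to one population rather than two.
combined$WTMEC4YR <- combined$WTMEC2YR / n_cycles

sum(combined$WTMEC2YR)  # 2000: the population is counted twice
sum(combined$WTMEC4YR)  # 1000: the population is counted once
```

Dividing every weight by the same constant leaves weighted means and proportions unchanged; the rescaling matters when estimating population totals.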
4. Define the survey design object

NHANES_all <- svydesign(data=One, id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC4YR, nest=TRUE)

Select the analysis subset from the design object:

NHANES <- subset(NHANES_all, inAnalysis)

5. Statistical analysis
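Define a function that returns the weighted mean, its standard error, and the unweighted sample size:

getSummary <- function(varformula, byformula, design){
  # Get mean, stderr, and unweighted sample size
  c <- svyby(varformula, byformula, design, unwtd.count)
  p <- svyby(varformula, byformula, design, svymean)
  outSum <- left_join(select(c,-se), p)
  outSum
}

Compute the prevalence of depression overall and by stratum:

#' Overall
getSummary(~Depression, ~one, NHANES)
#' By sex
getSummary(~Depression, ~Gender, NHANES)
#' By age
getSummary(~Depression, ~Age.Group, NHANES)
#' By sex and age
getSummary(~Depression, ~Gender + Age.Group, NHANES)

Note: when working with NHANES, first define the survey design object on the full sample and then take the subset with subset() on that object. Do not subset the raw data before defining the design; doing so can lead to biased estimates, in particular of the standard errors.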
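The ordering rule can be illustrated with a small simulated design. This is only a sketch under stated assumptions: the data frame `toy` and its columns are made up, with one PSU deliberately containing no adults. The point estimate is the same under both orderings, while the standard errors can differ because subsetting the data first removes that PSU from the design.

```r
library(survey)

set.seed(1)
# Hypothetical data: 2 strata x 3 PSUs x 10 people (not NHANES data)
toy <- data.frame(
  strata = rep(1:2, each = 30),
  psu    = rep(rep(1:3, each = 10), times = 2),
  wt     = runif(60, 500, 1500),
  age    = sample(20:80, 60, replace = TRUE),
  y      = rnorm(60, mean = 50, sd = 10)
)
# Make one PSU contain children only, so it disappears if we subset first
toy$age[toy$strata == 1 & toy$psu == 3] <- 10

# Recommended: define the design on the full sample, then subset the design
full_design   <- svydesign(id = ~psu, strata = ~strata, weights = ~wt,
                           nest = TRUE, data = toy)
adults_design <- subset(full_design, age >= 20)
est_design_first <- svymean(~y, adults_design)

# Not recommended: subset the data frame, then define the design
adults_only  <- subset(toy, age >= 20)
naive_design <- svydesign(id = ~psu, strata = ~strata, weights = ~wt,
                          nest = TRUE, data = adults_only)
est_subset_first <- svymean(~y, naive_design)

# Compare point estimates and standard errors side by side
rbind(design_first = c(mean = coef(est_design_first), se = SE(est_design_first)),
      subset_first = c(mean = coef(est_subset_first), se = SE(est_subset_first)))
```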
Example 2: a paper published in Environmental Science & Technology (EST) in 2021 (Sun et al.)
Exposure: chloroform (TCM); bromodichloromethane (BDCM); dibromochloromethane (DBCM); bromoform (TBM)
Outcome: thyroid function (FT4; FT3; TT4; TT3; TPOAb; TgAb)
Exclusion criteria: thyroid disease; prescription medications; pregnancy; age <20 years
Covariates: demographic data; serum cotinine
Year: 2007-2008
1. Download the data
Download the datasets that contain the exposure variables, outcome variables, and covariates used in the paper.
2. Import the data

setwd("C:\\Users\\18896\\Desktop\\NHANSE20211110\\example1")
library(foreign)

DEMO_E <- read.xport("DEMO_E.XPT")
BMX_E <- read.xport("BMX_E.XPT")
MCQ_E <- read.xport("MCQ_E.XPT")
RHQ_E <- read.xport("RHQ_E.XPT")
RXQ_RX_E <- read.xport("RXQ_RX_E.XPT")
# RXQ_RX_E has one row per medication; keep only the first row for each participant
RXQ_RX_E <- subset(RXQ_RX_E, !duplicated(RXQ_RX_E$SEQN))
THYROD_E <- read.xport("THYROD_E.XPT")
VOCMWB_E <- read.xport("VOCMWB_E.XPT")

3. Merge the data

## Combine the datasets
## Union (full outer join)
data_E <- DEMO_E
data_E <- merge(data_E, MCQ_E, by = "SEQN", all = TRUE)
data_E <- merge(data_E, BMX_E, by = "SEQN", all = TRUE)
data_E <- merge(data_E, RHQ_E, by = "SEQN", all = TRUE)
data_E <- merge(data_E, RXQ_RX_E, by = "SEQN", all = TRUE)
data_E <- merge(data_E, THYROD_E, by = "SEQN", all = TRUE)
data_E <- merge(data_E, VOCMWB_E, by = "SEQN", all = TRUE)

SEQN is the unique participant ID. When merging, set all=TRUE; otherwise participants who are missing from any one file are dropped from the sample.
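The difference between the join types can be seen on two small made-up tables. The data frames `demo` and `lab` below are hypothetical stand-ins for two NHANES files in which only some SEQN values overlap.

```r
# Hypothetical stand-ins for two NHANES files; only SEQN 3, 4, 5 appear in both
demo <- data.frame(SEQN = 1:5, age = c(25, 40, 61, 33, 58))
lab  <- data.frame(SEQN = 3:7, TT4 = c(7.1, 8.4, 6.9, 7.7, 9.0))

inner <- merge(demo, lab, by = "SEQN")                # default: intersection only
full  <- merge(demo, lab, by = "SEQN", all = TRUE)    # union of both files
left  <- merge(demo, lab, by = "SEQN", all.x = TRUE)  # every row of demo

nrow(inner)  # 3
nrow(full)   # 7
nrow(left)   # 5

# Rows without a match are kept and filled with NA
full[is.na(full$TT4), ]
```

Keeping every participant at this stage allows the survey design to be defined on the full sample; exclusions can then be applied with subset() on the design object.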
4. Rename the variables
Before running the calculations, rename the selected variables so that they are easier to recognize.

data_new <- plyr::rename(data_E, c(
  RIDAGEYR="age", DMDEDUC2="Education", RIDEXPRG="pregnant.status",
  RIAGENDR="Gender", RIDRETH1="race",
  BMXWT="weight", BMXHT="height", BMXBMI="BMI",
  LBXVBF="Bromoform", LBXVBM="Bromodichloromethane",
  LBXVCF="Chloroform", LBXVCM="Dibromochloromethane",
  LBXT3F="FT3", LBXT4F="FT4", LBXTT3="TT3", LBXTT4="TT4",
  LBXTPO="TPOAb", LBXATG="TgAb",
  RXDUSE="medication", MCQ160M="thyroid.deseases"))

5. Apply the survey weights

library(survey)

# Define the design on the full merged sample
design <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=TRUE, data=data_new)

# Subset the design object to the analysis population
design_new <- subset(design, SEQN %in% VOCMWB_E$SEQN & age>=20 & thyroid.deseases!=1 &
  !is.na(FT4) & !is.na(FT3) & !is.na(TT3) & !is.na(TT4) & !is.na(TPOAb) & !is.na(TgAb))

# The same subset as a plain data frame, for the unweighted comparisons
data_2 <- subset(data_new, SEQN %in% VOCMWB_E$SEQN & age>=20 & thyroid.deseases!=1 &
  !is.na(FT4) & !is.na(FT3) & !is.na(TT3) & !is.na(TT4) & !is.na(TPOAb) & !is.na(TgAb))

6. Descriptive analysis
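The code below computes the weighted and unweighted mean age and the weighted and unweighted race proportions. The weighted and unweighted results differ considerably, so baseline characteristics should be reported using the weighted results.

library(dplyr)

# Unweighted mean age
mean(data_2$age, na.rm=T) #49.54916

# Weighted mean age and SE
svymean(~age, design_new, na.rm = TRUE) #45.874

#' Proportion of unweighted interview sample
data_2 %>%
  count(race) %>%
  mutate(prop = round(n / sum(n)*100, digits=1))

#' Proportion of weighted interview sample
data_2 %>%
  count(race, wt=WTMEC2YR) %>%
  mutate(prop = round(n / sum(n)*100, digits=1))

When presenting these results in a paper, the following layout can be used as a reference: [image not preserved]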
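Why the weighted and unweighted summaries diverge can be seen on a small made-up sample. In the sketch below (hypothetical data, not NHANES), group A is oversampled: it supplies four of the six respondents but carries small weights, so it dominates the unweighted mean while contributing little to the weighted one.

```r
# Hypothetical sample: group A is oversampled (4 of 6 respondents),
# but each A respondent represents far fewer people than a B respondent.
toy <- data.frame(
  group = c("A", "A", "A", "A", "B", "B"),
  age   = c(70, 70, 70, 70, 30, 30),
  wt    = c(100, 100, 100, 100, 900, 900)
)

unweighted_mean <- mean(toy$age)                      # about 56.7
weighted_mean   <- weighted.mean(toy$age, w = toy$wt) # about 37.3

# Share of each group in the sample vs. in the weighted population
unweighted_prop <- prop.table(table(toy$group))                # A: 0.667, B: 0.333
weighted_prop   <- tapply(toy$wt, toy$group, sum) / sum(toy$wt) # A: 0.182, B: 0.818
```

These base R calls give point estimates only; standard errors that respect the strata and PSUs still require the survey design object, for example via svymean().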
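7. svyglm analysis
An ordinary glm() ignores the sampling weights, and a glm() with weights uses the weights but ignores the strata and clusters, so its standard errors are unreliable. Both can give biased results. The model should instead be fitted with svyglm() on the survey design object. The three approaches are compared below.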
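The difference between a weighted glm() and svyglm() can be sketched on simulated data. Everything in this block is an assumption made for illustration: the data frame `toy`, its design variables, and the PSU-level effect are made up. For a Gaussian model, the two fits should give the same coefficients, while the standard errors differ because only svyglm() accounts for the strata and PSUs.

```r
library(survey)

set.seed(42)
# Hypothetical clustered sample: 3 strata x 2 PSUs x 20 people (not NHANES data)
toy <- data.frame(
  strata = rep(1:3, each = 40),
  psu    = rep(rep(1:2, each = 20), times = 3),
  wt     = runif(120, 500, 5000),
  x      = rnorm(120)
)
# Add a shared PSU-level effect so observations within a PSU are correlated
psu_effect <- rnorm(6)[as.integer(interaction(toy$strata, toy$psu))]
toy$y <- 1 + 0.5 * toy$x + psu_effect + rnorm(120)

# Weighted glm: uses the weights but treats observations as independent
fit_weighted <- glm(y ~ x, family = gaussian(), data = toy, weights = wt)

# Survey-weighted glm: uses weights, strata, and PSUs
des <- svydesign(id = ~psu, strata = ~strata, weights = ~wt,
                 nest = TRUE, data = toy)
fit_survey <- svyglm(y ~ x, design = des, family = gaussian())

# Coefficients side by side
cbind(weighted_glm = coef(fit_weighted), svyglm = coef(fit_survey))
# Standard errors side by side
cbind(weighted_glm = sqrt(diag(vcov(fit_weighted))),
      svyglm       = sqrt(diag(vcov(fit_survey))))
```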

# glm
Result2 <- glm(TT4~Bromoform+age+Gender+race+BMI+Education, family = gaussian(), data=data_2)
summary(Result2)

# weighted glm
Result3 <- glm(TT4~Bromoform+age+Gender+race+BMI+Education, family = gaussian(), data=data_2, weights = WTMEC2YR)
summary(Result3)

# survey-weighted glm (svyglm takes the design object; it has no data argument)
Result1 <- svyglm(TT4~Bromoform+age+Gender+race+BMI+Education, family = gaussian(), design=design_new)
summary(Result1)

References:
Sun Y, Xia PF, Korevaar TI, Mustieles V, Zhang Y, Pan XF, Wang YX, Messerlian C. Relationship between Blood Trihalomethane Concentrations and Serum Thyroid Function Measures in US Adults. Environmental Science & Technology. 2021 Oct 7.
Brody DJ, Pratt LA, Hughes JP. Prevalence of depression among adults aged 20 and over: United States, 2013-2016. NCHS Data Brief, no 303. Hyattsville, MD: National Center for Health Statistics. 2018.
Emecen-Huja P, Li HF, Ebersole JL, Lambert J, Bush H. Epidemiologic evaluation of NHANES for environmental factors and periodontal disease. Scientific Reports. 2019 Jun 3;9(1):1-1.