硬核干货:如果样本量不一一样多,或者不是一一对应关系,如何做差异?相关?...
寫在前面
因為我們測定的樣本不總是可以匹配上的,但是最少的樣本也不能太少,我們測定的三個樣本做相關其實還是被質疑,但是測定了5或者6個重復,這個時候直接將樣本比較多的樣本過濾掉比較少的樣本不就可以了嗎?這樣做一次是沒有什么意義的,但是如果將這個過程重復多次,讓每個重復都使用的上,就會大大增加結果的可行性。當然審稿人就可以承認了。
例如下面這個文章
參考原文 ismej Metabolic regulation of the maize rhizobiome by benzoxazinoids
作者信息 :T. E. Anne Cotton1,2 ● Pierre Pétriacq1,2,3,5 ● Duncan D. Cameron1,2 ● Moaed Al Meselmani1,2,4 ● Roland Schwarzenbacher1,2 ● Stephen A. Rolfe 1,2 ● Jurriaan Ton 1,2
發表:2019年
作者親自書寫:
The correlations are calcuated on an average (as the data are not paired) but this will give a false sense of precision in some interactions. One way to cope with this is to randomly assign ions to specific OTU counts and calculate the correlation. If we do this multiple times and then take the average (and SD) we have a better measure of the true correlation.
To see what occurs by chance we can assign the ions randomly across all of the samples rather than just within a treatment. Some false positives will occur by chance but we can then assess whether we get more in specific interactions.
Algorithm
8 bacterial samples
6 metabolomic samples
Within a treatment assign 6 of the metabolomic samples at random to the bacterial samples
Calculate the correlation coefficient
Repeat this process x times and calculate the mean value
那么這個過程是如何實現的呢,下面我帶大家一起來實現這個過程:
構建隨機OTU矩陣
library(tidyverse)library(vegan)library (plyr)# 構建OTU表格 otu<-matrix(sample(1:500, 100),ncol=16,nrow = 300) row.names(otu) = paste("ASV_",1:nrow(otu),sep = "") colnames(otu) = c(paste("A",1:4,sep = ""),paste("B",1:4,sep = ""),paste("C",1:4,sep = ""),paste("D",1:4,sep = "")) head(otu)## A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4 D1 D2 D3 D4 ## ASV_1 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 ## ASV_2 259 259 259 259 259 259 259 259 259 259 259 259 259 259 259 259 ## ASV_3 407 407 407 407 407 407 407 407 407 407 407 407 407 407 407 407 ## ASV_4 272 272 272 272 272 272 272 272 272 272 272 272 272 272 272 272 ## ASV_5 209 209 209 209 209 209 209 209 209 209 209 209 209 209 209 209 ## ASV_6 304 304 304 304 304 304 304 304 304 304 304 304 304 304 304 304#--相對豐度標準化 otunorm = t(t(otu)/colSums(otu)) otunorm = as.data.frame(t(otunorm))otunorm$Group = c(rep("A",4),rep("B",4),rep("C",4),rep("D",4))構建代謝組隨機矩陣
代謝組表格和otu表格處理相同,但是有的重復為3個,有的重復為5個,不僅僅自己不統一重復數量,同時和otu的重復數量也不一致。
# 構建代謝組矩陣 #GC<-matrix(sample(1:500, 100),ncol=16,nrow = 100) row.names(GC) = paste("GC_",1:nrow(GC),sep = "") colnames(GC) = c(paste("A",1:3,sep = ""),paste("B",1:5,sep = ""),paste("C",1:3,sep = ""),paste("D",1:5,sep = "")) head(GC)## A1 A2 A3 B1 B2 B3 B4 B5 C1 C2 C3 D1 D2 D3 D4 D5 ## GC_1 186 186 186 186 186 186 186 186 186 186 186 186 186 186 186 186 ## GC_2 171 171 171 171 171 171 171 171 171 171 171 171 171 171 171 171 ## GC_3 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 ## GC_4 321 321 321 321 321 321 321 321 321 321 321 321 321 321 321 321 ## GC_5 419 419 419 419 419 419 419 419 419 419 419 419 419 419 419 419 ## GC_6 199 199 199 199 199 199 199 199 199 199 199 199 199 199 199 199#--相對豐度標準化 GCnorm = t(t(GC)/colSums(GC)) # mapGC = data.frame(ID =c(paste("A",1:3,sep = ""),paste("B",1:5,sep = ""),paste("C",1:3,sep = ""),paste("D",1:5,sep = "")), # Group = c(rep("A",3),rep("B",5),rep("C",3),rep("D",5))) GCnorm = as.data.frame(t(GCnorm))GCnorm$Group = c(rep("A",3),rep("B",5),rep("C",3),rep("D",5))提取各自的重復數量
#make the code robust to allow different sample depths otu_reps<-data.frame(dplyr::count(otunorm,Group)) met_reps<-data.frame(dplyr::count(GCnorm,Group)) #get the minimum number otu_min_reps<-min(otu_reps$n) otu_min_reps## [1] 4met_min_reps<-min(met_reps$n) met_min_reps## [1] 3min_reps=min(otu_min_reps,met_min_reps) min_reps## [1] 3準備運行參數
相關參數和運行結果存儲對象,順便做一個進度條
depth=min_reps repeats=10 cormethod = "spearman" pb <- txtProgressBar(min=1,max=repeats,style=3,width=50)cor_list<-vector("list",repeats) a = 1 R = c() p = c()for (a in 1:repeats){Sys.sleep(0.1)setTxtProgressBar(pb,a)#this has two effects - it randomises the order and reduces the sample number to depthotu_select<-data.frame(otunorm %>% group_by(Group) %>% sample_n(depth))otu_select$Group = NULLdistotu <- vegan::vegdist(otu_select) met_select<-data.frame(GCnorm %>% group_by(Group) %>% sample_n(depth))met_select$Group = NULLdistGC <- vegan::vegdist(otu_select,"euclid") res <- vegan::mantel(distotu, distGC, method="spearman")p[a] = res$signifR[a] = res$statistic }## | | | 0%result = data.frame(R_mean = mean(R),R_sd = sd(R),p_mean = mean(p),p_sd = sd(p))這里的R值和p值都是NA,因為通過隨機矩陣并沒有差異,所以距離矩陣都是0,使得兩個模擬數據不存在相關或者差異。
result## R_mean R_sd p_mean p_sd ## 1 NA NA NA NA往期精品(點擊圖片直達文字對應教程)
機器學習
后臺回復“生信寶典福利第一波”或點擊閱讀原文獲取教程合集
總結
以上是生活随笔為你收集整理的硬核干货:如果样本量不一一样多,或者不是一一对应关系,如何做差异?相关?...的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 轻轻松松画个热图
- 下一篇: 哈佛大学教授刘小乐:我与生物信息学的不解