Data Download Work Notes 3: Scripts
I wrote three scripts in total. This was my first time writing shell scripts, and they are quite clumsy; looking back after finishing, the quality really is poor.
Script 1: addLinkFiles.sh
In the current directory there is a file named xmlfile.xml, which is edited by hand. Tags in the file express a two-level directory hierarchy, for example:
<class>Life-sciences</class>
<dataset>uniprot</dataset>
<location>http://www.example.com/example.nt.gz</location>
I did not use a real XML tree structure, because parsing one in shell is just too much trouble. Instead, relative position expresses the parent-child relationship, so strictly speaking this is not an XML file at all, just something vaguely XML-shaped. The three lines above say that the class contains a dataset and the dataset contains one download link; the corresponding directory structure is "Life-sciences/uniprot/link", where the link file stores the links. On startup, addLinkFiles.sh reads this file and checks whether linkpath = ${class}/${dataset}/link exists under ./download/data/. If not, it creates the directories and the file, appends the download link to the link file, and appends ${linkpath} to ${modifiedLinkFile}. Every ${interval} seconds the script checks the modification time of ${xmlfile}; if it changed, a new location was added, so the script re-checks the correspondence between the directories and the xml content and creates directories and files for anything newly added.
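The per-line "parsing" the script relies on can be shown in isolation. A minimal sketch of that step (the sample line is the one from the example above): awk splits on `<`, `>` and space to recover the tag name, then splits on the tag pair to recover the value between `<tag>` and `</tag>`.

```shell
#!/bin/sh
# Sketch of the line-parsing trick used by addLinkFiles.sh.
line='<location>http://www.example.com/example.nt.gz</location>'

# Field 2 after splitting on '<', '>' or space is the tag name.
tag=$(echo "$line" | awk -F "<|>| " '{print $2}')
# Field 2 after splitting on "<tag>" or "</tag>" is the value.
value=$(echo "$line" | awk -F "<$tag>|</$tag>" '{print $2}')

echo "$tag"    # location
echo "$value"  # http://www.example.com/example.nt.gz
```

This is why the file only has to look like XML: one complete `<tag>value</tag>` per line is all the "parser" can handle.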
#!/bin/bash
#*********************************************************
# addLinkFiles.sh
# Keep checking the $xmlfile.
# The $xmlfile should have only 3 tags: class, name, location.
#
# last edited 2013.09.03 by Lyuxd.
#*********************************************************

#******************
#----init----------
#******************
interval=10
rootDir=${PWD}
dataDir="$rootDir/data"
logDir="$rootDir/log"
link="link"
log="add.log"
modifiedLinkFile="modifiedlinkfile"
xmlfile="xmlfile.xml"
level1="class"
level2="name"
level3="location"
currentClass=$rootDir
currentDataSet=$rootDir
xmlLastMT=0

cd "$rootDir"

#****************************************
#------Create Data, Log Directories------
#****************************************
if [ ! -d "$dataDir" ]; then
    mkdir "$dataDir"
fi
if [ ! -d "$logDir" ]; then
    mkdir "$logDir"
fi

#****************************************
#------Parsing the xmlfile---------------
#****************************************
if [ ! -f "$xmlfile" ]; then
    echo "`date "+%Y.%m.%d-%H:%M:%S--ERROR: "`No xmlfile found. exit." >> "$logDir/$log"
    exit 1
fi

# Check the modified time of the xmlfile every $interval sec.
# If the modified time changed, parse the xmlfile.
while true
do
    xmlMT=$(stat -c %Y "$xmlfile")
    if [ "$xmlLastMT" -lt "$xmlMT" ]; then
        xmlLastMT=$xmlMT
        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`parsing $xmlfile..." >> "$logDir/$log"
        while read line
        do
            # Skip empty lines.
            if [ "$line"x != x ]; then
                tmp=$(echo $line | awk -F "<|>| " '{print $2}')
                tag=$(echo $tmp)
                # Check whether the "class" directory exists. If not, create it.
                if [ "$tag"x = "$level1"x ]; then
                    currentClass=$(echo $line | awk -F "<$tmp>|</$tmp>" '{print $2}')
                    currentClass=$(echo $currentClass)
                    currentDataSet=$rootDir
                    if [ ! -z "$currentClass" ] && [ ! -d "$dataDir/$currentClass" ]; then
                        mkdir "$dataDir/$currentClass"
                        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`mkdir $dataDir/$currentClass" >> "$logDir/$log"
                    fi
                # Check whether the "name" directory exists. If not, create it.
                elif [ "$tag"x = "$level2"x ] && [ "$currentClass" != "$rootDir" ]; then
                    currentDataSet=$(echo $line | awk -F "<$tmp>|</$tmp>" '{print $2}')
                    currentDataSet=$(echo $currentDataSet)
                    if [ ! -z "$currentClass" ] && [ ! -z "$currentDataSet" ] && [ ! -d "$dataDir/$currentClass/$currentDataSet" ]; then
                        mkdir "$dataDir/$currentClass/$currentDataSet"
                        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`mkdir $dataDir/$currentClass/$currentDataSet" >> "$logDir/$log"
                    fi
                # Check whether the "link" file exists. If not, create it.
                elif [ "$tag"x = "$level3"x ] && [ ! -z "$currentClass" ] && [ ! -z "$currentDataSet" ] && [ "$currentDataSet" != "$rootDir" ] && [ -d "$dataDir/$currentClass/$currentDataSet" ]; then
                    if [ ! -f "$dataDir/$currentClass/$currentDataSet/$link" ]; then
                        touch "$dataDir/$currentClass/$currentDataSet/$link"
                        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`Create link file : $dataDir/$currentClass/$currentDataSet/$link" >> "$logDir/$log"
                    fi
                    newRecord=$(echo $line | awk -F "<$tmp>|</$tmp>" '{print $2}')
                    ifexist=$(grep "$newRecord" "$dataDir/$currentClass/$currentDataSet/$link")
                    if [ ! -z "$newRecord" ] && [ -z "$ifexist" ]; then
                        # No identical record exists yet.
                        echo "$newRecord" >> "$dataDir/$currentClass/$currentDataSet/$link"
                        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`Add new link $newRecord to $dataDir/$currentClass/$currentDataSet/$link" >> "$logDir/$log"
                        echo "$dataDir/$currentClass/$currentDataSet/$link" >> "$logDir/modifiedLinkFile.tmp"
                    fi
                else
                    echo "`date "+%Y.%m.%d-%H:%M:%S--ERROR: "`Failed to process $line" >> "$logDir/$log"
                fi
            fi
        done < "$xmlfile"

        #******************************
        # modifiedLinkFile.tmp contains the paths that were modified in the last loop.
        # Deduplicate modifiedLinkFile.tmp --> modifiedLinkFile
        #******************************
        if [ -f "$logDir/modifiedLinkFile.tmp" ]; then
            cat "$logDir/modifiedLinkFile.tmp" | awk '!a[$0]++{"date \"+%Y%m%d%H%M%S\"" | getline time; print time,$0}' >> "$logDir/$modifiedLinkFile"
            rm "$logDir/modifiedLinkFile.tmp"
        else
            touch "$logDir/$modifiedLinkFile"
        fi
    fi
    sleep $interval
done
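The deduplication step at the end relies on the classic awk idiom `!a[$0]++`: the associative array `a` counts how many times each line has been seen, so the pattern is true only for the first occurrence, and only that occurrence gets a timestamp prepended and printed. A standalone sketch with illustrative paths:

```shell
#!/bin/sh
# Keep only the first occurrence of each line, prefixing a timestamp,
# as the dedup step of addLinkFiles.sh does.
printf 'data/a/link\ndata/b/link\ndata/a/link\n' |
awk '!a[$0]++{"date \"+%Y%m%d%H%M%S\"" | getline time; print time, $0}'
```

The pipeline prints two lines (one for data/a/link, one for data/b/link); the duplicate third input line is suppressed.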
Script 2: checkmodifiedLinkFiles.sh
The modifiedlinkfile is populated by script 1 above. Script 2 checks the modification time of modifiedlinkfile every interval seconds; if it changed, the file was modified by script 1, which means new download links were added to xmlfile and the corresponding directories were created. Script 2 then takes each record out of modifiedlinkfile (a record is the absolute path of a newly created link file) and calls script 3, monitor.sh, to run the download task.
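Each record that script 2 reads has the shape "timestamp path", and the script slices it apart with awk. A minimal sketch of that slicing, using a hypothetical record (the path here is illustrative, not from the real data set):

```shell
#!/bin/sh
# A modifiedlinkfile record is "<timestamp> <absolute path of a link file>".
record='20130910120000 /home/dl/data/Life-sciences/uniprot/link'

# Second whitespace-separated field: the link-file path.
newLink=$(echo "$record" | awk '{print $2}')
# Last "/"-separated field: the file name ("link").
linkfileName=$(echo "$newLink" | awk -F "/" '{print $NF}')
# Everything before the file name: the download directory.
downloadDir=$(echo "$newLink" | awk -F "$linkfileName" '{print $1}')

echo "$downloadDir"    # /home/dl/data/Life-sciences/uniprot/
echo "$linkfileName"   # link
```

Note that splitting on the file name with `-F "$linkfileName"` only works because "link" happens not to occur earlier in the path; `dirname` would be the robust choice.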
#!/bin/bash
#*************************************************
# This script reads in modifiedLinkFile,
# for every record calling monitor.sh:
# monitor.sh /home/class/name "wget -c -i link -b"
#
# last edited 2013.09.10 by lyuxd.
#*************************************************
interval=10
rootDir=${PWD}
dataDir="$rootDir/data"
logDir="$rootDir/log"
failedqueue="$logDir/failedQueue"
runningTask="$logDir/runningTask"
modifiedLinkFile="$logDir/modifiedLinkFile"
modifiedLinkFileMT="$logDir/modifiedLinkFile.MT"
log="$logDir/check.log"
maxWgetProcess=5

echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`check is running..." >> $log

#*****************************************
#-----------restart interrupted tasks-----
#*****************************************
if [ -f "$runningTask" ]; then
    while read line
    do
        counterWgetProcess=$(ps -A | grep -c "monitor.sh")
        while [ $counterWgetProcess -ge $maxWgetProcess ]
        do
            sleep 20
            counterWgetProcess=$(ps -A | grep -c "monitor.sh")
        done
        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`Call ./monitor for $line." >> $log
        nohup "./monitor.sh" "$line" "wget -nd -c -i link -b" >> /dev/null &
        sleep 1
    done < "$runningTask"
fi

#*********************************
#------------failedQueue----------
#*********************************
#if [ -f "$failedqueue" ] && [ `ls -l "$failedqueue" | awk '{print $5}'` -gt "0" ]; then
#    line=($(awk '{print $0}' $failedqueue))
#    echo ${line[1]}
#    : > "$failedqueue"
#    for ((i=0; i<${#line[@]}; i++))
#    do
#        counterWgetProcess=$(ps -A | grep -c "monitor.sh")
#        while [ $counterWgetProcess -ge $maxWgetProcess ]
#        do
#            sleep 20
#            counterWgetProcess=$(ps -A | grep -c "monitor.sh")
#        done
#        echo "./monitor.sh" "${line[i]}" "wget -nd -c -i link -b"
#        "./monitor.sh" "${line[i]}" "wget -nd -c -i link -b" >> /dev/null &
#    done
#fi

#***************************************************
#------------check new task in modifiedLinkFile-----
#***************************************************
if [ ! -f "$modifiedLinkFile" ]; then
    echo "`date "+%Y.%m.%d-%H:%M:%S--"`No modifiedLinkFile found. checkmodifiedLinkFiles.sh exit 1." >> $log
    exit 1
fi
if [ ! -f "$modifiedLinkFileMT" ]; then
    echo "0" > "$modifiedLinkFileMT"
fi
while true
do
    newMT=$(stat -c %Y "$modifiedLinkFile")
    oldMT=$(awk '{print $0}' "$modifiedLinkFileMT")
    if [ "$newMT" != "$oldMT" ]; then
        while read line
        do
            if [ ! -z "$line" ]; then
                counterWgetProcess=$(ps -A | grep -c "monitor.sh")
                while [ $counterWgetProcess -ge $maxWgetProcess ]
                do
                    sleep 20
                    counterWgetProcess=$(ps -A | grep -c "monitor.sh")
                done
                newLink=$(echo $line | awk '{print $2}')
                linkfileName=$(echo $newLink | awk -F "/" '{print $NF}')
                downloadDir=$(echo $newLink | awk -F "$linkfileName" '{print $1}')
                echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`Call ./monitor for $downloadDir." >> $log
                "./monitor.sh" "$downloadDir" "wget -nd -c -i $linkfileName -b" >> /dev/null &
                sleep 1
            fi
        done < "$modifiedLinkFile"
        : > "$modifiedLinkFile"
        echo $(stat -c %Y "$modifiedLinkFile") > "$modifiedLinkFileMT"
    fi
    sleep $interval
done

Script 3: monitor.sh

This script is mainly called by script 2 and performs the actual download. Before downloading, it creates a wgetlog directory under the dataset directory (e.g. Life-sciences/uniprot) to hold the wget download log. During the download, monitor.sh checks the size of the log file every $interval seconds. Once the size stays unchanged across two consecutive checks, it reads the last three lines of the log; if it finds a keyword such as FINISH or failed, it stops the download and sends a notification by e-mail. If no keyword appears in those last three lines, it assumes the network speed dropped to zero, which is why the log stopped growing; it re-checks the log size after another interval and repeats this up to maxtrytimes times, and if the log still has not grown, it reports the error by e-mail.
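The stall detection just described can be reduced to a few lines: sample the log-file size twice, and if it did not grow, scan the tail for a status keyword. A self-contained sketch using a temporary file in place of a real wget log (the "FINISHED" line imitates what wget writes on completion):

```shell
#!/bin/sh
# Sketch of the stall check in monitor.sh, using a fake wget log.
logfile=$(mktemp)
printf 'downloading...\nFINISHED --2013-09-10--\n' >> "$logfile"

# Sample the size; in the real script the two samples are
# $interval seconds apart.
oldsize=$(ls -l "$logfile" | awk '{print $5}')
newsize=$(ls -l "$logfile" | awk '{print $5}')

if [ "$newsize" -eq "$oldsize" ]; then
    # No growth: inspect the log tail for FINISH / failed keywords.
    if tail -n3 "$logfile" | grep -q "FINISH\|failed"; then
        echo "keyword found"
    fi
fi
rm -f "$logfile"
```

File size is a crude progress signal, but it avoids parsing wget's progress output entirely, which is what keeps the real script simple.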
#!/bin/bash
#*********************************************************
# monitor download directory.
# One monitor.sh process is started for one download task.
# If some url in $downloadDir/link can't be reached, monitor
# will log "WARNING". If the download failed, log "ERROR".
# If finished, log "FINISH".
# Mail to $mailAddress.
#
# Last edited 2013.09.04 by Lyuxd.
#*********************************************************

# Every $interval sec check the size of wgetlog.
interval=30
# If the size of wgetlog stays the same, check up to $maxtrytimes times.
maxtrytimes=5

downloadDir=$1
command=$2

rootDir=${PWD}
dataDir="$rootDir/data"
logDir="$rootDir/log"
log="$logDir/monitor.log"
wgetlogDir="$downloadDir/wgetlog"
wgetlogname="`date +%Y%m%d%H%M%S`-wgetlog"
wgetlog="$wgetlogDir/$wgetlogname"
failedqueue="$logDir/failedQueue"
runningTask="$logDir/runningTask"
mailAddress="15822834587@139.com"
lastERROR="e"
addtoBoolean=0

cd "$downloadDir"
sleep 1
counterMail=0

echo "`date "+%Y.%m.%d-%H:%M:%S--"`Monitor for directory: ${PWD}." >> $log
whereAmI=$(echo ${PWD} | awk -F "/" '{print $NF}')

if [ ! -d "$wgetlogDir" ]; then
    mkdir "$wgetlogDir"
fi

# Put the current task into runningTask in case of power-off. When
# checkmodifiedLinkFiles.sh comes up, runningTask is checked for
# interrupted tasks, and any interrupted task is started again by
# checkmodifiedLinkFiles.sh.
isexist=$(grep "$downloadDir" "$runningTask")
if [ -z "$isexist" ]; then
    echo "$downloadDir" >> "$runningTask"
fi

# Begin downloading.
$command -b -o "$wgetlog" &

# Check the size of the logfile every $interval seconds. If the size is
# the same as in the last check, wait another $interval and try again
# (up to $maxtrytimes in total), then read the wgetlog to find out
# whether something went wrong. Mail to $mailAddress.
trytimesRemain=$maxtrytimes
logoldsize=0
sleep 10
lognewsize=$(ls -l "$wgetlog" | awk '{print $5}')
while [ ! -z "$lognewsize" ] && [ "$trytimesRemain" -gt 0 ]
do
    # If the log size stayed unchanged, look for "FINISH" in the log tail.
    if [ "$lognewsize" -eq "$logoldsize" ]; then
        message=$(tail -n3 "$wgetlog")
        level=$(echo $message | grep "FINISH")
        if [ -z "$level" ]; then
            trytimesRemain=`expr $trytimesRemain - 1`
            echo "`date "+%Y.%m.%d-%H:%M:%S--"`WARNING: $downloadDir Download speed 0.0 KB/s. MaxTryTimes=$maxtrytimes. Try(`expr $maxtrytimes - $trytimesRemain`)." >> $log
        else
            break
        fi
    else
        trytimesRemain=$maxtrytimes
    fi
    ERROR=$(tail -n250 "$wgetlog" | grep "ERROR\|failed")
    if [ ! -z "$ERROR" ] && [ "$ERROR" != "$lastERROR" ] && [ "$counterMail" -lt 5 ]
    then
        echo "`date "+%Y.%m.%d-%H:%M:%S--"`WARNING: $downloadDir $ERROR. mail to $mailAddress." >> $log
        echo -e "${PWD}\n$ERROR\n" | mutt -s "Wget Running State : WARNING in $whereAmI" $mailAddress
        counterMail=`expr $counterMail + 1`
        lastERROR=$ERROR
        addtoBoolean=1
    fi
    logoldsize=$lognewsize
    sleep $interval
    lognewsize=$(ls -l "$wgetlog" | awk '{print $5}')
done

if [ ! -z "$level" ]
then
    echo "`date "+%Y.%m.%d-%H:%M:%S--"`FINISH: $message. mail to $mailAddress." >> $log
    echo -e "`date '+%Y-%m-%d +%H:%M:%S'`\n${PWD}\n$message\n" | mutt -s "Wget Report : FINISH $whereAmI--RUNNING $(ps -A|grep -c wget)" $mailAddress
    counterMail=`expr $counterMail + 1`
else
    echo "`date "+%Y.%m.%d-%H:%M:%S--"`ERROR: $message. mail to $mailAddress." >> $log
    echo -e "`date '+%Y-%m-%d +%H:%M:%S'`\n${PWD}\n$message\n" | mutt -s "Wget Report : ERROR in $whereAmI" $mailAddress
    addtoBoolean=1
    counterMail=`expr $counterMail + 1`
fi

if [ "$addtoBoolean" -eq "1" ]; then
    echo "$downloadDir" >> "$failedqueue"
fi

# Remove the finished or failed task from runningTask.
sed -i "/$whereAmI/d" "$runningTask"
echo "`date "+%Y.%m.%d-%H:%M:%S--"`$downloadDir Monitor ending." >> $log

Summary: This was my first time writing shell scripts, and nearly every modification introduced a pile of new errors. The script quality is poor, but fortunately the three scripts are not too tightly coupled and the division of labor is fairly clear, which made things a lot easier. My everyday work machine sits on the education network while the download box uses a China Unicom PPPoE dial-up, so SSH access is quite slow. Although all the work has been reduced to maintaining one xml file (well, strictly speaking it is not an xml file at all, just tagged text), waiting three or four seconds per keystroke over SSH is still unbearable, so my next step is to rewrite script 1's job in Java and manage the xml file on the web.
Reposted from: https://www.cnblogs.com/jiama/p/3314147.html