Communications of the CCF (CCCF) Download Tool (a Simple Crawler)
CCCF
Communications of the CCF (CCCF) is a monthly magazine sponsored by the China Computer Federation and published by Higher Education Press, aimed at computing professionals and others working in the information field. Drawing on the Federation's academic strengths, the magazine invites the most influential experts in each area of information technology to contribute, giving a comprehensive, big-picture view of the latest developments in computer science and technology and forecasting future trends. It helps readers broaden their horizons, follow the IT frontier, and grasp the direction of the field; authoritative and instructive, it suits readers in computing-related research, teaching, industry, and management.
Website: http://www.ccf.org.cn/sites/ccf/zgjsjxhtx.jsp
The download problem
First, a look at the download process. After opening the CCCF page, downloading an article from a given issue means clicking the issue, then the article title, then the download link, which finally yields a file with a name like "0.pdf" or "1.pdf".
If an issue contains, say, 15 papers, downloading all of them takes dozens of mouse clicks, and the resulting files are all named with bare numbers, so renaming them one by one also costs time.
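Each of those clicks is just an HTTP GET against a fixed URL pattern, which is what makes the process easy to script. A minimal sketch of the three endpoints involved (the contentId values here are hypothetical placeholders; the full program below scrapes the real ones from the index pages):

import urllib2

journal_id = '12345'  # hypothetical contentId of one issue
paper_id = '67890'    # hypothetical contentId of one paper

# 1. the issue page lists the papers in that issue
issue_html = urllib2.urlopen(
    'http://www.ccf.org.cn/sites/ccf/jsjtbbd.jsp?contentId=' + journal_id).read()

# 2. each paper has a free-download page that contains the file link
paper_html = urllib2.urlopen(
    'http://www.ccf.org.cn/sites/ccf/freexiazai.jsp?contentId=' + paper_id).read()

# 3. the file itself is served through download.jsp, given the path found in step 2
# pdf = urllib2.urlopen('http://www.ccf.org.cn/sites/ccf/download.jsp?file=' + file_path).read()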
Solution (two programs):
(1) Given an issue ID, automatically download every article in that issue (a sketch of this single-issue mode appears after the source code below);
GitHub: https://github.com/cheesezhe/ccf_crawler (help documentation included)
(2) Automatically download every article of every issue; just run the source code below;
Download: http://pan.baidu.com/s/1gdWOTt5 password: 6212
Source code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'ZhangHe'
import urllib2, re, os, httplib, urllib


def download_by_paper_url(src_url, dest_file):
    """
    Download the paper at src_url and save it as dest_file.
    :param src_url:
    :param dest_file:
    :return:
    """
    f = urllib2.urlopen(src_url)
    try:
        data = f.read()
    except httplib.IncompleteRead as e:
        with open('err_log.txt', 'a+') as err:  # error log
            err.write("%s %s\n" % (src_url, e))
        print 'Error'
        return -1
    with open(dest_file, "wb") as code:
        code.write(data)


def parse_data_from_journal_url(src_url):
    """
    Extract the paper IDs, paper titles, and issue name from an issue page.
    :param src_url:
    :return: [paper_ids, paper_titles, journal_name]
    """
    request = urllib2.Request(src_url)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')

    print 'parsing paper IDs...'
    pattern_str1 = '<a target=.*?title=.*?href=.*?contentId=(.*?)">'
    pattern_str2 = '<span id=.*?class="cfqwz">(.*?)</span>'
    pattern_str3 = '<title>(.*?)-.*?</title>'
    pattern1 = re.compile(pattern_str1, re.S)
    pattern2 = re.compile(pattern_str2, re.S)
    pattern3 = re.compile(pattern_str3, re.S)
    ids = re.findall(pattern1, content)
    titles = re.findall(pattern2, content)
    name = re.findall(pattern3, content)

    return [ids, titles, name[0].strip()]


def get_url_by_paper_id(id):
    """
    Build the download link for the paper with the given ID.
    :param id:
    :return:
    """
    src_url = 'http://www.ccf.org.cn/sites/ccf/freexiazai.jsp?contentId=' + str(id)
    request = urllib2.Request(src_url)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')

    pattern_str = 'class=""><a href="(.*?)">.*?</a></span>'
    pattern = re.compile(pattern_str, re.S)
    urls = re.findall(pattern, content)
    #
    # if there is no url, return -1
    if len(urls) < 1:
        return -1
    #
    # percent-encode the (possibly Chinese) file name at the end of the url
    tmps = urls[0].split('/')
    tmps[-1] = urllib.quote(tmps[-1].encode('utf-8'))
    tmp = '/' + '/'.join(tmps)
    return 'http://www.ccf.org.cn/sites/ccf/download.jsp?file=' + tmp


def get_all_journals_ids():
    """
    Collect the content IDs of all issues from the four index pages.
    """
    urls = [
        'http://www.ccf.org.cn/sites/ccf/zgjsjxhtx.jsp?jportal=SFXxdDjYKXLl06cz1fxjkzihsqP9JcoP',    # issues 89-118
        'http://www.ccf.org.cn/sites/ccf/zgjsjxhtx.jsp?jportal=SFXxdDjYKXLl06cz1fxjk%2FySA9FzIG2g',  # issues 59-88
        'http://www.ccf.org.cn/sites/ccf/zgjsjxhtx.jsp?jportal=SFXxdDjYKXLl06cz1fxjk7R3hW0kV5Np',    # issues 29-58
        'http://www.ccf.org.cn/sites/ccf/zgjsjxhtx.jsp?jportal=SFXxdDjYKXLl06cz1fxjk%2BP28%2Bg%2BBW1u'  # issues 01-28
    ]
    res = []

    for src_url in urls:
        print 'processing\t' + src_url
        request = urllib2.Request(src_url)
        response = urllib2.urlopen(request)
        content = response.read().decode('utf-8')

        pattern_str = '<li id="(.*?)">.*?<a target='
        pattern = re.compile(pattern_str, re.S)
        ids = re.findall(pattern, content)
        res.extend(ids)
    return res


def get_all_done_papers_ids():
    """
    Get the IDs of all papers that have already been downloaded.
    :return:
    """
    dl_ids = []
    if not os.path.exists('dl_list.txt'):  # first run: nothing downloaded yet
        return dl_ids
    with open('dl_list.txt', 'r') as dl:  # IDs of downloaded papers
        for i in dl:
            dl_ids.append(i.strip())
    return dl_ids


def get_all_done_journals_ids():
    """
    Get the IDs of all issues that have already been downloaded completely.
    :return:
    """
    dl_j = []
    if not os.path.exists('dl_list_j.txt'):  # first run: nothing downloaded yet
        return dl_j
    with open('dl_list_j.txt', 'r') as dl:  # IDs of downloaded issues
        for i in dl:
            dl_j.append(i.strip())
    return dl_j


def create_new_directory(dir_name):
    """
    Create a directory named dir_name.
    :param dir_name:
    :return:
    """
    try:
        os.mkdir(dir_name)
    except OSError:  # the directory already exists
        pass


def get_paper_title(origin_title):
    """
    Sanitize a paper title so it can be used as a file name.
    :param origin_title:
    :return ret:
    """
    ret = origin_title.strip()
    ret = ret.replace('/', '-')
    ret = ret.replace('?', '')
    ret = ret.replace('*', '_x_')
    return ret


if __name__ == '__main__':
    """
    Step 1: get the list of issue IDs, the already-downloaded issue IDs,
    and the already-downloaded paper IDs
    """
    all_journals_ids = get_all_journals_ids()
    all_done_journals_ids = get_all_done_journals_ids()
    all_done_papers_ids = get_all_done_papers_ids()

    """
    Step 2: iterate over the issue IDs and process them one by one
    """
    for journal_id in all_journals_ids:
        #
        # skip this issue if it has already been downloaded
        if journal_id in all_done_journals_ids:
            print '%s has been downloaded.' % (journal_id)
            continue
        #
        # parse the issue page: ret_data = [paper IDs, paper titles, issue name]
        journal_url = 'http://www.ccf.org.cn/sites/ccf/jsjtbbd.jsp?contentId=' + journal_id
        ret_data = parse_data_from_journal_url(journal_url)
        print 'Start Download %s\t %s' % (journal_id, ret_data[2])
        #
        # create a directory named after the issue
        create_new_directory(ret_data[2])
        finished = 0
        """
        Step 3: iterate over the paper IDs of this issue, one by one
        """
        for idx in xrange(len(ret_data[0])):
            paper_id = ret_data[0][idx]
            #
            # skip this paper if it has already been downloaded
            if paper_id in all_done_papers_ids:
                print 'Paper %s has been downloaded.' % paper_id
                finished += 1
                continue
            #
            # build the download link for paper_id
            title = get_paper_title(ret_data[1][idx])
            print 'Downloading (%s/%s) ID:%s Title:%s' % (str(idx + 1), str(len(ret_data[0])), paper_id, title)
            target_url = get_url_by_paper_id(paper_id)
            #
            # if target_url is -1, there is no download link
            # for paper_id (a rare special case)
            if target_url == -1:
                print 'There is no url for paper %s' % paper_id
                finished += 1
                continue
            """
            Step 4: download the file from the link
            """
            dl_result = download_by_paper_url(target_url, ret_data[2] + '\\' + title + '.pdf')
            if dl_result != -1:
                finished += 1
                with open('dl_list.txt', 'a+') as dl:  # record the downloaded paper ID
                    dl.write(paper_id + '\n')
            else:
                with open('err_list.txt', 'a+') as err:  # record failed issue ID / paper ID pairs
                    err.write(journal_id + ' ' + paper_id + '\n')
        if finished == len(ret_data[0]):
            with open('dl_list_j.txt', 'a+') as dl:  # record the fully downloaded issue ID
                dl.write(journal_id + '\n')
    print 'All finished.'