...Learning Python
5.14
...Last time I studied Python was about a month ago..
Writing a few things down to keep a record..
Right now I'm working from the blog posts Boss Li wrote.. probably just copying the code straight off...
My own attempts never seem to scrape successfully; the Douban movie-review crawler I wrote before is still half-broken...
1. The simplest case: fetching a web page
import urllib2

html = urllib2.urlopen('http://music.163.com/')
print html.read()
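For reference, in Python 3 urllib2 was folded into urllib.request. A minimal sketch of the same fetch; the data: URL below is a stand-in of mine so the snippet runs without network access — substitute the real page URL in practice:

```python
from urllib.request import urlopen

# Python 3 equivalent of urllib2.urlopen; the data: URL is a stand-in
# for 'http://music.163.com/' so this runs offline.
response = urlopen('data:text/plain;charset=utf-8,hello')
body = response.read()          # read() returns bytes in Python 3
print(body.decode('utf-8'))     # decode before treating it as text
```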
2. Saving the fetched page
It seems that because I used that web-page mirroring tool earlier, the generated html contains what's in the current directory rather than the content of the page I actually scraped... sigh..

import urllib2

response = urllib2.urlopen('http://music.163.com/')
html = response.read()
open('testt.html', "w").write(html)
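One habit worth noting: the bare open(...).write(...) above never closes the file, so the contents may not be flushed to disk. A with-block does both automatically; the placeholder html string here is mine, standing in for the fetched page:

```python
# Placeholder standing in for response.read() from the snippet above
html = '<html>stand-in for the fetched page</html>'

# The with-statement flushes and closes the file even if an error occurs
with open('testt.html', 'w') as f:
    f.write(html)
```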
3. Scraping a wallpaper from ZOL

import urllib2
import re  # the regular-expression library

# URL of the page the wallpaper we want lives on
url = 'http://desk.zol.com.cn/bizhi/6377_78500_2.html'
response = urllib2.urlopen(url)
# fetch the page content
html = response.read()
# a regular expression that locates the image's address
reg = re.compile(r'<img id="bigImg" src="(.*?jpg)" .*>')
imgurl = reg.findall(html)[0]
# open the image and save it as haha.jpg
imgsrc = urllib2.urlopen(imgurl).read()
open("haha.jpg", "w").write(imgsrc)

This is copied straight from Boss Li's code.
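The regex step can be checked offline against a snippet shaped like ZOL's page; the HTML below is made up for illustration, not taken from the real site:

```python
import re

# Hypothetical snippet mimicking the tag the pattern above targets
html = '<img id="bigImg" src="http://example.com/pic/6377.jpg" width="960">'

reg = re.compile(r'<img id="bigImg" src="(.*?jpg)" .*>')
imgurl = reg.findall(html)[0]   # lazy .*? stops at the first "jpg"
print(imgurl)
```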
But the wallpaper I scraped came out looking like this
No idea why
Solved it......
http://m.ithao123.cn/content-6589593.html
The file should be opened with "wb"
import urllib2
import re

url = 'http://desk.zol.com.cn/bizhi/6377_78500_2.html'
response = urllib2.urlopen(url)
html = response.read()
reg = re.compile(r'<img id="bigImg" src="(.*?jpg)" .*>')
imgurl = reg.findall(html)[0]
imgsrc = urllib2.urlopen(imgurl).read()
# "wb": write the image bytes in binary mode
open("haha.jpg", "wb").write(imgsrc)

Then I could see the wallpaper. What a moment!!!
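The "w" vs "wb" difference can be shown without touching the network: a JPEG starts with bytes like FF D8 FF and may contain newline bytes, which text mode can translate and corrupt (on Windows, '\n' becomes '\r\n'). In binary mode the bytes round-trip exactly; the header below is fabricated for the demo:

```python
# Fake JPEG header containing an embedded newline byte
data = b'\xff\xd8\xff\xe0\n\x10JFIF'

# Binary mode writes bytes through untouched on every platform
with open('haha.jpg', 'wb') as f:
    f.write(data)

with open('haha.jpg', 'rb') as f:
    print(f.read() == data)  # True: the bytes survived intact
```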
5.15
Today I tried Boss Li's code for scraping ZOL wallpapers; the folder it produced was empty... and the file names were gibberish..
But Boss Li said that one only works on Linux
So I started reading 崔慶才's tutorial instead
1. Scraping a Tieba thread
Screenshots of the result:
Then while copying the code I ran into three problems
1) Printing Chinese raises an error
One fix is this:

#!/usr/bin/python
#coding:utf-8

as that blog post explains
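The underlying issue is mixing byte strings with Unicode: response.read() gives back UTF-8 bytes, and they need to be decoded before printing. A tiny sketch of the decode step, written for Python 3, where str is Unicode by default:

```python
# Bytes as they would come back from response.read()
raw = u'百度貼吧'.encode('utf-8')

# Decode explicitly before printing or concatenating with text
text = raw.decode('utf-8')
print(text)
```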
2) Then, after fixing the problem above,
it still raised an error, like the one below
The fix is what the screenshot shows
3) The last problem: Tieba has changed its page layout
The title pattern has to be changed to
<h3 class="core_title_txt".*?>(.*?)</h3>

The final code looks like this:

#!/usr/bin/python
# coding:utf-8
__author__ = 'CQC'

import urllib
import urllib2
import re

# strips HTML tags out of post content
class Tool:
    removeImg = re.compile('<img.*?>| {7}|')
    removeAddr = re.compile('<a.*?>|</a>')
    replaceLine = re.compile('<tr>|<div>|</div>|</p>')
    replaceTD = re.compile('<td>')
    replacePara = re.compile('<p.*?>')
    replaceBR = re.compile('<br><br>|<br>')
    removeExtraTag = re.compile('<.*?>')

    def replace(self, x):
        x = re.sub(self.removeImg, "", x)
        x = re.sub(self.removeAddr, "", x)
        x = re.sub(self.replaceLine, "\n", x)
        x = re.sub(self.replaceTD, "\t", x)
        x = re.sub(self.replacePara, "\n ", x)
        x = re.sub(self.replaceBR, "\n", x)
        x = re.sub(self.removeExtraTag, "", x)
        return x.strip()

class BDTB:
    def __init__(self, baseUrl, seeLZ, floorTag):
        self.baseURL = baseUrl
        self.seeLZ = '?see_lz=' + str(seeLZ)
        self.tool = Tool()
        self.file = None
        self.floor = 1
        self.defaultTitle = u"百度貼吧"
        self.floorTag = floorTag

    # fetch one page of the thread
    def getPage(self, pageNum):
        try:
            url = self.baseURL + self.seeLZ + '&pn=' + str(pageNum)
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            return response.read().decode('utf-8', 'ignore')
        except urllib2.URLError, e:
            if hasattr(e, "reason"):
                print u"連接百度貼吧失敗,錯誤原因", e.reason
            return None

    def getTitle(self, page):
        pattern = re.compile('<h3 class="core_title_txt".*?>(.*?)</h3>', re.S)
        result = re.search(pattern, page)
        if result:
            return result.group(1).strip()
        else:
            return None

    def getPageNum(self, page):
        pattern = re.compile('<li class="l_reply_num.*?</span>.*?<span.*?>(.*?)</span>', re.S)
        result = re.search(pattern, page)
        if result:
            return result.group(1).strip()
        else:
            return None

    def getContent(self, page):
        pattern = re.compile('<div id="post_content_.*?>(.*?)</div>', re.S)
        items = re.findall(pattern, page)
        contents = []
        for item in items:
            content = "\n" + self.tool.replace(item) + "\n"
            contents.append(content.encode('utf-8'))
        return contents

    def setFileTitle(self, title):
        if title is not None:
            self.file = open(title + ".txt", "w+")
        else:
            self.file = open(self.defaultTitle + ".txt", "w+")

    def writeData(self, contents):
        for item in contents:
            if self.floorTag == '1':
                floorLine = "\n" + str(self.floor) + u"-----------------------------------------------------------------------------------------\n"
                self.file.write(floorLine)
            self.file.write(item)
            self.floor += 1

    def start(self):
        indexPage = self.getPage(1)
        pageNum = self.getPageNum(indexPage)
        title = self.getTitle(indexPage)
        self.setFileTitle(title)
        if pageNum == None:
            print "URL已失效,請重試"
            return
        try:
            print "該帖子共有" + str(pageNum) + "頁"
            for i in range(1, int(pageNum) + 1):
                print "正在寫入第" + str(i) + "頁數據"
                page = self.getPage(i)
                contents = self.getContent(page)
                self.writeData(contents)
        except IOError, e:
            print "寫入異常,原因" + e.message
        finally:
            print "寫入任務完成"

print u"請輸入帖子代號"
baseURL = 'http://tieba.baidu.com/p/' + str(raw_input(u'http://tieba.baidu.com/p/'))
seeLZ = raw_input("是否只獲取樓主發言,是輸入1,否輸入0\n")
floorTag = raw_input("是否寫入樓層信息,是輸入1,否輸入0\n")
bdtb = BDTB(baseURL, seeLZ, floorTag)
bdtb.start()
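The Tool class is pure string processing, so its behaviour is easy to check offline. A Python 3 sketch of the same tag-stripping idea, run on a made-up post snippet (the post HTML here is mine, not from a real thread):

```python
import re

# Trimmed-down versions of the substitutions in the Tool class
removeAddr = re.compile(r'<a.*?>|</a>')    # drop link tags, keep their text
replaceBR = re.compile(r'<br><br>|<br>')   # turn <br> into newlines
removeExtraTag = re.compile(r'<.*?>')      # drop any remaining tag

def clean(x):
    x = removeAddr.sub('', x)
    x = replaceBR.sub('\n', x)
    x = removeExtraTag.sub('', x)
    return x.strip()

post = 'first line<br>see <a href="http://example.com">this link</a><br>end'
print(clean(post))
```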
I still don't understand how it works; need to keep reading.
5.18
A class for scraping Tieba content

#!/usr/bin/python
#coding:utf-8
import urllib
import urllib2
import re

class bdtb:
    def __init__(self, baseurl, seelz):
        self.baseurl = baseurl
        self.seelz = '?see_lz=' + str(seelz)

    def getPage(self, pagenum):
        try:
            url = self.baseurl + self.seelz + '&pn=' + str(pagenum)
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            print response.read()
            return response
        except urllib2.URLError, e:
            if hasattr(e, "reason"):
                print u"連接百度貼吧失敗,錯誤原因", e.reason
            return None

baseurl = 'http://tieba.baidu.com/p/3138733512'
bb = bdtb(baseurl, 1)
bb.getPage(1)
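The URL the class builds is just the thread address plus two query parameters, see_lz (show only the original poster) and pn (page number). Spelled out step by step:

```python
# How getPage assembles the request URL
baseurl = 'http://tieba.baidu.com/p/3138733512'
seelz = '?see_lz=' + str(1)              # 1 = only the thread starter's posts
url = baseurl + seelz + '&pn=' + str(1)  # pn = page number
print(url)
```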
5.19
Simulating a login to the school's information portal
You have to use IE to see your grades, but in IE I can't see the form, i.e. the form data
So at this point I switched back to Sogou
#coding=utf-8
import urllib
import urllib2
import cookielib
import re

class CHD:
    def __init__(self):
        self.loginUrl = 'http://bksjw.chd.edu.cn/loginAction.do'
        self.cookies = cookielib.CookieJar()
        # form fields: login type / student ID / password (values redacted here)
        self.postdata = urllib.urlencode({
            'dllx': 'dldl',
            'zjh': 'xxxx',
            'mm': 'xxxx'
        })
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookies))

    def getPage(self):
        request = urllib2.Request(url=self.loginUrl, data=self.postdata)
        result = self.opener.open(request)
        print result.read().decode('gbk')

chd = CHD()
chd.getPage()
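urllib.urlencode (urllib.parse.urlencode in Python 3) just serializes the form-field dict into the application/x-www-form-urlencoded body that the POST request needs. The field names match the login form above; the values here are placeholders, not real credentials:

```python
from urllib.parse import urlencode

# dllx / zjh / mm are the form's field names; values are placeholders
postdata = urlencode({'dllx': 'dldl', 'zjh': 'xxxx', 'mm': 'xxxx'})
print(postdata)
```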
Reposted from: https://www.cnblogs.com/wuyuewoniu/p/5491979.html