Scraping Static Web Pages with Python

Published: 2023/12/20


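Fan fiction on lofter is posted one chapter at a time, and I was too lazy to hunt down each one, so I spent a little time writing a crawler that scrapes the text and saves it locally as text files. The crawler fetches its data through the article search endpoint of a lofter author's page.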

Example: for the author 我是走高冷路線的, the article search URL is: http://sanliubixian.lofter.com/search?q=

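Appending an article title to that URL returns the matching article by that author. Her articles also have a convenient property: the titles are numbered sequentially, e.g. 征服欲1, 征服欲2, and so on, so we can fetch them in a loop.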
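The URL scheme described above can be sketched as follows. This is a minimal illustration written for Python 3 (the post itself uses 2.7); `build_search_urls` and its `first`/`last` parameters are hypothetical names introduced here, not part of the original script.

```python
from urllib.parse import quote

# Search endpoint taken from the post
SEARCH_URL = "http://sanliubixian.lofter.com/search?q="


def build_search_urls(series_title, first=1, last=3):
    """Return one search URL per numbered chapter (hypothetical helper)."""
    urls = []
    for number in range(first, last + 1):
        keyword = "{}{}".format(series_title, number)
        # quote() percent-encodes the UTF-8 bytes of the keyword
        urls.append(SEARCH_URL + quote(keyword))
    return urls


if __name__ == "__main__":
    for url in build_search_urls("征服欲"):
        print(url)
```

Each URL differs only in the trailing chapter number, which is what makes a simple counting loop enough to walk the whole series.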

1. Preparation

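I ran into a lot of pitfalls along the way, which I won't go through one by one. My local Python version is 2.7. Keep this in mind, because 2.7 and 3.x differ in several ways. The difference that matters most here is the urllib module. The following blog post is a useful reference.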

Python 2.xx: "import urllib.request" fails with "No module named request" (典笛安's blog, CSDN)
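One common way to cope with the urllib difference mentioned above is a guarded import that tries the Python 3 module layout first and falls back to the Python 2.7 one. The sketch below is not from the original post; `make_request` is a hypothetical helper, and it only builds a request object without sending anything.

```python
try:
    # Python 3: these names live in submodules of urllib
    from urllib.request import Request, urlopen
    from urllib.parse import quote
except ImportError:
    # Python 2.7: the same names are split across urllib2 and urllib
    from urllib2 import Request, urlopen
    from urllib import quote


def make_request(url, user_agent="Mozilla/5.0"):
    """Build (but do not send) a request carrying a User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    req = make_request("http://sanliubixian.lofter.com/search?q=" + quote("test"))
    print(req.get_full_url())
```

With this pattern the rest of the script can call `Request`, `urlopen`, and `quote` without caring which interpreter is running it.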

Second, install the web module: pip install web.py does it.

Third is the encoding issue. I recommend using a proper Python development tool here; I use Sublime Text.
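The post does not say exactly which encoding problem it hit. One frequent case when saving scraped Chinese text is forgetting to encode it as UTF-8 on the way to disk. A minimal sketch, assuming that is the issue: `io.open` takes an explicit `encoding` on both Python 2.7 and 3.x. `save_text` and `load_text` are hypothetical helper names.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def save_text(path, text):
    """Write unicode text to path, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as fh:
        fh.write(text)


def load_text(path):
    """Read the file back, decoding it as UTF-8."""
    with io.open(path, "r", encoding="utf-8") as fh:
        return fh.read()


if __name__ == "__main__":
    sample = u"征服欲1"
    path = os.path.join(tempfile.mkdtemp(), "sample.txt")
    save_text(path, sample)
    # Compare instead of printing the text, so the console encoding is irrelevant
    print(load_text(path) == sample)
```

The `# -*- coding: utf-8 -*-` line at the top matters on Python 2.7, where a source file containing non-ASCII literals is rejected without it.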

That's about it. Everything lives in a single .py file, so here is the code.

index.py

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    import re
    import urllib
    import urllib2
    import web
    import json
    from bs4 import BeautifulSoup

    urls = (
        '/', 'hello'
    )
    app = web.application(urls, globals())


    # Fetch one article by search keyword and save it as a text file
    def gettext(i):
        url = 'http://sanliubixian.lofter.com/search?q='
        keyword = i.encode(encoding='utf-8')
        key_code = urllib.quote(keyword)  # URL-encode the search keyword
        url_all = url + key_code
        # Request headers
        header = {'User-Agent': ('Mozilla/5.0 (X11; Fedora; Linux x86_64) '
                                 'AppleWebKit/537.36 (KHTML, like Gecko) '
                                 'Chrome/58.0.3029.110 Safari/537.36')}
        request = urllib2.Request(url_all, headers=header)
        response = urllib2.urlopen(request).read()
        html_doc = response
        # Create a BeautifulSoup parser object.
        # Assumption: the first argument of replace() was lost in extraction;
        # '&nbsp;' is the likely original.
        soup = BeautifulSoup(html_doc.replace('&nbsp;', ' '), "html.parser",
                             from_encoding="utf-8")
        # Get the text
        title = soup.find('h2')
        print title
        if title is None:
            # No <h2> on the page means there are no more chapters
            print "All articles scraped!"
            return "false"
        else:
            p_nodes = soup.find_all('p')
            # Write the file into the current directory
            fh = open("./" + title.get_text() + ".txt", "wb")
            fh.write(title.get_text().encode(encoding='utf-8'))
            fh.write('\r\n')
            for p_node in p_nodes:
                # print p_node.get_text()
                fh.write(p_node.get_text().encode(encoding='utf-8'))
                fh.write('\r\n')
            fh.close()
            print "Scraped: " + title.get_text().encode(encoding='utf-8')
            return "true"


    class hello:
        def __init__(self):
            web.header('content-type', 'text/json')
            web.header('Access-Control-Allow-Origin', '*')
            web.header('Access-Control-Allow-Methods', 'GET, POST')

        def GET(self):
            i = web.input(name=None)
            # Try chapters 1..29 and stop at the first one that is not found
            for num in range(1, 30):
                s = i.name + str(num)
                result = gettext(s)
                if result == "false":
                    break
            '''
            t={'msg':'Starting to scrape...','title':title.get_text()}
            s={}
            s['data']=t
            return json.dumps(s,ensure_ascii=False)
            '''

        def POST(self):
            a = int(web.input().a)
            b = int(web.input().b)
            return a + b


    if __name__ == "__main__":
        app.run()
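The parsing step in the script (take the first h2 as the title, then every p as body text) can be tried out offline without any network access. The sketch below is an illustration only: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs on a plain Python 3 install, and `TitleAndParagraphs` and `extract` are hypothetical names, not part of the original script.

```python
from html.parser import HTMLParser


class TitleAndParagraphs(HTMLParser):
    """Collect the text of the first <h2> and of every <p>."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.paragraphs = []
        self._current = None  # tag whose text is being captured, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        wanted = tag == "p" or (tag == "h2" and self.title is None)
        if self._current is None and wanted:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = "".join(self._buffer)
            if tag == "h2":
                self.title = text
            else:
                self.paragraphs.append(text)
            self._current = None


def extract(html):
    """Return (title, paragraphs); title is None when the page has no <h2>."""
    parser = TitleAndParagraphs()
    parser.feed(html)
    parser.close()
    return parser.title, parser.paragraphs


if __name__ == "__main__":
    page = "<h2>Chapter 1</h2><p>First <b>bold</b> line.</p><p>Second line.</p>"
    title, paragraphs = extract(page)
    print(title, len(paragraphs))
```

A `None` title plays the same role as in the script above: it signals that the search found no article, which is the condition the crawl loop uses to stop.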

Run result: [screenshot]

Writing this up and working through the pitfalls took real effort, so please credit the source when reposting. Thanks!

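This is my first time working with Python. I wrote this by finding material and reading documentation on my own, so I ask the professionals to forgive anything that looks unprofessional.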
