Writing a Simple Web Crawler in Python

Published: 2023/12/4 | python | 20 | 豆豆


Preparation:

1. Fetch a web page: retrieve its content by URL

import urllib.parse
import urllib.request

response = urllib.request.urlopen("https://www.cnblogs.com")
print(response.read())

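The urlopen method

urlopen(url, data, timeout)

url is the address to request, data is the data to send with the request (when it is supplied, the request is sent as a POST), and timeout is the timeout in seconds.

It returns a response object.

The response object's read method returns the content of the fetched page as bytes.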
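As a minimal sketch of how these parameters fit together, the helper below wraps urlopen with a timeout and decodes the bytes returned by read. The name fetch_text, the 10-second default and the UTF-8 default are assumptions made for illustration.

```python
import socket
import urllib.error
import urllib.request


def fetch_text(url, timeout=10, encoding="utf-8"):
    """Return the decoded body of url, or None if the request fails."""
    try:
        # The with block closes the response once the body has been read
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode(encoding)
    except (urllib.error.URLError, socket.timeout) as exc:
        # URLError also covers HTTPError (4xx/5xx responses)
        print("request failed:", exc)
        return None
```

Calling fetch_text("https://www.cnblogs.com") would return the home page as a str, or None when the request fails.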

POST request

import urllib.parse
import urllib.request

values = {"username": "XXX", "password": "XXX"}
# urlencode builds the form body; urlopen needs it as bytes
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')

url = "https://passport.cnblogs.com/user/signin?ReturnUrl=https://home.cnblogs.com/&AspxAutoDetectCookieSupport=1"
# Passing data makes urlopen send a POST
response = urllib.request.urlopen(url, data)
print(response.read())

GET request

import urllib.parse
import urllib.request

values = {"itemCount": 30}
# For GET the parameters go into the query string, so they stay a str (no encode to bytes)
data = urllib.parse.urlencode(values)

url = "https://news.cnblogs.com/CommentAjax/GetSideComments"
response = urllib.request.urlopen(url + '?' + data)
print(response.read())
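The difference between the two request styles can be checked without sending anything. This sketch builds two urllib.request.Request objects and asks each for its HTTP method. The https://example.com/login address is a placeholder, not a URL from the article.

```python
import urllib.parse
import urllib.request

params = {"itemCount": 30}
query = urllib.parse.urlencode(params)
print(query)  # itemCount=30

# GET: parameters are appended to the URL, no body is passed
get_req = urllib.request.Request(
    "https://news.cnblogs.com/CommentAjax/GetSideComments?" + query
)

# POST: the same encoded string is passed as a bytes body (placeholder URL)
post_req = urllib.request.Request(
    "https://example.com/login", data=query.encode("utf-8")
)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```

Either object can be passed to urllib.request.urlopen in place of a URL string.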

2. Regular expressions: the re module

Python ships with the re module, which provides regular expression support.

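# Returns a pattern object
re.compile(string[, flag])

# The following functions perform the matching
re.match(pattern, string[, flags])     # checks whether the beginning of the string matches the regular expression
re.search(pattern, string[, flags])    # scans the string for the first position where the regular expression matches
re.split(pattern, string[, maxsplit])  # splits the string by the regular expression
re.findall(pattern, string[, flags])   # finds all substrings matched by the RE and returns them as a list
re.finditer(pattern, string[, flags])  # finds all substrings matched by the RE and returns them as an iterator
re.sub(pattern, repl, string[, count])   # finds all substrings matched by the RE and replaces them with a different string
re.subn(pattern, repl, string[, count])  # returns (sub(repl, string[, count]), number of substitutions)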
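A short sketch of these functions on one sample string. The string and the pattern are made up for illustration.

```python
import re

text = "page 12 of 340"
number = re.compile(r"\d+")  # compile once, reuse below

print(number.match(text))           # None: the string does not start with a digit
print(number.search(text).group())  # 12
print(number.findall(text))         # ['12', '340']
print([m.start() for m in number.finditer(text)])  # [5, 11]
print(re.split(r"\s+", text))       # ['page', '12', 'of', '340']
print(number.sub("#", text))        # page # of #
print(number.subn("#", text))       # ('page # of #', 2)
```

match returns None here because it only tries the start of the string, while search scans forward until it finds a match.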

3. Beautiful Soup, a library for extracting data from web pages; it is imported from the bs4 package

4. MongoDB

The MongoEngine library is used to access it.

Example:

Scrape the first 20 pages of cnblogs (博客園) and save the data to MongoDB

1. Fetch the data from cnblogs

request.py

import urllib.parse
import urllib.request


def getHtml(url, values):
    # Send a GET request with values as the query string and return the decoded HTML
    data = urllib.parse.urlencode(values)
    response_result = urllib.request.urlopen(url + '?' + data).read()
    html = response_result.decode('utf-8')
    return html


def requestCnblogs(num):
    # Request page num of the cnblogs post list
    print('requesting page:', num)
    url = 'https://www.cnblogs.com/mvc/AggSite/PostList.aspx'
    values = {'CategoryId': 808, 'CategoryType': 'SiteHome',
              'ItemListActionName': 'PostList', 'PageIndex': num,
              'ParentCategoryId': 0, 'TotalPostCount': 4000}
    result = getHtml(url, values)
    return result
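Some sites reject requests that carry urllib's default User-Agent. The article does not say whether cnblogs does, so the following is only a sketch of how a header could be attached to a request. The helper name build_request and the "Mozilla/5.0" value are assumptions.

```python
import urllib.parse
import urllib.request


def build_request(url, values, user_agent="Mozilla/5.0"):
    """Build (but do not send) a GET request with a query string and a User-Agent header."""
    query = urllib.parse.urlencode(values)
    return urllib.request.Request(
        url + "?" + query, headers={"User-Agent": user_agent}
    )


req = build_request(
    "https://www.cnblogs.com/mvc/AggSite/PostList.aspx",
    {"CategoryId": 808, "PageIndex": 2},
)
print(req.full_url)
# Request stores header names in capitalized form, hence "User-agent"
print(req.get_header("User-agent"))
```

Passing req to urllib.request.urlopen(req, timeout=10) would then send it.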

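Note:

Open the second page of the post list, press F12 to open the browser developer tools, and find the request to https://www.cnblogs.com/mvc/AggSite/PostList.aspx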

2. Parse the fetched data

deal.py

from bs4 import BeautifulSoup
import request  # the local request.py shown above
import re


def blogParser(index):
    cnblogs = request.requestCnblogs(index)
    soup = BeautifulSoup(cnblogs, 'html.parser')
    all_div = soup.find_all('div', attrs={'class': 'post_item_body'}, limit=20)
    blogs = []
    # Loop over the divs and extract the details of each post
    for item in all_div:
        blog = analyzeBlog(item)
        blogs.append(blog)
    return blogs


def analyzeBlog(item):
    result = {}
    a_title = find_all(item, 'a', 'titlelnk')
    # find_all returns a (possibly empty) list, never None, so test truthiness
    if a_title:
        result["title"] = a_title[0].string
        result["link"] = a_title[0]['href']
    p_summary = find_all(item, 'p', 'post_item_summary')
    if p_summary:
        result["summary"] = p_summary[0].text
    footers = find_all(item, 'div', 'post_item_foot')
    footer = footers[0]
    result["author"] = footer.a.string
    footer_text = footer.text
    # \S+ \S+ captures the two whitespace-separated tokens after the label
    # (assumed to be the date and the time); a trailing lazy .+? would match only one character
    times = re.findall(r"發布于 \S+ \S+", footer_text)
    result["create_time"] = times[0].replace('發布于', '').strip()
    return result


def find_all(item, attr, c):
    return item.find_all(attr, attrs={'class': c}, limit=1)
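When extracting the publish time, keep in mind that a lazy quantifier at the end of a pattern matches as little as possible. The sketch below compares two patterns on a made-up footer string. The real footer layout on cnblogs may differ.

```python
import re

# Made-up footer text; only the "發布于" label is taken from the parsing code.
footer_text = "author 發布于 2017-06-01 10:30 comments(0)"

lazy = re.findall(r"發布于 .+? .+?", footer_text)
tokens = re.findall(r"發布于 \S+ \S+", footer_text)

print(lazy)    # ['發布于 2017-06-01 1']
print(tokens)  # ['發布于 2017-06-01 10:30']
```

The trailing .+? stops after a single character, so the time is cut short. \S+ consumes each whitespace-delimited token in full.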

Note:

Inspect the page's HTML structure to find the tags and class names to select.
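To make the structure analysis concrete, here is a sketch that runs Beautiful Soup over a small hand-written snippet. The markup is hypothetical. Only the class names (post_item_body, titlelnk, post_item_summary, post_item_foot) come from deal.py, and the real page may nest these elements differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup built around the class names used in deal.py
sample_html = """
<div class="post_item_body">
  <h3><a class="titlelnk" href="https://example.com/post/1">First post</a></h3>
  <p class="post_item_summary">A short summary.</p>
  <div class="post_item_foot"><a href="https://example.com/u/alice">alice</a> 2017-06-01 10:30</div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
for body in soup.find_all("div", attrs={"class": "post_item_body"}):
    # find returns the first matching tag, or None if there is none
    title = body.find("a", attrs={"class": "titlelnk"})
    summary = body.find("p", attrs={"class": "post_item_summary"})
    footer = body.find("div", attrs={"class": "post_item_foot"})
    print(title.string, title["href"])
    print(summary.text.strip())
    print(footer.a.string)
```

Unlike find_all, find returns None when nothing matches, so an "is not None" check is the right guard when using it.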

3. Save the processed data to MongoDB

db.py

from mongoengine import Document, StringField, connect

# Connect to the 'test' database on a local MongoDB server
connect('test', host='localhost', port=27017)


class Blogs(Document):
    title = StringField(required=True, max_length=200)
    link = StringField(required=True)
    author = StringField(required=True)
    summary = StringField(required=True)
    create_time = StringField(required=True)


def savetomongo(contents):
    # Save each parsed post as one Blogs document
    for content in contents:
        blog = Blogs(
            title=content['title'],
            link=content['link'],
            author=content['author'],
            summary=content['summary'],
            create_time=content['create_time']
        )
        blog.save()
    return "ok"


def haveBlogs():
    # Number of documents stored in the Blogs collection
    return Blogs.objects.count()

4. Start scraping

test.py

import db
import deal

print("start.......")
for i in range(1, 21):
    contents = deal.blogParser(i)
    db.savetomongo(contents)
    print('page', i, 'OK.')

counts = db.haveBlogs()
print("have", counts, "blogs")
print("end.......")
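In the loop above, one failing page stops the whole run. A possible variation, sketched here with stand-in functions so that it runs without network access or MongoDB, records the failed pages and continues. The names crawl and fake_parse are hypothetical.

```python
def crawl(pages, parse, save):
    """Run parse then save for each page; return the pages that failed."""
    failed = []
    for page in pages:
        try:
            save(parse(page))
            print("page", page, "OK.")
        except Exception as exc:  # keep going so one bad page does not stop the run
            print("page", page, "failed:", exc)
            failed.append(page)
    return failed


# Stand-in for deal.blogParser: page 2 simulates a parsing error
def fake_parse(page):
    if page == 2:
        raise ValueError("simulated parse error")
    return [{"title": "post from page %d" % page}]


saved = []  # stand-in for the database; saved.extend plays the role of db.savetomongo
failed_pages = crawl(range(1, 4), fake_parse, saved.extend)
print(failed_pages)  # [2]
print(len(saved))    # 2
```

With the real modules the call would be crawl(range(1, 21), deal.blogParser, db.savetomongo).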

Notes:

The Python version used here is 3.6.1.

The saved data can be viewed in a MongoDB GUI tool.
