[Web Crawler][Python][Beginner][Page Source][Baidu Images][Douban Top 250]
The Robots protocol: check a site's crawling rules and comply with applicable laws
- The Robots protocol (also called the crawler protocol or robot protocol), formally the Robots Exclusion Protocol, is how a website tells crawlers which pages may be fetched and which may not.
- robots.txt is the first file a search engine reads when it visits a site: it tells the crawler which files on the server may be accessed (a quick programmatic check is sketched below).
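As a quick illustration (this sketch is my addition, not part of the original post), Python's standard-library `urllib.robotparser` can run that check before you crawl; the Douban URL here is just an example target:

```python
from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt (any site works the same way)
rp = RobotFileParser('https://movie.douban.com/robots.txt')
rp.read()

# May a generic user agent ('*') fetch this path? Prints True or False.
print(rp.can_fetch('*', 'https://movie.douban.com/top250'))
```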
Fetching a page's source
- 輸入網(wǎng)址后若失敗 即不允許爬蟲
- 如輸入網(wǎng)址后 只在瀏覽器中打開頁面 請將光標(biāo)重新移動到末端 點(diǎn)擊空格 后按回車
```python
import requests  # Python HTTP client library, widely used for crawlers and for testing server responses
import re        # regular-expression module, used to extract the image URLs
import random    # used here to generate pseudo-random file names


def spiderPic(html, keyword):
    print('Searching for: ' + keyword + ', downloading matching images from Baidu, please wait ...')
    for addr in re.findall('"objURL":"(.*?)"', html, re.S):
        print('Now crawling URL: ' + str(addr)[0:30] + ' ...')
        try:
            pics = requests.get(addr, timeout=10)  # request the image URL (10 s timeout)
        except requests.exceptions.ConnectionError:
            print('The requested URL could not be reached!')
            continue
        # note: random file names may collide and overwrite earlier downloads
        fq = open('S:\\python\\search\\img\\' + str(random.randrange(0, 1000, 4)) + '.jpg', 'wb')
        fq.write(pics.content)
        fq.close()


if __name__ == '__main__':
    print('Here we go!')
    word = input('Enter the keyword you want to search for: ')
    result = requests.get('http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=' + word)
    spiderPic(result.text, word)
```
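One robustness tweak worth noting (my suggestion, not in the original): passing the query string via `params` lets requests URL-encode the keyword, which matters once it contains non-ASCII characters:

```python
import requests

word = '风景'  # example keyword; non-ASCII is safe because requests encodes it
result = requests.get('http://image.baidu.com/search/flip',
                      params={'tn': 'baiduimage', 'ie': 'utf-8', 'word': word},
                      timeout=10)
print(result.url)  # shows the percent-encoded URL that was actually requested
```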
實(shí)戰(zhàn)
豆瓣電影250top
Original article: https://blog.csdn.net/qq_36759224/article/details/101572275
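The script below leans heavily on lxml's XPath support. As a tiny standalone warm-up (the HTML string is made up for illustration and mirrors the list structure the script queries):

```python
from lxml import etree

# A made-up fragment shaped like a Top 250 list entry on Douban
html = '<li><div><div><a href="https://movie.douban.com/subject/1292052/">x</a></div></div></li>'
tree = etree.HTML(html)

# Same XPath pattern index_pages() uses to collect detail-page URLs
print(tree.xpath('//li/div/div/a/@href'))  # ['https://movie.douban.com/subject/1292052/']
```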
```python
import requests
from lxml import etree
import csv
import re
import time
import os

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}


def index_pages(number):
    # fetch one page of the Top 250 list and return the detail-page URLs on it
    url = 'https://movie.douban.com/top250?start=%s&filter=' % number
    index_response = requests.get(url=url, headers=headers)
    tree = etree.HTML(index_response.text)
    m_urls = tree.xpath("//li/div/div/a/@href")
    return m_urls


def parse_pages(url):
    movie_pages = requests.get(url=url, headers=headers)
    parse_movie = etree.HTML(movie_pages.text)
    # ranking
    ranking = parse_movie.xpath("//span[@class='top250-no']/text()")
    # title
    name = parse_movie.xpath("//h1/span[1]/text()")
    # rating
    score = parse_movie.xpath("//div[@class='rating_self clearfix']/strong/text()")
    # number of ratings
    value = parse_movie.xpath("//span[@property='v:votes']/text()")
    number = [" ".join(['Votes:'] + value)]
    # alternative approach kept from the original:
    # value = parse_movie.xpath("//a[@class='rating_people']")
    # string = [value[0].xpath('string(.)')]
    # number = [a.strip() for a in string]
    # print(number)
    # genres
    value = parse_movie.xpath("//span[@property='v:genre']/text()")
    types = [" ".join(['Genre:'] + value)]
    # country/region (the regex must match Douban's Chinese field label)
    value = re.findall('<span class="pl">制片国家/地区:</span>(.*?)<br/>', movie_pages.text)
    country = [" ".join(['Country:'] + value)]
    # language
    value = re.findall('<span class="pl">语言:</span>(.*?)<br/>', movie_pages.text)
    language = [" ".join(['Language:'] + value)]
    # release date
    value = parse_movie.xpath("//span[@property='v:initialReleaseDate']/text()")
    date = [" ".join(['Release date:'] + value)]
    # runtime (this local name shadows the time module only inside this function)
    value = parse_movie.xpath("//span[@property='v:runtime']/text()")
    time = [" ".join(['Runtime:'] + value)]
    # alternative titles
    value = re.findall('<span class="pl">又名:</span>(.*?)<br/>', movie_pages.text)
    other_name = [" ".join(['Also known as:'] + value)]
    # director
    value = parse_movie.xpath("//div[@id='info']/span[1]/span[@class='attrs']/a/text()")
    director = [" ".join(['Director:'] + value)]
    # screenwriter
    value = parse_movie.xpath("//div[@id='info']/span[2]/span[@class='attrs']/a/text()")
    screenwriter = [" ".join(['Screenwriter:'] + value)]
    # cast
    value = parse_movie.xpath("//div[@id='info']/span[3]")
    performer = [value[0].xpath('string(.)')]
    # Douban URL (use the function argument instead of the global loop variable)
    m_url = ['Douban link: ' + url]
    # IMDb link
    value = parse_movie.xpath("//div[@id='info']/a/@href")
    imdb_url = [" ".join(['IMDb link:'] + value)]
    # save the movie poster
    poster = parse_movie.xpath("//div[@id='mainpic']/a/img/@src")
    response = requests.get(poster[0])
    # strip ASCII letters, colons and whitespace from the title to build a file name
    name2 = re.sub(r'[A-Za-z\:\s]', '', name[0])
    poster_name = str(ranking[0]) + ' - ' + name2 + '.jpg'
    dir_name = 'douban_poster'
    if not os.path.exists(dir_name):
        os.mkdir(dir_name)
    poster_path = dir_name + '/' + poster_name
    with open(poster_path, "wb") as f:
        f.write(response.content)
    return zip(ranking, name, score, number, types, country, language, date, time,
               other_name, director, screenwriter, performer, m_url, imdb_url)


def save_results(data):
    # newline='' prevents blank rows on Windows; utf-8-sig keeps Excel happy
    with open('douban.csv', 'a', encoding="utf-8-sig", newline='') as fp:
        writer = csv.writer(fp)
        writer.writerow(data)


if __name__ == '__main__':
    num = 0
    for i in range(0, 250, 25):
        movie_urls = index_pages(i)
        for movie_url in movie_urls:
            results = parse_pages(movie_url)
            for result in results:
                num += 1
                save_results(result)
                print('Movie record %d saved!' % num)
            time.sleep(3)
```
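Since save_results only appends data rows, one optional addition (my suggestion, not part of the original script) is to write a header row once before the crawl starts:

```python
import csv

# Run once before crawling so the CSV columns are labelled
with open('douban.csv', 'w', encoding='utf-8-sig', newline='') as fp:
    csv.writer(fp).writerow(['Ranking', 'Title', 'Score', 'Votes', 'Genre',
                             'Country', 'Language', 'Release date', 'Runtime',
                             'Also known as', 'Director', 'Screenwriter',
                             'Cast', 'Douban link', 'IMDb link'])
```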
Summary

These two scripts cover the beginner workflow end to end: check a site's crawling rules first, fetch the page source with requests, extract the fields you need with regular expressions or XPath, and save the results, images to disk in one case and movie data to a CSV file in the other.