Scrapy in Practice: Crawling Jianshu

Published: 2024/1/1

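In this section we use Scrapy to crawl the whole Jianshu site. On an article detail page, much of the content is loaded asynchronously via Ajax, so the response returned by an ordinary request does not contain the data we want, such as the comment count and the like count. To handle these dynamic requests we use selenium + chromedriver.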

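1. Go to the Taobao mirror at https://npm.taobao.org/mirrors/chromedriver and pick the chromedriver that matches your Chrome version. Put the unzipped chromedriver.exe in the Chrome installation directory.
2. Install selenium: pip install selenium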

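Execution flow of the crawler

First, the request URLs are taken from start_urls.
Each request goes through the downloader middleware, where Selenium loads the page together with its dynamic data.
The response produced by the middleware is handed to the spider, which extracts the data.
The extracted data is passed to the pipeline for storage.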
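
The hand-offs in this flow can be mimicked in plain Python. This is only a toy sketch, not Scrapy's engine: `selenium_middleware`, `parse`, `store`, and `crawl` are hypothetical names, the HTML is made up, and no network access takes place.

```python
# Toy model of the crawl flow; every name here is illustrative, not a Scrapy API.

def selenium_middleware(url):
    """Stand-in for the middleware: pretends a browser rendered the page."""
    body = "<h1 class='title'>demo</h1><span class='likes-count'>like 3</span>"
    return {"url": url, "body": body}

def parse(response):
    """Stand-in for the spider callback: turns a response into an item."""
    return {
        "origin_url": response["url"],
        "has_likes": "likes-count" in response["body"],
    }

def store(item, storage):
    """Stand-in for the pipeline: keeps the item and passes it on."""
    storage.append(item)
    return item

def crawl(start_urls):
    storage = []
    for url in start_urls:                     # 1. take URLs from start_urls
        response = selenium_middleware(url)    # 2. middleware renders the page
        item = parse(response)                 # 3. spider extracts data
        store(item, storage)                   # 4. pipeline stores it
    return storage

items = crawl(["https://www.jianshu.com/"])
print(items)
```

In the real project Scrapy calls each of these stages for us; we only supply the middleware, the spider callback, and the pipeline.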

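Creating the crawler project

1. Enter the virtual environment and create the project:
scrapy startproject jianshu
2. Go into the project directory (jianshu) and create the spider. Since we are crawling the whole site, we can use the crawl template and rely on its rules to follow links:
scrapy genspider -t crawl js jianshu.com
The spiders folder now contains a new file, js.py.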

Defining the fields in items.py

import scrapy


class JianshuItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
    article_id = scrapy.Field()
    origin_url = scrapy.Field()
    author = scrapy.Field()
    avatar = scrapy.Field()
    pub_time = scrapy.Field()
    read_count = scrapy.Field()
    like_count = scrapy.Field()
    word_count = scrapy.Field()
    subjects = scrapy.Field()
    comment_count = scrapy.Field()

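Defining the downloader middleware

Every request and response passes through the downloader middlewares, so this is where we integrate Selenium into Scrapy. After defining the middleware, be sure to enable it in settings.py.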

# middlewares.py
import time

from scrapy.http.response.html import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By


class SeleniumDownloadMiddleware(object):
    def __init__(self):
        # chromedriver.exe was placed in the Chrome installation directory.
        # Note: recent Selenium 4 releases no longer accept executable_path
        # and expect a Service object instead.
        self.driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")

    def process_request(self, request, spider):
        # Load the page in a real browser so the Ajax content gets rendered
        self.driver.get(request.url)
        time.sleep(2)
        try:
            # Keep clicking "show more" until the button can no longer be
            # found or clicked; the exception ends the loop
            while True:
                show_more = self.driver.find_element(By.CLASS_NAME, 'show-more')
                show_more.click()
                time.sleep(0.5)
        except WebDriverException:
            pass
        source = self.driver.page_source
        # Returning a response here makes Scrapy skip its own download
        return HtmlResponse(url=self.driver.current_url, body=source, request=request, encoding='utf-8')
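
The click loop in the middleware ends only when looking up or clicking the button raises an exception. A possible variant, sketched below under stated assumptions, adds an upper bound so a page that always shows the button cannot stall the crawl. `FakeDriver`, `FakeButton`, `find_show_more`, `click_until_gone`, and `max_clicks` are hypothetical names, and `LookupError` stands in for the exception Selenium would raise; the stub lets the sketch run without a browser.

```python
class FakeButton:
    """Stub element: each click uses up one 'show more' expansion."""
    def __init__(self, driver):
        self._driver = driver

    def click(self):
        self._driver.remaining -= 1


class FakeDriver:
    """Stub driver: exposes a 'show-more' button a fixed number of times."""
    def __init__(self, remaining):
        self.remaining = remaining

    def find_show_more(self):
        if self.remaining <= 0:
            raise LookupError("no show-more button")
        return FakeButton(self)


def click_until_gone(driver, max_clicks=20):
    """Click the button until it disappears or max_clicks is reached."""
    clicks = 0
    while clicks < max_clicks:
        try:
            button = driver.find_show_more()
        except LookupError:
            break  # button is gone: everything is expanded
        button.click()
        clicks += 1
    return clicks


print(click_until_gone(FakeDriver(3)))        # 3
print(click_until_gone(FakeDriver(100), 5))   # 5
```

In the real middleware the lookup would be the `find_element` call and the short sleep after each click would stay in place.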

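Designing the spider

Article page URLs on Jianshu follow a specific pattern:
https://www.jianshu.com/p/d65909d2173a
The first part is the domain and the last part is the article's unique ID, so we can use the crawl template to define a Rule that matches these URLs.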
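
To check the pattern outside of Scrapy, a small sketch with the standard `re` module can help. It uses the same `/p/` plus 12 lowercase letters or digits idea as the spider's rule; `ARTICLE_URL` and `extract_article_id` are hypothetical helper names, and the query string in the second call is made up for illustration.

```python
import re

# Same idea as the spider's rule: /p/ followed by 12 lowercase letters or digits
ARTICLE_URL = re.compile(r"/p/([0-9a-z]{12})")


def extract_article_id(url):
    """Return the article ID from a Jianshu article URL, or None."""
    path = url.split("?")[0]  # drop any query string first
    match = ARTICLE_URL.search(path)
    return match.group(1) if match else None


print(extract_article_id("https://www.jianshu.com/p/d65909d2173a"))
print(extract_article_id("https://www.jianshu.com/p/d65909d2173a?utm_source=demo"))
print(extract_article_id("https://www.jianshu.com/u/abc"))
```

The first two calls print d65909d2173a and the last one prints None, since that URL is not an article page.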

# js.py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from jianshu.items import JianshuItem


class JsSpider(CrawlSpider):
    name = 'js'
    allowed_domains = ['jianshu.com']
    start_urls = ['http://jianshu.com/']

    rules = (
        # Article pages look like /p/<12 lowercase letters or digits>
        Rule(LinkExtractor(allow=r'.*/p/[0-9a-z]{12}.*'), callback='parse_detail', follow=True),
    )

    def parse_detail(self, response):
        title = response.xpath("//h1[@class='title']/text()").get()
        avatar = response.xpath("//a[@class='avatar']/img/@src").get()
        author = response.xpath("//span[@class='name']/a/text()").get()
        pub_time = response.xpath("//span[@class='publish-time']/text()").get().replace("*", "")

        # The article ID is the last path segment, without the query string
        url = response.url
        url1 = url.split("?")[0]
        article_id = url1.split('/')[-1]

        content = response.xpath("//div[@class='show-content']").get()

        # Each counter is rendered as "<label> <number>", e.g. "字數 10000",
        # so the number is the last space-separated token
        word_count_list = response.xpath("//span[@class='wordage']/text()").get().split(' ')
        word_count = int(word_count_list[-1])
        comment_count_list = response.xpath("//span[@class='comments-count']/text()").get().split(' ')
        comment_count = int(comment_count_list[-1])
        read_count_list = response.xpath("//span[@class='views-count']/text()").get().split(' ')
        read_count = int(read_count_list[-1])
        like_count_list = response.xpath("//span[@class='likes-count']/text()").get().split(' ')
        like_count = int(like_count_list[-1])

        # Collections the article belongs to, joined into one string
        subjects = ",".join(response.xpath("//div[@class='include-collection']/a/div/text()").getall())

        item = JianshuItem(
            title=title,
            avatar=avatar,
            author=author,
            pub_time=pub_time,
            origin_url=response.url,
            article_id=article_id,
            content=content,
            subjects=subjects,
            word_count=word_count,
            comment_count=comment_count,
            read_count=read_count,
            like_count=like_count,
        )
        yield item
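
The counters in the spider are read by splitting the text on spaces and converting the last token, which fails if an element is missing or its text has no number. A more forgiving helper might look like the sketch below; `parse_count` and its `default` parameter are hypothetical and not part of the original spider.

```python
def parse_count(text, default=0):
    """Return the trailing integer of a label such as '字數 10000'.

    Falls back to `default` when the text is missing or has no number.
    """
    if not text:
        return default
    try:
        return int(text.split()[-1])
    except (ValueError, IndexError):
        return default


print(parse_count("字數 10000"))   # 10000
print(parse_count(None))           # 0
print(parse_count("評論"))         # 0
```

In the spider this could wrap each `.get()` result, for example `parse_count(response.xpath("//span[@class='wordage']/text()").get())`.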

Designing the pipeline

Here the items returned by the spider (js.py) are saved to the database. The code below accesses the database synchronously.

import pymysql


class JianshuPipeline(object):
    def __init__(self):
        dbparams = {
            'host': '127.0.0.1',
            'port': 3306,
            'user': 'root',
            'password': '123456',
            'database': 'jianshu',
            'charset': 'utf8',
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()
        self._sql = None

    def process_item(self, item, spider):
        # Parameterized insert; the values follow the column order in self.sql
        self.cursor.execute(self.sql, (
            item['title'], item['content'], item['author'], item['avatar'],
            item['pub_time'], item['origin_url'], item['article_id'],
            item['read_count'], item['like_count'], item['word_count'],
            item['subjects'], item['comment_count'],
        ))
        self.conn.commit()
        return item

    @property
    def sql(self):
        # Build the INSERT statement once and reuse it afterwards
        if not self._sql:
            self._sql = """insert into js(id,title,content,author,avatar,pub_time,origin_url,article_id,read_count,like_count,word_count,subjects,comment_count) values (null,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
        return self._sql
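
The same parameterized insert can be tried without a MySQL server. The sketch below swaps in the standard library's `sqlite3` with an in-memory database purely for illustration; it is not the pipeline above. `COLUMNS`, `build_insert_sql`, and `save_item` are hypothetical names, every column is declared as TEXT for simplicity, and sqlite3 uses `?` placeholders where pymysql uses `%s`.

```python
import sqlite3

COLUMNS = ("title", "content", "author", "avatar", "pub_time", "origin_url",
           "article_id", "read_count", "like_count", "word_count",
           "subjects", "comment_count")


def build_insert_sql(table, columns):
    """Build a parameterized INSERT with one placeholder per column."""
    placeholders = ", ".join("?" for _ in columns)
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"


def save_item(conn, item):
    """Insert one item synchronously and commit, like process_item does."""
    conn.execute(build_insert_sql("js", COLUMNS),
                 tuple(item[name] for name in COLUMNS))
    conn.commit()


# In-memory database standing in for the MySQL 'jianshu' database
conn = sqlite3.connect(":memory:")
column_defs = ", ".join(f"{name} TEXT" for name in COLUMNS)
conn.execute(f"CREATE TABLE js (id INTEGER PRIMARY KEY, {column_defs})")

save_item(conn, {name: "demo" for name in COLUMNS})
print(conn.execute("SELECT COUNT(*) FROM js").fetchone()[0])  # 1
```

Deriving the value tuple from one column list keeps the column order and the values in step, which is easy to get wrong when both are written out by hand.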

settings.py

For the middleware defined above to take effect, it must be enabled in settings.py. The settings to change are:

Set the User-Agent
Disable the robots.txt protocol
Set a reasonable download delay, otherwise the server will block the crawler
Enable the downloader middleware
Enable the pipeline

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36'
}
DOWNLOADER_MIDDLEWARES = {
    'jianshu.middlewares.SeleniumDownloadMiddleware': 543,
}
ITEM_PIPELINES = {
    'jianshu.pipelines.JianshuPipeline': 300,
}
......
