Writing a Crawler with Scrapy (1): Quick Start
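Preface
Many people write crawlers in Python, and Python has many crawler frameworks, such as pyspider and Scrapy. I lean toward Scrapy, so this article uses it to build a small crawler demo. It is intended for developers who have some Python background and a basic understanding of crawlers.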
Installing Scrapy
Check the environment first. Here Python is 3.6.2 and pip is 9.0.1:

    F:\techlee\python>python --version
    Python 3.6.2

    F:\techlee\python>pip --version
    pip 9.0.1 from d:\program files\python\python36-32\lib\site-packages (python 3.6)

Install the Scrapy framework:

    F:\techlee\python>pip install scrapy
    Collecting scrapy
      Downloading Scrapy-1.4.0-py2.py3-none-any.whl (248kB)
        100% |████████████████████████████████| 256kB 188kB/s
    // a long installation process
    Successfully installed Twisted-17.9.0 scrapy-1.4.0

If the installation fails with this error:

    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

install the Visual C++ 2015 Build Tools from http://landinghub.visualstudio.com/visual-cpp-build-tools
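
Besides the command line, the installation can also be checked from Python itself. The sketch below is only an illustration: the helper name check_environment and the minimum version (3, 6) are assumptions chosen to match the Python 3.6.2 setup used in this article, not anything Scrapy prescribes. It reports the interpreter version and whether each named module can be found.

```python
import importlib.util
import sys


def check_environment(modules=("scrapy",), min_python=(3, 6)):
    """Describe the interpreter and which top-level modules are importable.

    check_environment and min_python are hypothetical choices for this
    sketch; (3, 6) simply mirrors the Python 3.6.2 used in the article.
    """
    return {
        "python_version": tuple(sys.version_info[:3]),
        "python_ok": tuple(sys.version_info[:2]) >= tuple(min_python),
        # find_spec returns None when a top-level module cannot be located.
        "modules": {
            name: importlib.util.find_spec(name) is not None
            for name in modules
        },
    }


if __name__ == "__main__":
    report = check_environment()
    print("Python", ".".join(str(part) for part in report["python_version"]))
    for name, found in report["modules"].items():
        print(name, "found" if found else "missing")
```

Passing a different tuple of module names checks other dependencies in the same way.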
Once the installation completes, confirm the version:

    F:\techlee\python>scrapy version
    Scrapy 1.4.0

Creating the project

    F:\techlee\python>scrapy startproject scrapyDemo
    New Scrapy project 'scrapyDemo', using template directory 'd:\\program files\\python\\python36-32\\lib\\site-packages\\scrapy\\templates\\project', created in:
        F:\techlee\python\scrapyDemo

    You can start your first spider with:
        cd scrapyDemo
        scrapy genspider example example.com

Directory structure
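
    scrapyDemo/
        scrapy.cfg            # deployment configuration file
        scrapyDemo/           # the project's Python module
            __init__.py
            items.py          # data containers
            pipelines.py      # project pipelines file
            settings.py       # configuration file
            spiders/          # Spider classes define how to crawl one or more sites
                __init__.py

Create the class that performs the crawl, ImoocSpider, in scrapyDemo/spiders: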
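
    # -*- coding: utf-8 -*-
    import scrapy
    from urllib import parse as urlparse


    # Spider for imooc.com (慕課網)
    class ImoocSpider(scrapy.Spider):
        # The name is how Scrapy locates (and instantiates) the spider, so it must be unique.
        name = "imooc"
        # URLs where the crawl starts
        start_urls = ['http://www.imooc.com/course/list']
        # URLs whose domain is not in this list are not crawled.
        allowed_domains = ['www.imooc.com']

        def parse(self, response):
            # Select the course links (anchors with class "item") on the list page.
            learn_nodes = response.css('a.item')
            for learn_node in learn_nodes:
                learn_url = learn_node.css("::attr(href)").extract_first()
                if not learn_url:
                    continue  # skip anchors without an href
                # Resolve the href against the list page URL before requesting it.
                yield scrapy.Request(
                    url=urlparse.urljoin(response.url, learn_url),
                    callback=self.parse_learn,
                )

        def parse_learn(self, response):
            title = response.xpath('//h2[@class="l"]/text()').extract_first()
            # Course summary; extracted but not printed in this demo.
            content = response.xpath('//div[@class="course-brief"]/p/text()').extract_first()
            url = response.url
            # extract_first() returns None when nothing matches, so format the
            # value instead of concatenating. 標題 = title, 地址 = URL.
            print('標題:{}'.format(title))
            print('地址:{}'.format(url))

Start crawling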

    F:\techlee\python\scrapyDemo>scrapy crawl imooc

If the following error appears, the win32api library is missing:

    import win32api
    ModuleNotFoundError: No module named 'win32api'

Choose the matching version and install it. Download address: https://sourceforge.net/projects/pywin32/files/pywin32/Build 221/

All done
If you see output like the following, the crawl succeeded:

    F:\techlee\python\scrapyDemo>scrapy crawl imooc
    2017-10-17 14:28:32 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapyDemo)
    ……
    2017-10-17 14:28:32 [scrapy.core.engine] INFO: Spider opened
    2017-10-17 14:28:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2017-10-17 14:28:32 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2017-10-17 14:28:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/robots.txt> (referer: None)
    2017-10-17 14:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/course/list> (referer: None)
    2017-10-17 14:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/876> (referer: http://www.imooc.com/course/list)
    標題:集成MultiDex項目實戰
    地址:http://www.imooc.com/learn/876
    2017-10-17 14:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/893> (referer: http://www.imooc.com/course/list)
    標題:阿里D2前端技術論壇——2016初心
    地址:http://www.imooc.com/learn/893
    2017-10-17 14:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/890> (referer: http://www.imooc.com/course/list)
    2017-10-17 14:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/888> (referer: http://www.imooc.com/course/list)
    標題:Hadoop進階
    地址:http://www.imooc.com/learn/890
    標題:Javascript實現二叉樹算法
    地址:http://www.imooc.com/learn/888
    2017-10-17 14:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/894> (referer: http://www.imooc.com/course/list)
    標題:Fragment應用上
    地址:http://www.imooc.com/learn/894
    2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/887> (referer: http://www.imooc.com/course/list)
    標題:PHP-面向對象
    地址:http://www.imooc.com/learn/887
    2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/900> (referer: http://www.imooc.com/course/list)
    2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/889> (referer: http://www.imooc.com/course/list)
    2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/901> (referer: http://www.imooc.com/course/list)
    標題:Sketch的基礎實例應用
    地址:http://www.imooc.com/learn/900
    標題:ElasticSearch入門
    地址:http://www.imooc.com/learn/889
    標題:使用Google Guice實現依賴注入
    地址:http://www.imooc.com/learn/901
    2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/867> (referer: http://www.imooc.com/course/list)
    標題:Docker入門
    地址:http://www.imooc.com/learn/867
    2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/878> (referer: http://www.imooc.com/course/list)
    標題:Android圖表繪制之直方圖
    地址:http://www.imooc.com/learn/878
    2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/892> (referer: http://www.imooc.com/course/list)
    標題:UI版式設計
    地址:http://www.imooc.com/learn/892
    2017-10-17 14:28:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/877> (referer: http://www.imooc.com/course/list)
    2017-10-17 14:28:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/886> (referer: http://www.imooc.com/course/list)
    標題:RxJava與RxAndroid基礎入門
    地址:http://www.imooc.com/learn/877
    標題:iOS開發之Audio特輯
    地址:http://www.imooc.com/learn/886
    2017-10-17 14:28:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/861> (referer: http://www.imooc.com/course/list)
    標題:基于Websocket的火拼俄羅斯(基礎)
    地址:http://www.imooc.com/learn/861
    2017-10-17 14:28:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/895> (referer: http://www.imooc.com/course/list)
    2017-10-17 14:28:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/882> (referer: http://www.imooc.com/course/list)
    標題:2017AWS 技術峰會——大數據技術專場
    地址:http://www.imooc.com/learn/895
    標題:基于websocket的火拼俄羅斯(單機版)
    地址:http://www.imooc.com/learn/882
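
Every course page in the log has the list page as its referer, because parse joins each href with response.url before requesting it. The sketch below reproduces that step with the standard library only, so it runs without Scrapy. The helper name resolve_links and the sample hrefs are hypothetical, and the domain check is a simplified stand-in for what allowed_domains does, not Scrapy's own filtering code.

```python
from urllib.parse import urljoin, urlparse

LIST_URL = "http://www.imooc.com/course/list"  # start URL used by the spider
ALLOWED_DOMAINS = ["www.imooc.com"]            # same value as allowed_domains


def resolve_links(base_url, hrefs, allowed_domains):
    """Join each href with base_url and keep only URLs on an allowed host.

    Hypothetical helper for illustration; the host check is a simplified
    approximation of domain filtering, not Scrapy's implementation.
    """
    resolved = []
    for href in hrefs:
        if not href:
            continue  # an anchor without an href yields None
        absolute = urljoin(base_url, href)
        host = urlparse(absolute).hostname or ""
        # Accept an exact match or a subdomain of an allowed domain.
        if any(host == d or host.endswith("." + d) for d in allowed_domains):
            resolved.append(absolute)
    return resolved


if __name__ == "__main__":
    # Sample hrefs are made up for this sketch.
    sample_hrefs = ["/learn/876", "http://example.com/other", None]
    for link in resolve_links(LIST_URL, sample_hrefs, ALLOWED_DOMAINS):
        print(link)
```

With these sample inputs only http://www.imooc.com/learn/876 is kept: the relative path resolves against the list page, the example.com link is outside the allowed domain, and the empty href is skipped.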
