Beware of the default deny_extensions of SgmlLinkExtractor in scrapy
While using scrapy for a crawl I ran into a problem that took a long time to resolve, chiefly because the logs gave no clue about the cause; in the end I only found it by reading the source code. This post records the symptom and the fix.
現(xiàn)象:
A page contains a large number of links whose URLs match the regular expression r"en/descriptions/[\d]+/[-:\.\w]+$". Most of them are crawled fine, but some, such as en/descriptions/32725456/not-a-virus:Client-SMTP.Win32.Blat.ai and en/descriptions/33444568/not-a-virus:Client-SMTP.Win32.Blat.au, are never fetched, and the log contains no hint as to why.
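As a quick sanity check (a standalone standard-library snippet, not part of the spider), both of the missing URLs do match the allow pattern, so the regex itself is not what drops them:

import re

# Both problem URLs match the rule's allow pattern, so the pattern is not
# the reason they are never fetched.
pattern = re.compile(r"en/descriptions/[\d]+/[-:\.\w]+$")
for url in (
    "en/descriptions/32725456/not-a-virus:Client-SMTP.Win32.Blat.ai",
    "en/descriptions/33444568/not-a-virus:Client-SMTP.Win32.Blat.au",
):
    print("%s -> %s" % (url, pattern.search(url) is not None))  # True for both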
Analysis:
The first suspect was the Rule definitions of the CrawlSpider, but testing showed those were fine. That left the definition of the SgmlLinkExtractor itself. Tracing with a process_value callback showed that SgmlLinkExtractor parses the links without problems, yet discards some of them before returning.
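For reference, the tracing described above needs nothing more than a pass-through callback. A minimal sketch (the name trace_process_value is mine; process_value itself is a real SgmlLinkExtractor argument):

# Sketch of a tracing hook: process_value is invoked for each raw link value
# parsed from the page, so it also shows links that are parsed but later
# discarded by the allow/deny/extension filters.
def trace_process_value(value):
    print("extractor parsed: %s" % value)
    return value  # return the value unchanged to keep normal behaviour

# Used as: SgmlLinkExtractor(allow=(...,), process_value=trace_process_value)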
通過源碼發(fā)現(xiàn),scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor在默認情況下將deny_extensions 設置為scrapy.linkextractor.IGNORED_EXTENSIONS,SgmlLinkExtractor在extract_links的時候調(diào)用_process_links, _process_links又調(diào)用了_link_allowed,在_link_allowed中依據(jù)種種過則對所有的link進行過濾,過濾規(guī)則中就有deny_extensions。默認IGNORED_EXTENSIONS將ai,au都包含了。所以也就出現(xiàn)了ai,au為結(jié)尾的link被過濾。至此真正的問題出處算是找到了。
Solution:
根據(jù)源碼分析的結(jié)果,在定義SgmlLinkExtractor時重新定義deny_extensions即可。比如
rules = (
    Rule(SgmlLinkExtractor(allow=(r"en/descriptions\?", )), follow=True),
    Rule(SgmlLinkExtractor(allow=(r"en/descriptions/[\d]+/[-:\.\w]+$", ),
                           deny_extensions=""),
         callback="parse_item", follow=True),
)

Passing deny_extensions="" (an empty sequence works equally well) leaves self.deny_extensions empty, so no extension-based filtering takes place, as the constructor below shows. The relevant parts of the scrapy source:
scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor:
class SgmlLinkExtractor(BaseSgmlLinkExtractor):

    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(),
                 restrict_xpaths=(), tags=('a', 'area'), attrs=('href'),
                 canonicalize=True, unique=True, process_value=None,
                 deny_extensions=None):
        self.allow_res = [x if isinstance(x, _re_type) else re.compile(x)
                          for x in arg_to_iter(allow)]
        self.deny_res = [x if isinstance(x, _re_type) else re.compile(x)
                         for x in arg_to_iter(deny)]
        self.allow_domains = set(arg_to_iter(allow_domains))
        self.deny_domains = set(arg_to_iter(deny_domains))
        self.restrict_xpaths = tuple(arg_to_iter(restrict_xpaths))
        self.canonicalize = canonicalize
        if deny_extensions is None:
            deny_extensions = IGNORED_EXTENSIONS
        self.deny_extensions = set(['.' + e for e in deny_extensions])
        tag_func = lambda x: x in tags
        attr_func = lambda x: x in attrs
        BaseSgmlLinkExtractor.__init__(self, tag=tag_func, attr=attr_func,
                                       unique=unique, process_value=process_value)

    def extract_links(self, response):
        base_url = None
        if self.restrict_xpaths:
            hxs = HtmlXPathSelector(response)
            html = ''.join(''.join(html_fragm for html_fragm in
                                   hxs.select(xpath_expr).extract())
                           for xpath_expr in self.restrict_xpaths)
            base_url = get_base_url(response)
        else:
            html = response.body
        links = self._extract_links(html, response.url, response.encoding, base_url)
        links = self._process_links(links)
        return links

    def _process_links(self, links):
        links = [x for x in links if self._link_allowed(x)]
        links = BaseSgmlLinkExtractor._process_links(self, links)
        return links

    def _link_allowed(self, link):
        parsed_url = urlparse(link.url)
        allowed = _is_valid_url(link.url)
        if self.allow_res:
            allowed &= _matches(link.url, self.allow_res)
        if self.deny_res:
            allowed &= not _matches(link.url, self.deny_res)
        if self.allow_domains:
            allowed &= url_is_from_any_domain(parsed_url, self.allow_domains)
        if self.deny_domains:
            allowed &= not url_is_from_any_domain(parsed_url, self.deny_domains)
        if self.deny_extensions:
            allowed &= not url_has_any_extension(parsed_url, self.deny_extensions)
        if allowed and self.canonicalize:
            link.url = canonicalize_url(parsed_url)
        return allowed

    def matches(self, url):
        if self.allow_domains and not url_is_from_any_domain(url, self.allow_domains):
            return False
        if self.deny_domains and url_is_from_any_domain(url, self.deny_domains):
            return False
        allowed = [regex.search(url) for regex in self.allow_res] if self.allow_res else [True]
        denied = [regex.search(url) for regex in self.deny_res] if self.deny_res else []
        return any(allowed) and not any(denied)
scrapy.linkextractor.IGNORED_EXTENSIONS:
IGNORED_EXTENSIONS = [
    # images
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif',
    'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg',

    # audio
    'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',

    # video
    '3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv',
    'm4a',

    # office suites
    'xls', 'ppt', 'doc', 'docx', 'odt', 'ods', 'odg', 'odp',

    # other
    'css', 'pdf', 'doc', 'exe', 'bin', 'rss', 'zip', 'rar',
]
轉(zhuǎn)載于:https://www.cnblogs.com/Jerryshome/archive/2012/10/23/2735129.html