How to Defeat Qidian (起点中文网) Font-Based Anti-Scraping: The Reading Index Ranking as an Example
I. Preparation
1. Observing the target site
① The pages to be scraped are static.
② The site uses font-based anti-scraping: the key figures are garbled in the HTML.
II. Qidian's font-based anti-scraping
1. What is font-based anti-scraping?
① It is an anti-scraping measure some sites adopt.
② By serving a custom font file, the site makes the page render correctly in the browser while the same text sits in the HTML as garbled character codes (the sketch after this list makes the mapping concrete).
③ The custom font file is usually randomized and changes on every request. An even nastier variant adds a second layer of randomness inside the font file itself, so that even with the font file in hand it is hard to recover the correct mapping from the font's Unicode code points to the text actually displayed. That level of anti-scraping probably has to be tackled with machine learning, and this post does not attempt it.
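A minimal sketch of the idea, with made-up code points and glyph names (the real values differ on every request):

```python
# Hypothetical cmap taken from a downloaded font file:
# Unicode code point -> glyph name. Real values are randomized per request.
cmap = {100243: 'seven', 100240: 'two', 100241: 'five'}

# Glyph name -> the digit the browser actually draws.
WORD_MAP = {'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4',
            'five': '5', 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9'}

# The HTML carries the digits as numeric character references like '&#100243;'.
garbled = '&#100243;&#100240;&#100241;'
decoded = ''.join(WORD_MAP[cmap[int(ref[2:])]]       # '&#100243' -> 100243 -> 'seven' -> '7'
                  for ref in garbled.split(';') if ref)
print(decoded)  # 725
```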
2. Analyzing Qidian's font-based anti-scraping
① Going through the whole HTML turns up the URL for downloading the woff font file; the most important part is excerpted here.
② The code values correspond to the numeric character codes behind the garbled text in the HTML page, and the name values appear to be the English words for the digits (illustrated in the snippet below).
③ At this point the approach should be clear. This is probably the simplest kind of font anti-scraping, though. 58.com (58同城) requires finding one more level of correspondence, where the final answer should be the id in the GlyphOrder tag minus 1. Maoyan (猫眼电影) is probably the hardest: it is the double-randomized variant described above and may take something like a KNN algorithm to solve.
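To show the two extraction steps concretely, here is a small self-contained snippet. The fragment's shape is inferred from the regexes used in the spider below; the hypFont name, the example.com URLs, and the code points are placeholders, not real Qidian values:

```python
import re

# Illustrative fragment shaped after the regexes in the spider below.
html = ("<style>@font-face{font-family:hypFont;"
        "src:url('https://example.com/hypFont.woff') format('woff'),"
        "url('https://example.com/hypFont.ttf') format('truetype');}"
        '</style><span style="font-family:hypFont">&#100243;&#100240;</span>')

# The font link is the quoted URL sitting between 'woff' and 'truetype'
# (in this fragment, the truetype source of the same @font-face rule).
style = re.findall(r'<style>(.*?)\s*</style>', html, re.S)[0]
font_url = re.search(r"woff.*?url.*?'(.+?)'.*?truetype", style).group(1)
print(font_url)  # https://example.com/hypFont.ttf

# The garbled digits are numeric character references inside a <span>.
spans = re.findall(r'</style><span.*?>(.*?)</span>', html, re.S)
print(spans)     # ['&#100243;&#100240;']
```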
III. Code logic
1. Request the starting pages to collect the basic information and the URLs of the detail pages.
2. On each detail page, use re regular expressions to extract the font download URL and the garbled spans.
3. Parse the font file with fontTools.ttLib to obtain its cmap (a minimal version of this step follows the list).
4. Combine the cmap with the numeric character codes from the garbled spans to recover the real numbers.
5. Run scrapy crawl main -o res.csv to save the results as a CSV file.
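Step 3 in isolation, as a sketch; the font URL here is a placeholder for the link extracted in step 2 (fontTools comes from pip install fonttools):

```python
import requests
from io import BytesIO
from fontTools.ttLib import TTFont

# Placeholder: in the spider this URL comes out of the page's <style> block.
font_url = 'https://example.com/hypFont.ttf'

font = TTFont(BytesIO(requests.get(font_url).content))
# getBestCmap() returns {code point: glyph name}, e.g. {100243: 'seven', ...}
print(font.getBestCmap())
font.close()
```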
IV. The code
main.py
```python
import re
import time
from io import BytesIO

import requests
import scrapy
from fontTools.ttLib import TTFont

from ..items import QidianxiaoshuoItem


class MainSpider(scrapy.Spider):
    name = 'main'
    # allowed_domains = ['main.com']
    # start_urls = ['https://www.qidian.com/rank/readIndex?page=1',
    #               'https://www.qidian.com/rank/readIndex?page=2']
    start_urls = [f'https://www.qidian.com/rank/readIndex?page={i}' for i in range(1, 6)]

    def get_font(self, url):
        # download the font file and return its cmap (code point -> glyph name)
        time.sleep(1)
        response = requests.get(url)
        font = TTFont(BytesIO(response.content))
        cmap = font.getBestCmap()
        font.close()
        return cmap

    def get_encode(self, cmap, values):
        # translate one garbled span, e.g. '&#100243;&#100240;', into digits
        WORD_MAP = {'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4',
                    'five': '5', 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9',
                    'period': '.'}
        word_count = ''
        refs = values.split(';')
        refs.pop(-1)  # drop the empty piece left after the trailing ';'
        for ref in refs:
            code = int(ref[2:])  # strip the leading '&#'
            key = cmap[code]     # glyph name such as 'seven'
            word_count += WORD_MAP[key]
        return word_count

    def get_nums(self, url):
        # fetch the html of the current page
        time.sleep(1)
        response = requests.get(url).text
        pattern = re.compile(r'</style><span.*?>(.*?)</span>', re.S)
        # all garbled digit spans on the page
        numberlist = re.findall(pattern, response)
        # the <style> block containing the font file link
        reg = re.compile(r'<style>(.*?)\s*</style>', re.S)
        fonturl = re.findall(reg, response)[0]
        # extract the font file link with a regex
        url = re.search(r"woff.*?url.*?'(.+?)'.*?truetype", fonturl).group(1)
        cmap = self.get_font(url)
        print('cmap:', cmap)
        num_list = []
        for a in numberlist:
            num_list.append(self.get_encode(cmap, a))
        return num_list

    def parse(self, response):
        res = response.xpath('//*[@id="rank-view-list"]/div/ul/li')
        for i in res:
            url = i.css('div:nth-child(1) a::attr(href)').extract_first()
            url = 'https:' + url
            yield scrapy.Request(url, callback=self.parse_one, meta={'url': url})

    def parse_one(self, response):
        book_name = response.css('div.book-info h1 em::text').extract_first()
        author = response.css('a.writer::text').extract_first()
        intro = response.xpath('/html/body/div/div[6]/div[1]/div[2]/p[2]/text()').extract_first()
        num_list = self.get_nums(response.meta['url'])
        del num_list[1]  # the second decoded number is not needed here
        word_num = str(num_list[0]) + response.xpath(
            '/html/body/div/div[6]/div[1]/div[2]/p[3]/cite[1]/text()').extract_first()
        recommend_all = str(num_list[1]) + response.xpath(
            '/html/body/div/div[6]/div[1]/div[2]/p[3]/cite[2]/text()').extract_first()
        recommend_week = str(num_list[2]) + response.xpath(
            '/html/body/div/div[6]/div[1]/div[2]/p[3]/cite[3]/text()').extract_first()
        item = QidianxiaoshuoItem()
        item['book_name'] = book_name
        item['author'] = author
        item['intro'] = intro
        item['word_num'] = word_num
        item['recommend_all'] = recommend_all
        item['recommend_week'] = recommend_week
        yield item
```

items.py
```python
import scrapy


class QidianxiaoshuoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    book_name = scrapy.Field()
    author = scrapy.Field()
    intro = scrapy.Field()
    word_num = scrapy.Field()
    recommend_all = scrapy.Field()
    recommend_week = scrapy.Field()
```

Final results:
One of the detail pages:
The CSV file:
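A small optional addition that is not part of the original post: if Excel opens res.csv and shows the Chinese fields as mojibake, Scrapy's standard FEED_EXPORT_ENCODING setting writes the file with a BOM so Excel detects UTF-8:

```python
# settings.py (optional): export the CSV as UTF-8 with a BOM
FEED_EXPORT_ENCODING = 'utf-8-sig'
```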
V. Summary
Pages protected by font anti-scraping call for extra patience: you have to comb through page after page of source code to find the pieces you need (the woff download URL and the garbled spans).