Background
CSDN is home to many excellent technical writers whose articles draw large numbers of views, follows, and likes from fellow developers. Setting aside one's own technical specialty, how can the rest of us write such hits? Should we write dense, in-depth analyses, or beginner-friendly introductions? Which topics and technical directions actually draw the most attention?
Goal
By analyzing the titles on CSDN's hot rankings, work out how to craft an eye-catching title and identify the hot words in popular article content.
Implementation
Technologies
Scrapy for data crawling
jieba for Chinese word segmentation
wordcloud for generating word-cloud images
Preliminary analysis
Using the browser's developer tools, we can identify the URLs we need (a quick probe of one endpoint follows the list):
Site-wide overall ranking
https://blog.csdn.net/phoenix/web/blog/hotRank?page=0&pageSize=20
Domain content ranking
https://blog.csdn.net/phoenix/web/blog/hotRank?page=0&pageSize=20&child_channel=大數據
Weekly author ranking
https://blog.csdn.net/phoenix/web/blog/weeklyRank?page=0&pageSize=20
Overall author ranking
https://blog.csdn.net/phoenix/web/blog/allRank?page=0&pageSize=20
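Before writing any spiders, it is worth confirming what JSON these endpoints return. A minimal probe, sketched with requests; the code/data response shape and the field names are the ones the spiders below consume, while the need for a browser-like User-Agent header is an assumption:

import requests

# Hypothetical probe of the hotRank endpoint; a browser UA may be required
resp = requests.get(
    "https://blog.csdn.net/phoenix/web/blog/hotRank?page=0&pageSize=20",
    headers={"User-Agent": "Mozilla/5.0"})
body = resp.json()
if body["code"] == 200:
    for article in body["data"]:
        print(article["articleTitle"], article["viewCount"])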
From the weekly/overall author rankings we can obtain each popular author's homepage; visiting the homepage with articles sorted by view count yields their most-viewed articles.
Hot articles of each popular blogger
https://blog.csdn.net/qq_37751911/?orderby=ViewCount
Crawling the content
Creating the spiders
Create six spiders; the module-level constants they share are sketched right after the list:
SummaryArticleSpider crawls the article list of the site-wide overall hot ranking
ContentArticleSpider crawls the article lists of the domain content hot rankings
WeekAuthorSpider crawls the weekly hot-ranking author list, including homepage links
WeekArticleSpider visits each weekly hot-ranking author's homepage and crawls their popular articles
AllAuthorSpider crawls the overall hot-ranking author list, including homepage links
AllArticleSpider visits each overall hot-ranking author's homepage and crawls their popular articles
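The spiders below reference several module-level constants (RANK_TYPES, BASE_OUTPUT_DIR, CURRENT_DATE, MAX_PAGE_NUM, PAGE_SIZE) whose definitions are not shown. A plausible sketch; only the names and the indices the code uses are taken from the spiders, the values and the remaining list positions are assumptions:

import datetime

BASE_OUTPUT_DIR = "output"                                 # assumed
CURRENT_DATE = datetime.date.today().strftime("%Y%m%d")    # e.g. '20210719'
MAX_PAGE_NUM = 5    # pages fetched per ranking (assumed)
PAGE_SIZE = 20      # items per page, matching the URLs above
# Indices used by the code: [1] content articles, [2] weekly authors,
# [4] weekly articles; the other positions are guesses
RANK_TYPES = ["SummaryRankArticle", "ContentRankArticle", "WeekRankAuthor",
              "AllRankAuthor", "WeekRankArticle", "AllRankArticle"]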
# ContentArticleSpider
class ContentArticleSpider(scrapy.Spider):
    name = 'content'
    allowed_domains = ['blog.csdn.net']
    start_url = 'https://blog.csdn.net/phoenix/web/blog/hotRank?page={0}&pageSize={1}&child_channel={2}'
    content_types = ["c/c++", "java", "javascript", "php", "python", "人工智能", "區塊鏈",
                     "大數據", "移動開發", "嵌入式", "開發工具", "數據結構與算法", "測試",
                     "游戲", "網絡", "運維"]
    rank_type = RANK_TYPES[1]

    def __init__(self):
        # Create the dated output directory and write the CSV header row
        parent_dir = "{0}/{1}".format(BASE_OUTPUT_DIR, CURRENT_DATE)
        if not os.path.isdir(parent_dir):
            os.mkdir(parent_dir)
            logging.info("目錄{0}已創建".format(parent_dir))
        article_file_name = "{0}/{1}.csv".format(parent_dir, self.rank_type)
        with open(article_file_name, mode="w") as f:
            f.write("知識領域,標題,評論數,收藏數,瀏覽數,文章鏈接\n")

    def start_requests(self):
        # One request per (channel, page) combination
        for content_type in self.content_types:
            for page_num in range(MAX_PAGE_NUM):
                yield scrapy.Request(
                    self.start_url.format(page_num, PAGE_SIZE, content_type.lower()),
                    callback=self.parse,
                    dont_filter=True)

    def parse(self, response):
        response_str = str(response.body, 'utf-8')
        response_json = json.loads(response_str)
        # The channel name is the last query parameter of the request URL
        content_type = urllib.parse.unquote(str(response.url).split("=")[-1])
        if response_json['code'] == 200:
            articles = response_json['data']
            for article in articles:
                article_title = article['articleTitle']
                article_url = article['articleDetailUrl']
                comment_count = article['commentCount']
                favor_count = article['favorCount']
                view_count = article['viewCount']
                yield {
                    "rankType": self.rank_type,
                    "contentType": content_type,
                    "articleTitle": article_title,
                    "articleUrl": article_url,
                    "commentCount": comment_count,
                    "favorCount": favor_count,
                    "viewCount": view_count}
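The spider only writes the CSV header in __init__; how the yielded items reach the file is not shown. One plausible approach is a Scrapy item pipeline, enabled via ITEM_PIPELINES in the project settings. A minimal sketch, where the pipeline name is hypothetical and the field orders simply mirror the headers written above:

import csv

# Hypothetical pipeline: append each yielded item to its rank type's CSV.
# BASE_OUTPUT_DIR and CURRENT_DATE are the assumed constants sketched earlier.
FIELD_ORDER = {
    "ContentRankArticle": ["contentType", "articleTitle", "commentCount",
                           "favorCount", "viewCount", "articleUrl"],
    "WeekRankArticle": ["nickName", "codeAge", "articleTitle",
                        "commentCount", "viewCount", "articleUrl"],
    # ... one entry per rank type
}

class CsvWriterPipeline:
    def process_item(self, item, spider):
        rank_type = item["rankType"]
        path = "{0}/{1}/{2}.csv".format(BASE_OUTPUT_DIR, CURRENT_DATE, rank_type)
        with open(path, mode="a", newline="") as f:
            # csv.writer quotes any title that itself contains a comma
            csv.writer(f).writerow([item.get(k, "") for k in FIELD_ORDER[rank_type]])
        return item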
# WeekArticleSpider
class WeekArticleSpider(scrapy.Spider):
    name = 'week_article'
    allowed_domains = ['blog.csdn.net']
    start_url = 'https://blog.csdn.net/{1}'  # not used; homepage URLs come from the author CSV
    urls = {}
    rank_type = RANK_TYPES[4]

    def __init__(self):
        # Read the weekly author CSV written earlier and map each blogger
        # to their homepage sorted by view count
        parent_dir = "{0}/{1}".format(BASE_OUTPUT_DIR, CURRENT_DATE)
        author_file_name = "{0}/{1}.csv".format(parent_dir, RANK_TYPES[2])
        if not os.path.isdir(parent_dir):
            os.mkdir(parent_dir)
            logging.info("目錄{0}已創建".format(parent_dir))
        with open(author_file_name, mode='r') as lines:
            next(lines)  # skip the header row
            for line in lines:
                fields = line.strip().split(",")
                self.urls[fields[0]] = "{0}?orderby=ViewCount".format(fields[1])
        article_file_name = "{0}/{1}.csv".format(parent_dir, self.rank_type)
        with open(article_file_name, mode="w") as f:
            f.write("博主,碼齡,標題,評論數,瀏覽數,文章鏈接\n")

    def start_requests(self):
        for user_name, url in self.urls.items():
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    @retry(tries=3, delay=5000)
    def parse(self, response):
        response_html = Selector(response)
        try:
            # Profile sidebar: coding age and nickname
            age = response_html.xpath(
                '//*[@id="asideProfile"]/div[1]/div[2]/div[2]/span[1]/text()').get().replace("碼齡", "")
            nick_name = response_html.xpath('//*[@id="uid"]/span/text()').get()
            # One card per article in the blog list
            card_coms = response_html.xpath('//*[@id="articleMeList-blog"]/div[2]/div[*]')
            for card_com in card_coms:
                title = "".join(card_com.xpath('h4/a/text()').getall()).strip()
                url = card_com.xpath('h4/a/@href').get()
                view_count = card_com.xpath('div/p/span[2]/text()').get()
                comment_count = card_com.xpath('div/p/span[3]/text()').get()
                yield {
                    "rankType": self.rank_type,
                    "nickName": nick_name,
                    "codeAge": age,
                    "articleTitle": title,
                    "commentCount": comment_count,
                    "viewCount": view_count,
                    "articleUrl": url}
        except Exception as e:
            logging.warning("訪問出現問題:{0}".format(str(e)))
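Note that run order matters: the article spiders read the CSVs the author spiders write, so they must run strictly after them. One way to enforce this is to invoke scrapy crawl sequentially; of the spider names below, only 'content' and 'week_article' appear in the code, the other four are assumptions:

import subprocess

# Run the six spiders in dependency order, one at a time
for spider in ["summary", "content",           # hot-list article spiders
               "week_author", "week_article",  # weekly: authors first, then articles
               "all_author", "all_article"]:   # overall: same ordering
    subprocess.run(["scrapy", "crawl", spider], check=True)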
Crawl output
Six files are produced (the actual column headers are written in Chinese, matching the code above; a quick sanity check follows the list):
SummaryRankArticle.csv
title, comment count, favorite count, view count, article link
ContentRankArticle.csv
knowledge domain, title, comment count, favorite count, view count, article link
WeekRankAuthor.csv
blogger, homepage link
WeekRankArticle.csv
blogger, coding age, title, comment count, view count, article link
AllRankAuthor.csv
blogger, homepage link
AllRankArticle.csv
blogger, coding age, title, comment count, view count, article link
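A quick way to eyeball the results before the word-cloud step, assuming pandas is available and the crawl ran on 2021-07-19 (the date used by the analysis script below):

import pandas as pd

df = pd.read_csv("20210719/ContentRankArticle.csv")
print(df.head())
# Top 10 by views; assumes the 瀏覽數 column parsed as a number
print(df.sort_values("瀏覽數", ascending=False).head(10))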
Analyzing title hot words
Segment the titles with jieba, clean the tokens, and then draw a word cloud with wordcloud:
#coding=utf-8
import matplotlib.pyplot as plt
import jieba
from wordcloud import WordCloud


def analyze(file_path, result_file_name):
    # Stop words to drop from the segmented titles
    exclude_words = ['(', ')', '(', ')', '的', '了', '與', '和', '中', '你', '我']
    # Collect every title (first CSV column)
    titles = []
    with open(file_path, 'r') as lines:
        next(lines)  # skip the header row
        for line in lines:
            titles.append(line.split(",")[0].strip())
    # Segment the concatenated titles in full mode
    segments = list(jieba.cut("".join(titles), cut_all=True))
    # Drop the stop words; wordcloud expects whitespace-separated text
    result = " ".join(filter(lambda w: w not in exclude_words, segments))
    # Draw the word cloud; a CJK-capable font is required
    wc = WordCloud(
        font_path='/Library/Fonts/Arial Unicode.ttf',
        background_color='white',
        max_font_size=100,
        min_font_size=10,
        max_words=200)
    wc.generate(result)
    wc.to_file('{0}.png'.format(result_file_name))
    # Display the image
    plt.figure(result_file_name)
    plt.imshow(wc)
    plt.axis('off')
    plt.show()


if __name__ == '__main__':
    date = '20210719'
    file_dict = [
        {"rank_type": "AllRankArticle", "result_name": "綜合熱門博主文章熱詞"},
        {"rank_type": "ContentRankArticle", "result_name": "領域熱榜文章熱詞"},
        {"rank_type": "SummaryRankArticle", "result_name": "綜合熱榜文章熱詞"},
        {"rank_type": "WeekRankArticle", "result_name": "周熱榜博主文章熱詞"}]
    for file in file_dict:
        file_path = "{0}/{1}.csv".format(date, file["rank_type"])
        result_file_name = file["result_name"]
        analyze(file_path, result_file_name)
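Note that cut_all=True runs jieba in full mode, which emits every dictionary word it can find, overlaps included. That inflates counts for compound terms but also surfaces more candidate hot words. A quick comparison of the two modes (the sample title is made up; the exact tokens depend on jieba's bundled dictionary):

import jieba

title = "Python面試題大全"
print(list(jieba.cut(title)))                # precise mode: fewer, longer tokens
print(list(jieba.cut(title, cut_all=True)))  # full mode, as used above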
Analysis results
The script produces four word-cloud images, one per list:
Domain hot-ranking article hot words
Overall hot-ranking article hot words
Weekly hot-ranking bloggers' article hot words
Overall popular bloggers' article hot words
Conclusions
Putting the word clouds together, the following writing directions may serve as a reference:
Programming-language content can lean toward Python and Java.
Highbrow deep dives find few readers; beginner-level installation and usage tutorials, or ready-to-use feature implementations, tend to do better.
Interview questions are an evergreen hot search keyword; readers are pragmatic and results-oriented.