Python: a novel-title crawler for a certain "-jiang" Literature City + simple data analysis + word cloud visualization

Published: 2023/12/18 | python | 豆豆

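1. Introduction

Target site: the book library of a certain "-jiang" Literature City
Crawler tools: BeautifulSoup, requests
Data analysis: pandas, matplotlib
Word cloud: wordcloud, jieba, re

PS. Rumor has long had it that the site's owner is very tight-fisted: only three programmers, only a handful of servers, and an office inside a residential compound. When crawling, please slow the request rate with sleep to reduce the load on the server.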
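One way to slow the crawl down is a small helper that pauses between requests. The sketch below is only an illustration: the 3-second base matches the time.sleep(3) used in the complete code in section 2.4, while the random jitter and the name polite_sleep are assumptions added here.

```python
import random
import time


def polite_sleep(base_seconds=3.0, jitter_seconds=1.0, sleep=time.sleep):
    """Pause for base_seconds plus a random jitter and return the delay used.

    The `sleep` argument exists only so the pause can be replaced in tests.
    """
    delay = base_seconds + random.uniform(0.0, jitter_seconds)
    sleep(delay)
    return delay
```

Calling polite_sleep() once per page keeps the request rate at roughly one page every three to four seconds.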

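2. Crawler

2.1 URL parsing

On the library home page, tick a few options and watch how the URL changes (note the parts marked with 【】):

Orientation: romance (言情), sorted by publication time, completed works only. Resulting URL:
https://www.jjwxc.net/bookbase.php?fw0=0&fbsj0=0&ycx0=0&【xx1=1】&mainview0=0&sd0=0&lx0=0&fg0=0&bq=&removebq=&【sortType=3】&collectiontypes=ors&searchkeywords=&【page=0】&【isfinish=2】
Orientation: pure love (纯爱), sorted by number of collections, no restriction on completion status, jumped to page 4. Resulting URL:
https://www.jjwxc.net/bookbase.php?fw0=0&fbsj0=0&ycx0=0&【xx2=2】&mainview0=0&sd0=0&lx0=0&fg0=0&bq=&removebq=&【sortType=4】&【page=4】&【isfinish=0】&collectiontypes=ors&searchkeywords=

The parameters worked out from these URLs:

Page number: page, starting from 1 (0 also returns the first page). The upper limit is 1000.
Orientation: romance (言情) is xx1=1, pure love (纯爱) is xx2=2, yuri (百合) is xx3=3. To select several orientations at once, include all of the corresponding parameters.
Sort order: sortType, with update time = 1, work points = 2, publication time = 3, collections = 4.
Completion status: isfinish, with no restriction = 0, ongoing = 1, completed = 2.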
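The parameter summary above can be turned into a small URL builder. This is a minimal sketch: the fixed parameters are copied from the first observed URL, the function name build_url and its defaults are assumptions, and the behaviour of the site for other parameter combinations has not been checked.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.jjwxc.net/bookbase.php"

# Parameters that did not change between the two observed URLs
FIXED_PARAMS = {
    "fw0": 0, "fbsj0": 0, "ycx0": 0, "mainview0": 0, "sd0": 0,
    "lx0": 0, "fg0": 0, "bq": "", "removebq": "",
    "collectiontypes": "ors", "searchkeywords": "",
}


def build_url(page, orientations=(1,), sort_type=3, is_finish=2):
    """Build a library URL from the parameters summarised above.

    orientations: iterable of ints (1 romance, 2 pure love, 3 yuri);
    each value n is sent as the parameter xx{n}={n}.
    """
    params = dict(FIXED_PARAMS)
    for n in orientations:
        params[f"xx{n}"] = n
    params.update({"sortType": sort_type, "isfinish": is_finish, "page": page})
    return f"{BASE_URL}?{urlencode(params)}"


print(build_url(4, orientations=(2,), sort_type=4, is_finish=0))
```

Building the query string with urlencode avoids hand-editing a long f-string whenever a filter changes.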

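2.2 Page element parsing

[Figure: the data to be scraped on a library page]

Press F12 to open the developer tools and inspect the page elements. All of the information sits in a single table (class="cytable"): each row is a tr and each cell is a td.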
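To see what this structure yields, the following sketch parses a tiny hand-written table with BeautifulSoup. The HTML sample and the function name parse_rows are made up for illustration; the real page has more columns.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real page (illustrative only)
SAMPLE_HTML = """
<table class="cytable">
  <tr><td>author</td><td>title</td></tr>
  <tr><td> Author A </td><td><a href="#">Title A</a></td></tr>
  <tr><td>Author B</td><td>Title B</td></tr>
</table>
"""


def parse_rows(document):
    """Return the text of every cell, row by row, skipping the header row."""
    soup = BeautifulSoup(document, "html.parser")
    table = soup.find("table", class_="cytable")
    if table is None:  # e.g. a login page came back instead of the listing
        return []
    rows = table.find_all("tr")[1:]
    return [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows]


print(parse_rows(SAMPLE_HTML))
```

Returning an empty list when the table is missing lets the caller treat an unexpected page the same way as the end of the results.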

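2.3 Login

Jumping to any page beyond page 10 brings up a screen asking you to log in.

After logging in to a Jinjiang account, press F12 to open the developer tools, switch to the Network tab, refresh the page, find the corresponding request, copy the cookie from its headers, and add it to the crawler's request headers.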
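Since the cookie is tied to a personal account, it may be safer not to paste it into the script. The sketch below reads it from an environment variable instead; the variable name JJWXC_COOKIE and the function build_headers are hypothetical, and the User-Agent string is the one used in the complete code in section 2.4.

```python
import os

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"
)


def build_headers(env=os.environ):
    """Build request headers, reading the cookie from JJWXC_COOKIE (a made-up name)."""
    cookie = env.get("JJWXC_COOKIE", "")
    if not cookie:
        raise RuntimeError("Set JJWXC_COOKIE to the cookie copied from the browser")
    return {"cookie": cookie, "User-Agent": USER_AGENT}
```

The resulting dictionary can be passed as the headers argument of requests.get.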

2.4 Complete code

import time

import pandas as pd
import requests
from bs4 import BeautifulSoup


def main(save_path, sexual_orientation):
    """
    save_path: path of the output file
    sexual_orientation: 1 = romance (言情), 2 = pure love (纯爱), 3 = yuri (百合),
                        4 = matriarchal (女尊), 5 = no CP (无CP)
    """
    headers = {
        'cookie': 'YOUR_COOKIE',  # paste the cookie copied in section 2.3
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
    }
    for page in range(1, 1001):
        url = get_url(page, sexual_orientation)
        html = requests.get(url, headers=headers)
        html.encoding = html.apparent_encoding
        try:
            data = parse(html.content)
        except Exception:
            print("Failed to scrape page:", page)
            continue
        # An empty page means there are no more results
        if len(data) == 0:
            break
        df = pd.DataFrame(data)
        # Append to the output file without a header row or index column
        df.to_csv(save_path, mode='a', header=False, index=False)
        print(page)
        time.sleep(3)  # go easy on the server


def get_url(page, sexual_orientation):
    # Sorted by publication time (sortType=3), completed works only (isfinish=2)
    url = f"https://www.jjwxc.net/bookbase.php?fw0=0&fbsj0=0&ycx1=1&xx{sexual_orientation}={sexual_orientation}&mainview0=0&sd0=0&lx0=0&fg0=0&bq=-1&" \
          f"sortType=3&isfinish=2&collectiontypes=ors&page={page}"
    return url


def parse(document):
    soup = BeautifulSoup(document, "html.parser")
    table = soup.find("table", attrs={'class': 'cytable'})
    rows = table.find_all("tr")
    data_all = []
    # Skip the header row
    for row in rows[1:]:
        items = row.find_all("td")
        data = []
        for item in items:
            text = item.get_text(strip=True)
            data.append(text)
        data_all.append(data)
    return data_all


if __name__ == "__main__":
    main("言情.txt", 1)

3. Data analysis and visualization

Read the scraped data with pandas:

[Figure: preview of the loaded DataFrame]
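The article does not show the loading code, so the following is only a guess at what it might look like. The crawler writes the file without a header row, so column names have to be supplied. The names author, name, type, manner, word, points and publish_time are the ones used later in the article; their order, the extra progress column, and the sample row are assumptions.

```python
import io

import pandas as pd

# Assumed column order; "progress" is a guess and is not used elsewhere
COLUMNS = ["author", "name", "type", "manner", "progress",
           "word", "points", "publish_time"]


def load_books(path_or_buffer, columns=COLUMNS):
    """Read the header-less CSV written by the crawler into a DataFrame."""
    return pd.read_csv(path_or_buffer, header=None, names=columns)


# Made-up row standing in for the real output file
sample = io.StringIO(
    "Author A,Title A,原创-言情-架空历史-爱情,正剧,完结,123456,7890000,2019-05-01 12:00:00\n"
)
df = load_books(sample)
print(df.dtypes)
```

If the real file has a different number of columns, the names list has to be adjusted to match before the later steps will work.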

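Apply some simple preprocessing:

# Remove duplicates
df = df.drop_duplicates(subset=["author", "name"])
print("Number of works:", df.shape[0])

# Convert the publication time to datetime
df["publish_time"] = pd.to_datetime(df["publish_time"])

# Convert the word count to units of 10,000 words
df["word"] /= 10000

# Convert the points to units of 10,000
df["points"] /= 10000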

3.1 Bar chart

Check the minimum and maximum word counts:

df["word"].min(), df["word"].max()

Result: (0.0001, 616.9603) (unit: 10,000 words), so the bins start at 0 and end at 700.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Bin edges, in units of 10,000 words
bins_words = [0, 0.5, 1, 10, 20, 40, 60, 80, 100, 700]

# Word-count distribution of novels published before 2018
words_distribution1 = pd.cut(df.query("publish_time<'2018-01-01'")["word"], bins=bins_words).value_counts(sort=False)
words_distribution1 /= words_distribution1.sum()  # normalize to proportions

# Word-count distribution of novels published in 2018 or later
words_distribution2 = pd.cut(df.query("publish_time>='2018-01-01'")["word"], bins=bins_words).value_counts(sort=False)
words_distribution2 /= words_distribution2.sum()  # normalize to proportions

# Plot the two distributions side by side
plt.figure(dpi=100)
plt.title("Distribution of novel word counts", fontsize=15)
loc = np.arange(len(words_distribution1))
plt.bar(loc - 0.15, words_distribution1.values, width=0.3, label="Before 2018")
plt.bar(loc + 0.15, words_distribution2.values, width=0.3, label="2018 and later")
plt.xticks(loc, words_distribution1.index.astype(str), rotation=45)
plt.xlabel("Word count (10,000 words)")
plt.ylabel("Proportion")
plt.legend()

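3.2 Pie charts

Genre statistics.
Most genre strings have the form "原创-纯爱-架空历史-仙侠" (original, pure love, alternate history, xianxia), so they can be split on "-" and the third tag taken.
A small number of works, however, have a genre of just "随笔" (essay), "评论" (review), or "未知" (unknown). Indexing element 2 directly would raise an error for these, so a conditional expression handles them.

# Genre statistics
tags = df["type"].apply(lambda x: x.split("-")[2] if len(x.split("-")) == 4 else x)
tag_count = tags.value_counts()

# Merge the categories with too few entries into "其他" (other)
tag_count["其他"] = tag_count["未知"] + tag_count["随笔"] + tag_count["评论"] + tag_count["诗歌"] + tag_count[""]
tag_count = tag_count.drop(["未知", "随笔", "评论", "诗歌", ""])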
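The tag-extraction rule can be checked in isolation with a plain function. This is a small sketch of the same logic as the lambda; the name third_tag is introduced here only for illustration.

```python
def third_tag(genre):
    """Return the third "-"-separated tag, or the string unchanged
    when it does not consist of exactly four parts."""
    parts = genre.split("-")
    return parts[2] if len(parts) == 4 else genre


print(third_tag("原创-纯爱-架空历史-仙侠"))  # 架空历史
print(third_tag("随笔"))                    # 随笔
```

Splitting once and reusing the result also avoids calling split twice per row, as the lambda does.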
Style statistics:

# Style statistics
manner_count = df["manner"].value_counts()

# Merge the categories with too few entries into "其他" (other)
manner_count["其他"] = manner_count["暗黑"] + manner_count["爆笑"] + manner_count["未知"]
manner_count = manner_count.drop(["暗黑", "爆笑", "未知"])
Plot:

# The pie labels are Chinese, so matplotlib needs a CJK-capable font,
# e.g. plt.rcParams["font.sans-serif"] = ["SimHei"]
fig, axes = plt.subplots(1, 2, figsize=(10, 5), dpi=100)
fig.subplots_adjust(wspace=0.05)

axes[0].pie(
    tag_count,
    labels=tag_count.index,
    autopct='%1.2f%%',
    pctdistance=0.7,
    colors=[plt.cm.Set3(i) for i in range(len(tag_count))],
    textprops={'fontsize': 10},
    wedgeprops={'linewidth': 1, 'edgecolor': "black"},
)
axes[0].set_title("Genre", fontsize=15)

axes[1].pie(
    manner_count,
    labels=manner_count.index,
    autopct='%1.2f%%',
    pctdistance=0.7,
    colors=[plt.cm.Accent(i) for i in range(len(manner_count))],
    textprops={'fontsize': 10},
    wedgeprops={'linewidth': 1, 'edgecolor': "black"},
)
axes[1].set_title("Style", fontsize=15)

4. Word cloud of novel titles

from wordcloud import WordCloud
import jieba
import pandas as pd
import matplotlib.pyplot as plt
import re

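4.1 Word segmentation

Key points:
- Segment every title with the jieba library (applied to the whole column with Series.apply rather than an explicit loop).
- The raw titles contain many symbols (such as （上）, [ABO], and the asterisks left behind when words are censored) as well as English characters. These can be removed with a regular expression.
  [Figure: titles before vs. after removal]
Code:

# Add a custom word
jieba.add_word("快穿")
# Segment each title, joining the tokens with ASCII commas
words_arr = df["name"].apply(lambda x: ",".join(jieba.cut(x))).values
# Concatenate the results for all titles
text = ",".join(words_arr)
# Drop every character that is neither Chinese nor a comma (e.g. brackets, asterisks)
reg = "[^\u4e00-\u9fa5,]"
text = re.sub(reg, '', text)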
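The effect of the regular expression is easiest to see on a few hand-picked tokens. The token list below is made up to mimic segmented titles; only the pattern itself comes from the article.

```python
import re

# Keep only CJK ideographs in the range U+4E00 to U+9FA5 and ASCII commas
NON_CHINESE = re.compile("[^\u4e00-\u9fa5,]")

# Made-up tokens imitating jieba output for titles with symbols and English
tokens = ["快穿", "（", "上", "）", "[", "ABO", "]", "**", "攻略"]
text = NON_CHINESE.sub("", ",".join(tokens))
print(text)
print(text.split(","))
```

Every removed token leaves an empty string between two commas, which is why empty strings show up later in the word counts.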

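4.2 Word frequency statistics

Key points:
- Count word frequencies with a DataFrame.
- The high-frequency words include empty strings (a by-product of stripping symbols with the regular expression) and many single characters. These carry no meaning for the analysis, so they need to be removed.
[Figure: word frequencies before vs. after removal]
Code:

# Word frequency count
words_list = text.split(",")
df_freq = pd.Series(words_list).value_counts().to_frame("count")
df_freq.index.name = "word"
# Drop single characters and empty strings, keeping them as stop words for the word cloud
stop_words = df_freq[df_freq.index.str.len() <= 1].index.tolist()
df_freq = df_freq[df_freq.index.str.len() > 1]
# Plot the 20 most frequent words
plt.figure(figsize=(15, 5))
x = list(range(20))
y = df_freq["count"].iloc[:20].values
labels = df_freq.index[:20]
plt.bar(x, y, color='steelblue', width=0.5)
plt.xticks(ticks=x, labels=labels, fontsize=15, rotation=45)
plt.title("Top 20 words in Jinjiang Literature City novel titles", fontsize=15)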
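The same filtering rule can be sketched without pandas, using collections.Counter from the standard library in place of the DataFrame. The function name top_words and the sample string are made up for illustration.

```python
from collections import Counter


def top_words(text, n=20):
    """Count comma-separated tokens, ignoring empty strings and single characters."""
    counts = Counter(tok for tok in text.split(",") if len(tok) > 1)
    return counts.most_common(n)


print(top_words("快穿,,上,重生,快穿,之,重生,快穿", n=2))
```

This is handy for a quick sanity check of the counts before plotting them.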

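4.3 Word cloud visualization

Key points:
- Mask: an 8-bit grayscale image of the Jinjiang Literature City logo is used as the mask. It is essentially a two-dimensional matrix. The wordcloud library treats pure white entries (255) as masked out and draws words on all other entries, so the area to fill should be 0 (black) and the area to leave empty should be 255. Images in other formats (such as RGB) can be converted according to this rule.
  (A mask image used to be shown here, but the system kept flagging this article as "copyright unclear" for some unknown reason, so it has been taken down for now.)
- Stop words: the single characters removed during the word frequency count are used as stop words.
Code:

# Load the mask
mask = plt.imread(r"D:\2021研一上\地理信息可视化\数据\logo.bmp")
wordcloud = WordCloud(
    font_path=r"C:\Windows\Fonts\simhei.ttf",
    stopwords=set(stop_words),
    width=800,
    height=600,
    mask=mask,
    max_font_size=150,
    mode='RGBA',
    background_color=None,
).generate(text)
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111)
ax.axis("off")
ax.imshow(wordcloud, interpolation='bilinear')
plt.tight_layout(pad=4.5)
fig.savefig("wordcloud.png")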
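For an image that is not already an 8-bit grayscale mask, a conversion along the following lines may work. It is a sketch under two assumptions: that wordcloud treats pure white (255) as masked out and anything else as drawable, and that a brightness threshold of 128 separates the logo from the background. The function name to_wordcloud_mask is made up.

```python
import numpy as np


def to_wordcloud_mask(image, threshold=128):
    """Convert an RGB(A) or grayscale uint8 array to a 0/255 mask.

    Pixels darker than `threshold` become 0 (words are drawn there);
    all other pixels become 255 (left empty).
    """
    image = np.asarray(image)
    if image.ndim == 3:
        # Average the colour channels, ignoring any alpha channel
        gray = image[:, :, :3].mean(axis=2)
    else:
        gray = image.astype(float)
    return np.where(gray < threshold, 0, 255).astype(np.uint8)


# Tiny 2x2 RGB image: dark pixels on the left, light pixels on the right
rgb = np.array([[[0, 0, 0], [255, 255, 255]],
                [[30, 40, 50], [200, 210, 220]]], dtype=np.uint8)
print(to_wordcloud_mask(rgb))
```

The returned array can be passed as the mask argument of WordCloud in place of the array read with plt.imread.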
