
Java Implementation of the Weibo Hot Search Ranking: Weibo Hot Search Data

Published: 2023/12/10


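------ Design of the Topic-Focused Web Crawler ------

1. Crawler name: Weibo hot search crawler

2. Content crawled: Weibo hot search data. Data characteristics: the fields of each entry (rank, title, heat) are closely related to one another.

3. Overview of the crawler design:

Approach: fetch the page source, extract the data with XPath expressions, save it, then clean and process it, and finally analyse and visualise it.

Technical difficulties: because each stage feeds the next, an error in the middle of the pipeline can force the rest of the code to be reworked. The hot search list is updated in real time, so figures generated at different times will not match each other.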
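One way to handle the real-time update problem described above is to save each crawl under a timestamped file name, so that every figure can be traced back to the snapshot it was drawn from. The sketch below assumes the rows have already been scraped; `save_snapshot`, the `weibo_hot_` file prefix and the sample rows are hypothetical names chosen for illustration and are not part of the original program.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path


def save_snapshot(rows, out_dir, now=None):
    """Write (rank, title, heat) rows to a CSV named after the crawl time."""
    now = now or datetime.now()
    path = Path(out_dir) / f"weibo_hot_{now:%Y%m%d_%H%M%S}.csv"
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["rank", "title", "heat"])
        writer.writerows(rows)
    return path


# Made-up rows standing in for one crawl
sample_rows = [(1, "topic A", 523.4), (2, "topic B", 410.0)]
snapshot_dir = tempfile.mkdtemp()
snapshot = save_snapshot(sample_rows, snapshot_dir, now=datetime(2023, 12, 10, 9, 30, 0))
print(snapshot.name)  # weibo_hot_20231210_093000.csv
```

Passing the crawl time in explicitly keeps the function easy to test; in normal use `now` can be left out so the current time is used.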

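------ Structure of the Target Page ------

1. Structure and features of the target page: all of the data to be crawled sits inside the element 'div.cc-cd-cb nano has-scrollbar'; titles are in 'span.t' and heat values are in 'span.e'.

2. HTML page parsing:

3. Node (tag) lookup and traversal: the tags are located and traversed with XPath expressions.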
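To make the lookup concrete, the sketch below runs the same two XPath queries against a small hand-written fragment. The markup and its values are invented to mirror the class names listed above, and the standard library's `xml.etree.ElementTree` stands in for `lxml` so the example has no dependencies. ElementTree supports only a subset of XPath and has no `text()` step, so the text is read from each element's `.text` attribute instead.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment imitating the structure described above (made-up values)
FRAGMENT = """
<div id="node-1">
  <div class="cc-cd-cb nano has-scrollbar">
    <a><span class="s">1</span><span class="t">topic A</span><span class="e">523.4万</span></a>
    <a><span class="s">2</span><span class="t">topic B</span><span class="e">410.0万</span></a>
  </div>
</div>
"""

root = ET.fromstring(FRAGMENT.strip())

# The predicate [@class='t'] compares against the whole attribute value
titles = [span.text for span in root.findall(".//span[@class='t']")]
heats = [span.text for span in root.findall(".//span[@class='e']")]

for title, heat in zip(titles, heats):
    print(title, heat)
```

Because the predicate compares the full attribute value, `[@class='t']` would not match an element whose class attribute lists several names, such as the outer `div` in this fragment.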

------ Crawler Program Design ------

1. Data crawling and collection:

import requests
from lxml import etree
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from scipy.optimize import leastsq
import scipy.stats as sts
import seaborn as sns

url = "https://tophub.today/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362'}

# Send the GET request and decode the response body as UTF-8
html = requests.get(url, headers=headers)
html = html.content.decode('utf-8')

# Build an element tree that can be queried with XPath
html = etree.HTML(html)

# Collect the titles and heat values found under the 'node-1' block
div = html.xpath("//div[@id='node-1']/div")
titles, numbers = [], []
for a in div:
    titles.extend(a.xpath(".//span[@class='t']/text()"))
    numbers.extend(a.xpath(".//span[@class='e']/text()"))

# Keep the first 25 entries; [:-1] drops the trailing unit character from the heat string
b = []
for i in range(25):
    b.append([i + 1, titles[i], numbers[i][:-1]])

# Columns: rank, today's hot search, heat (in units of 10,000)
file = pd.DataFrame(b, columns=['排名', '今日熱搜', '熱度(單位為萬)'])
print(file)

file.to_csv('微博熱搜榜熱度數據.csv')
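The slice `numbers[i][:-1]` in the code above removes the last character of each heat string, which assumes every value ends in a single unit character, and `range(25)` assumes at least 25 entries were found. A slightly more defensive sketch is shown below; `parse_heat`, `build_rows` and the sample strings are hypothetical and only illustrate the idea.

```python
import re


def parse_heat(text):
    """Return the leading number in a heat string such as '523.4万', or None."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


def build_rows(titles, heats, limit=25):
    """Pair titles with parsed heat values, keeping at most `limit` rows."""
    rows = []
    for rank, (title, heat) in enumerate(zip(titles, heats), start=1):
        if rank > limit:
            break
        rows.append([rank, title.strip(), parse_heat(heat)])
    return rows


# Made-up values: the last heat string contains no number at all
rows = build_rows(["topic A", "topic B", "topic C"], ["523.4万", "88万", "-"])
print(rows)
```

Using `zip` means a short list simply produces fewer rows instead of raising an `IndexError`, and storing the heat as a float avoids relying on the CSV round trip to convert it.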

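2. Data cleaning and processing:

# Load the saved CSV file
df = pd.read_csv('微博熱搜榜熱度數據.csv')
df.head()

# Drop the title column, which is not needed for the numeric analysis
df.drop('今日熱搜', axis=1, inplace=True)
df.head()

# Count the missing values in each column
df.isnull().sum()

# Show any rows that contain a missing value
df[df.isnull().any(axis=1)]

# Descriptive statistics
df.describe()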
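For readers who want to see what the two missing-value checks compute, here is a dependency-free sketch that does the same counting with the standard `csv` module on a small made-up table. The data and the helper name `count_missing` are assumptions for illustration; in the real program `df.isnull().sum()` does this in one call.

```python
import csv
import io

# Made-up CSV text; the heat value of the second row is missing
CSV_TEXT = """rank,heat
1,523.4
2,
3,301.2
"""


def count_missing(rows, columns):
    """Count empty cells per column, similar in spirit to df.isnull().sum()."""
    counts = {name: 0 for name in columns}
    for row in rows:
        for name in columns:
            if not (row[name] or "").strip():
                counts[name] += 1
    return counts


reader = csv.DictReader(io.StringIO(CSV_TEXT))
rows = list(reader)
columns = reader.fieldnames

missing = count_missing(rows, columns)
# Rows with at least one empty cell, like filtering on isnull().any(axis=1)
incomplete = [row for row in rows if any(not (row[c] or "").strip() for c in columns)]

print(missing)
print(incomplete)
```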

3. Data analysis and visualisation:

# Correlation coefficients between the numeric columns
df.corr()

# Scatter plot of rank against heat with a fitted linear regression line
sns.lmplot(x='排名', y='熱度(單位為萬)', data=df)

# Bar chart of heat by rank
def one():
    x = df['排名']
    y = df['熱度(單位為萬)']
    plt.xlabel('排名')
    plt.ylabel('熱度(單位為萬)')
    plt.bar(x, y)
    plt.title("繪制排名與熱度條形圖")
    plt.show()

one()

# Line chart of heat by rank, with the data points marked
def two():
    x = df['排名']
    y = df['熱度(單位為萬)']
    plt.xlabel('排名')
    plt.ylabel('熱度(單位為萬)')
    plt.plot(x, y)
    plt.scatter(x, y)
    plt.title("繪制排名與熱度折線圖")
    plt.show()

two()

# Scatter plot of heat by rank
def three():
    x = df['排名']
    y = df['熱度(單位為萬)']
    plt.xlabel('排名')
    plt.ylabel('熱度(單位為萬)')
    plt.scatter(x, y, color="red", label=u"熱度分布數據", linewidth=2)
    plt.title("繪制排名與熱度散點圖")
    plt.legend()
    plt.show()

three()
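The coefficient that `df.corr()` reports for a pair of columns is, by default, the Pearson correlation. The sketch below computes it directly from its definition on made-up rank and heat values, to show why a list in which heat falls as rank increases gives a value close to -1. The function name `pearson` and the numbers are illustrative assumptions, not scraped data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


ranks = [1, 2, 3, 4, 5]
heats = [500.0, 420.0, 330.0, 260.0, 150.0]  # made-up values
r = pearson(ranks, heats)
print(round(r, 3))
```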

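4. Based on the relationship in the data, analyse the correlation between the two variables, draw a scatter plot, and fit a regression equation:

Linear regression (first-degree polynomial):

def main():
    # 'number' stands in for '熱度(單位為萬)', because the original column name caused problems at run time
    colnames = ["排名", "今日熱搜", "number"]
    # Skip the header row and supply the shorter column names
    f = pd.read_csv('微博熱搜榜熱度數據.csv', skiprows=1, names=colnames)
    X = f.排名
    Y = f.number

    # Model: a straight line y = k*x + b
    def func(p, x):
        k, b = p
        return k * x + b

    # Residuals between the model and the observed heat values
    def error_func(p, x, y):
        return func(p, x) - y

    # Least-squares fit starting from the initial guess p0
    p0 = [1, 20]
    Para = leastsq(error_func, p0, args=(X, Y))
    k, b = Para[0]
    print("k=", k, "b=", b)

    # Plot the data points and the fitted line
    plt.figure(figsize=(8, 6))
    plt.scatter(X, Y, color="green", label=u"熱度分布", linewidth=2)
    x = np.linspace(0, 30, 25)
    y = k * x + b
    plt.plot(x, y, color="red", label=u"回歸方程直線", linewidth=2)
    plt.title("微博今日熱搜排名和熱度關係圖")
    plt.xlabel('排名')
    plt.ylabel('熱度(單位為萬)')
    plt.legend()
    plt.show()

main()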
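For a straight line, the least-squares problem that `leastsq` solves iteratively also has a closed-form solution: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the mean point. The sketch below applies those formulas to made-up data as a way to sanity-check the printed `k` and `b`; `fit_line` is a hypothetical helper, not part of the original program.

```python
def fit_line(xs, ys):
    """Return slope k and intercept b minimising sum((k*x + b - y)**2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    k = sxy / sxx
    b = mean_y - k * mean_x
    return k, b


ranks = [1, 2, 3, 4, 5]
heats = [500.0, 420.0, 330.0, 260.0, 150.0]  # made-up values
k, b = fit_line(ranks, heats)
print("k=", k, "b=", b)
```

If the values printed by `main()` differ noticeably from the closed-form result on the same data, the usual suspects are a wrong column being read or the fit stopping before it converged.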

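Quadratic regression (second-degree polynomial):

def four():
    # 'number' stands in for '熱度(單位為萬)', because the original column name caused problems at run time
    colnames = ["排名", "今日熱搜", "number"]
    # Skip the header row and supply the shorter column names
    f = pd.read_csv('微博熱搜榜熱度數據.csv', engine='python', skiprows=1, names=colnames)
    X = f.排名
    Y = f.number

    # Model: a parabola y = a*x^2 + b*x + c
    def func(p, x):
        a, b, c = p
        return a * x * x + b * x + c

    # Residuals between the model and the observed heat values
    def error_func(p, x, y):
        return func(p, x) - y

    # Least-squares fit starting from the initial guess p0
    p0 = [0, 0, 0]
    Para = leastsq(error_func, p0, args=(X, Y))
    a, b, c = Para[0]

    # Plot the data points and the fitted curve
    plt.figure(figsize=(10, 6))
    plt.scatter(X, Y, color="green", label=u"熱度分布", linewidth=2)
    x = np.linspace(0, 30, 25)
    y = a * x * x + b * x + c
    plt.plot(x, y, color="red", label=u"回歸方程曲線", linewidth=2)
    plt.title("微博今日熱搜排名和熱度關係圖")
    plt.xlabel('排名')
    plt.ylabel('熱度(單位為萬)')
    plt.legend()
    plt.show()

four()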
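To compare the straight-line and quadratic fits numerically rather than by eye, one common measure is the coefficient of determination, R². The sketch below is a plain-Python version evaluated on made-up data; `r_squared` and the example coefficients are assumptions for illustration, and in practice the coefficients returned by `leastsq` would be passed in.

```python
def r_squared(xs, ys, predict):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Made-up data that follows y = 2x^2 - 30x + 200 exactly
xs = [1, 2, 3, 4, 5]
ys = [2 * x * x - 30 * x + 200 for x in xs]

# The quadratic reproduces the data; the line is the least-squares line for it
quadratic = r_squared(xs, ys, lambda x: 2 * x * x - 30 * x + 200)
linear = r_squared(xs, ys, lambda x: -18 * x + 186)

print(quadratic, round(linear, 4))
```

A value closer to 1 means the curve explains more of the variation in heat; on real data the quadratic can never score lower than the line, so a small gain may not justify the extra parameter.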

5. Complete code:

import requests
from lxml import etree
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from scipy.optimize import leastsq
import scipy.stats as sts
import seaborn as sns

url = "https://tophub.today/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362'}
html = requests.get(url, headers=headers)  # send the GET request
# print(html.text)  # inspect the page source
html = html.content.decode('utf-8')  # decode the response body
html = etree.HTML(html)  # build an object that can be queried with XPath

div = html.xpath("//div[@id='node-1']/div")
titles, numbers = [], []
for a in div:  # traverse the tags
    titles.extend(a.xpath(".//span[@class='t']/text()"))  # XPath expression for the titles
    numbers.extend(a.xpath(".//span[@class='e']/text()"))  # XPath expression for the heat values

b = []  # create an empty list
for i in range(25):
    b.append([i + 1, titles[i], numbers[i][:-1]])  # keep the first 25 entries
file = pd.DataFrame(b, columns=['排名', '今日熱搜', '熱度(單位為萬)'])
# print(file)
file.to_csv('微博熱搜榜熱度數據.csv', encoding='gbk')  # save the file so the data persists

# Read the CSV file (same encoding as it was written with)
df = pd.read_csv("微博熱搜榜熱度數據.csv", engine='python', encoding='gbk')
df.head()

# Drop the column that is not needed
df.drop('今日熱搜', axis=1, inplace=True)
df.head()

# Null check: all zeros means there are no null values
df.isnull().sum()

# Missing-value check: an empty result means no row has a missing value
df[df.isnull().any(axis=1)]

# Show descriptive statistics with describe()
df.describe()

# Show the correlation coefficients between the columns with corr()
df.corr()

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly with that font

# Use seaborn.lmplot() to show the linear relationship between rank and heat
sns.lmplot(x='排名', y='熱度(單位為萬)', data=df)

# Bar chart of heat by rank
def one():
    x = df['排名']
    y = df['熱度(單位為萬)']
    plt.xlabel('排名')
    plt.ylabel('熱度(單位為萬)')
    plt.bar(x, y)
    plt.title("繪制排名與熱度條形圖")
    plt.show()

one()

# Line chart of heat by rank
def two():
    x = df['排名']
    y = df['熱度(單位為萬)']
    plt.xlabel('排名')
    plt.ylabel('熱度(單位為萬)')
    plt.plot(x, y)
    plt.scatter(x, y)
    plt.title("繪制排名與熱度折線圖")
    plt.show()

two()

# Scatter plot of heat by rank
def three():
    x = df['排名']
    y = df['熱度(單位為萬)']
    plt.xlabel('排名')
    plt.ylabel('熱度(單位為萬)')
    plt.scatter(x, y, color="red", label=u"熱度分布數據", linewidth=2)
    plt.title("繪制排名與熱度散點圖")
    plt.legend()
    plt.show()

three()

# Fit and plot the linear (first-degree) equation
def main():
    # 'number' stands in for '熱度(單位為萬)', because the original column name caused problems at run time
    colnames = ["排名", "今日熱搜", "number"]
    f = pd.read_csv('微博熱搜榜熱度數據.csv', engine='python', encoding='gbk', skiprows=1, names=colnames)
    X = f.排名
    Y = f.number

    def func(p, x):
        k, b = p
        return k * x + b

    def error_func(p, x, y):
        return func(p, x) - y

    p0 = [1, 20]
    Para = leastsq(error_func, p0, args=(X, Y))
    k, b = Para[0]
    print("k=", k, "b=", b)

    plt.figure(figsize=(8, 6))
    plt.scatter(X, Y, color="green", label=u"熱度分布", linewidth=2)
    x = np.linspace(0, 30, 25)
    y = k * x + b
    plt.plot(x, y, color="red", label=u"回歸方程直線", linewidth=2)
    plt.title("微博今日熱搜排名和熱度關係圖")
    plt.xlabel('排名')
    plt.ylabel('熱度(單位為萬)')
    plt.legend()
    plt.show()

main()

# Fit and plot the quadratic (second-degree) equation
def four():
    # 'number' stands in for '熱度(單位為萬)', because the original column name caused problems at run time
    colnames = ["排名", "今日熱搜", "number"]
    f = pd.read_csv('微博熱搜榜熱度數據.csv', engine='python', encoding='gbk', skiprows=1, names=colnames)
    X = f.排名
    Y = f.number

    def func(p, x):
        a, b, c = p
        return a * x * x + b * x + c

    def error_func(p, x, y):
        return func(p, x) - y

    p0 = [0, 0, 0]
    Para = leastsq(error_func, p0, args=(X, Y))
    a, b, c = Para[0]

    plt.figure(figsize=(10, 6))
    plt.scatter(X, Y, color="green", label=u"熱度分布", linewidth=2)
    x = np.linspace(0, 30, 25)
    y = a * x * x + b * x + c
    plt.plot(x, y, color="red", label=u"回歸方程曲線", linewidth=2)
    plt.title("微博今日熱搜排名和熱度關係圖")
    plt.xlabel('排名')
    plt.ylabel('熱度(單位為萬)')
    plt.legend()
    plt.show()

four()

Summary: the fitted regression plots make the relationship between hot search rank and heat easier to see at a glance.
