Industry Big Data Web-Crawler Exercises
1. Use a regular expression to find the title in the content of the page https://www.cnki.net/ fetched with the requests library.
import requests
import re

r = requests.get('https://www.cnki.net/')
r.encoding = 'utf-8'
# print(r.text)
titles = re.findall(r"<title.*?>(.+?)</title>", r.text)
print(titles)

2. Use XPath and Beautiful Soup to find the hot words on https://weixin.sogou.com/ and save them as a JSON document.
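Since the live page content changes, the regex can be checked offline against a small HTML sample. This sketch uses a made-up snippet standing in for a downloaded page; the pattern is the same non-greedy one used above:

```python
import re

# Hypothetical HTML snippet standing in for the fetched page
sample = '<html><head><title lang="zh">中国知网</title></head><body></body></html>'

# <title ...> is matched lazily up to the first '>', the group stops
# at the first closing </title>
titles = re.findall(r"<title.*?>(.+?)</title>", sample)
print(titles)  # ['中国知网']
```

The non-greedy `.*?` matters: with a greedy `.*` the match could run past the intended closing tag if the page contained the sequence `</title>` more than once.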
# (1) XPath method
import requests
from lxml import etree
from json import dumps

url = "https://weixin.sogou.com/"
html = requests.get(url)
html.encoding = "utf-8"
selecter = etree.HTML(html.text)
hot_words = {}  # use a plain empty dict; don't shadow the built-in name dict
for i in range(1, 11):
    ppath = '//*[@id="topwords"]/li[' + str(i) + ']/a/text()'
    s = selecter.xpath(ppath)
    hot_words[i] = s[0]
    print(s[0])
print(hot_words)
dictJson = dumps(hot_words, indent=4, ensure_ascii=False)
print(dictJson)
# Write out to a JSON file
with open("result.txt", "w", encoding="utf-8") as fp:
    fp.write(dictJson)
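The positional XPath can likewise be verified without network access by parsing a local snippet that mimics the `topwords` list. The markup and hot words below are invented for the test; only the XPath shape matches the code above:

```python
from lxml import etree

# Hypothetical stand-in for Sogou's "topwords" hot-word markup
snippet = """
<ol id="topwords">
  <li><a href="#">热词一</a></li>
  <li><a href="#">热词二</a></li>
</ol>
"""
selecter = etree.HTML(snippet)
words = []
for i in range(1, 3):
    # Same positional predicate as above, just over a 2-item list
    s = selecter.xpath('//*[@id="topwords"]/li[' + str(i) + ']/a/text()')
    words.append(s[0])
print(words)  # ['热词一', '热词二']
```

Note that `etree.HTML` tolerates fragments (it wraps them in `<html><body>`), so `//*[@id="topwords"]` still resolves.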
# (2) Beautiful Soup method
import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen("https://weixin.sogou.com/") as url:
    res = url.read()
soup = BeautifulSoup(res, "html.parser")
book_div = soup.find(attrs={"id": "topwords"})
i = 0
for t in book_div.find_all('a'):
    i += 1
    print(i, t.get('title'))
# Writing the JSON output works the same way as in the lxml version above

Installing the lxml library: https://blog.csdn.net/weixin_45203459/article/details/102577999
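The Beautiful Soup lookup and the JSON export can be combined and tested on the same kind of local snippet (again, the markup and hot words are invented; on the real page the `title` attributes carry the hot-word text):

```python
import json
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the hot-word list
snippet = ('<ol id="topwords">'
           '<li><a title="热词一" href="#"></a></li>'
           '<li><a title="热词二" href="#"></a></li>'
           '</ol>')
soup = BeautifulSoup(snippet, "html.parser")
book_div = soup.find(attrs={"id": "topwords"})
# Number the words 1..n, mirroring the dict built in the XPath version
words = {i + 1: a.get('title') for i, a in enumerate(book_div.find_all('a'))}
print(json.dumps(words, indent=4, ensure_ascii=False))
```

`ensure_ascii=False` keeps the Chinese characters readable in the output instead of emitting `\uXXXX` escapes.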
Summary
 
                            