Extracting Neihan Duanzi Jokes with Regular Expressions
Scraping Neihan Duanzi with regular expressions:

"""
Neihan Duanzi scraper
https://www.neihan8.com/article/index.html
"""
import csv
import os
import re
from urllib import error, request

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"}
FLAGS = re.I | re.S | re.M
DATA_DIR = './data1'


def fetch(url):
    """Download url with a browser-like User-Agent and decode it as UTF-8."""
    req = request.Request(url, headers=HEADERS)
    with request.urlopen(req) as response:
        return response.read().decode("utf-8", 'ignore')


def append_row(filename, row):
    """Append one row to a CSV file; newline='' avoids blank lines on Windows."""
    with open(filename, 'a', encoding='utf-8', newline='') as file:
        csv.writer(file).writerow(row)


def neihanba(url, beginPage, endPage):
    # Create the output directory if it does not exist yet
    os.makedirs(DATA_DIR, exist_ok=True)
    # endPage + 1 so that the end page entered by the user is included
    for page in range(beginPage, endPage + 1):
        if page <= 1:
            fullurl = url + "index.html"
        else:
            fullurl = url + "index_%s" % page + ".html"
        try:
            resHtml = fetch(fullurl)

            # Joke titles
            title_pattern = re.compile(r'<h3><a .*?>(.*?)</a></h3>', FLAGS)
            joketitle = title_pattern.findall(resHtml)

            # Joke summaries
            content_pattern = re.compile(r'<div class="desc">(.*?)</div>', FLAGS)
            jokecontent = content_pattern.findall(resHtml)

            # One CSV row per joke: pair each title with its summary
            for title, content in zip(joketitle, jokecontent):
                append_row(DATA_DIR + '/neihanba.csv', [title, content.strip()])

            # Joke URLs
            url_pattern = re.compile(r'<h3><a href="(.*?)" .*?>.*?</a></h3>', FLAGS)
            jurl = url_pattern.findall(resHtml)
            for i in jurl:
                jokefullurl = "https://www.neihan8.com" + i
                detailHtml = fetch(jokefullurl)

                # Joke title on the detail page
                jokecontitle_pattern = re.compile(r'<h1 class="title">(.*?)</h1>', FLAGS)
                jokecontitle_content = jokecontitle_pattern.findall(detailHtml)
                joke_content_title = jokecontitle_content[-1] if jokecontitle_content else ''

                # Joke body: each <p> element is one paragraph
                joke_pattern = re.compile(r'<p>(.*?)</p>', FLAGS)
                joke_content = joke_pattern.findall(detailHtml)
                # As in the original, the last two <p> elements are skipped
                for s in range(len(joke_content) - 2):
                    append_row(DATA_DIR + '/neihanba1.csv', [joke_content_title, joke_content[s]])
        except error.URLError as e:
            print(e)


if __name__ == "__main__":
    # This mapping only covers http:// URLs; requests to https:// URLs
    # bypass the proxy unless an "https" entry is added.
    proxy = {"http": "118.31.220.3:8080"}
    proxy_support = request.ProxyHandler(proxy)
    opener = request.build_opener(proxy_support)
    request.install_opener(opener)

    beginPage = int(input("Enter the start page: "))
    endPage = int(input("Enter the end page: "))
    url = "https://www.neihan8.com/article/"
    neihanba(url, beginPage, endPage)
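To check the three list-page patterns without touching the network, they can be run against a small HTML string. The sketch below is a minimal illustration under assumptions: `SAMPLE_HTML` is a hypothetical fragment written to match the markup the patterns expect, not a copy of a real neihan8.com page, and `extract_jokes` and `to_csv` are helper names introduced here only for the example.

```python
import csv
import io
import re

# Hypothetical fragment imitating the list-page markup the patterns expect.
SAMPLE_HTML = """
<h3><a href="/article/1001.html" class="title">First joke</a></h3>
<div class="desc">
  Summary of the first joke.
</div>
<h3><a href="/article/1002.html" class="title">Second joke</a></h3>
<div class="desc">Summary of the second joke.</div>
"""

# re.S lets "." match newlines, so a summary may span several lines.
FLAGS = re.I | re.S
TITLE_RE = re.compile(r'<h3><a .*?>(.*?)</a></h3>', FLAGS)
URL_RE = re.compile(r'<h3><a href="(.*?)" .*?>.*?</a></h3>', FLAGS)
DESC_RE = re.compile(r'<div class="desc">(.*?)</div>', FLAGS)


def extract_jokes(html):
    """Return a list of (title, url, summary) tuples found in html."""
    titles = TITLE_RE.findall(html)
    urls = URL_RE.findall(html)
    summaries = [s.strip() for s in DESC_RE.findall(html)]
    # zip pairs the n-th title with the n-th URL and the n-th summary
    return list(zip(titles, urls, summaries))


def to_csv(rows):
    """Serialize rows to CSV text in memory instead of writing a file."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


jokes = extract_jokes(SAMPLE_HTML)
print(to_csv(jokes))
```

Lazy quantifiers (`.*?`) keep each match inside a single `<h3>` or `<div>` element. If the real page nests another `</div>` inside the summary block, the summary pattern would stop early, so an HTML parser may be the safer choice for markup that is not this regular.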