Python crawler: fetch the books read by all of your Douban friends and list the most-read ones (simulating a Douban login)...

Published: 2025/7/14 python 20 豆豆


Main program (Python 2):

    from util import *
    import re,urllib,urllib2

    dou=douban()
    username='xxxxxxxx'
    password='xxxxxxx'
    domain='http://www.douban.com/'
    origURL='http://www.douban.com/login'
    dou.setinfo(username,password,domain,origURL)
    dou.signin()

    # Read the contact list and collect each friend's user id.
    page=dou.opener.open('http://www.douban.com/contacts/list')
    directions=re.findall(r'\s*href\s*=\s*"http://www.douban.com/people/([^/]*)',page.read())
    dir={}
    dir_book={}
    i=0
    # Using the ids as dictionary keys removes duplicates.
    for direction in directions:
        dir[direction]=0
    num=len(dir)
    i=0
    for nam, nothing in dir.items():
        name="http://book.douban.com/list/"+nam+"/collect"
        print name
        #i=i+1
        print "books of",nam,":"
        page=dou.opener.open(name)
        while True:
            p=page.read()
            books=re.findall(r'href="http://book.douban.com/subject/.*/">\s*([^<]*)',p)
            for book in books:
                print book
                # Count how many friends have read each title.
                if book in dir_book:
                    dir_book[book]=dir_book[book]+1
                else:
                    dir_book[book]=1
            # Follow the "next" link until the last page of the collection.
            ds=re.search(r'</a><span class="next"><a[\n\s]*href="(http://[^"]*)">',p)
            if ds is None:
                break
            page=dou.opener.open(ds.group(1))

    # sortDic sorts in ascending order, so the most-read titles are printed last.
    for book_name,times in sortDic(dir_book):
        print book_name+": "+str(times)
    print len(dir_book)

util.py:

    import re,urllib,cookielib,urllib2
    import socket
    socket.setdefaulttimeout(5)

    def sortDic(Dict):
        # Return (key, value) pairs sorted by value, ascending.
        return sorted(Dict.items(),key=lambda e:e[1])

    class crawl:
        def __init__(self,ini):
            self.pages=[]
            self.pages.append(ini)
            self.add=True

        def PageParser(self,page):
            print page
            directions=re.findall(r'\s*href\s*=\s*"(https?://[^"^\s^\(^\)]*)',page)
            # Stop adding new pages once more than 10 are queued.
            if len(self.pages)>10:
                self.add=False
            if self.add:
                for direccion in directions:
                    if not direccion.endswith(".js") and not direccion.endswith(".css") \
                            and not direccion.endswith(".exe") \
                            and not direccion.endswith(".pdf"):
                        self.pages.append(direccion)
            print self.pages

        def PageLoad(self,dir):
            print dir
            page=urllib.urlopen(dir)
            self.PageParser(page.read())

        def crawl_pages(self):
            while len(self.pages)!=0:
                self.PageLoad(self.pages.pop(0))

    class douban(object):
        def __init__(self):
            self.app = ''
            self.signed_in = False
            self.cj = cookielib.LWPCookieJar()
            try:
                # Reuse cookies saved by an earlier run, if the file exists.
                self.cj.revert('douban.coockie')
            except IOError:
                pass
            self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cj))
            urllib2.install_opener(self.opener)

        def setinfo(self,username,password,domain,origURL):
            self.name=username
            self.pwd=password
            self.domain=domain
            self.origURL=origURL

        def signin(self):
            i=0
            params = {'form_email':self.name, 'form_password':self.pwd, 'remember':1}
            req = urllib2.Request(
                'http://www.douban.com/login',
                urllib.urlencode(params)
            )
            r = self.opener.open(req)
            # A redirect to the home page is treated as a successful login.
            if r.geturl() == 'http://www.douban.com/':
                print 'Logged on successfully!'
                self.cj.save('douban.coockie')
                self.signed_in = True
                page=urllib.urlopen("http://www.douban.com")
                print page.read()
                return 0
            return 1

The script first simulates logging in to Douban and saves the cookie. Then, relying on the structure of Douban's pages, it reads the friend list and, from each friend's collection, reads the titles of the books that friend has read. The titles are stored in a dictionary so that each one appears only once, with its value counting how many times it was seen. Finally, the dictionary is printed sorted by value (in ascending order, so the most-read books come last).
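The counting and ranking step can also be tried in isolation, without logging in. The following is a minimal sketch, not the author's code: it assumes Python 3 (the listing above is Python 2), swaps the hand-written dictionary for `collections.Counter`, and runs on made-up sample markup. The names `extract_titles`, `count_titles` and `sample_pages` are hypothetical, and stripping whitespace from the titles is an addition the original does not make.

```python
import re
from collections import Counter

# Same title pattern as in the listing above. The sample markup further down
# is a made-up stand-in, not real Douban HTML.
TITLE_RE = re.compile(r'href="http://book.douban.com/subject/.*/">\s*([^<]*)')


def extract_titles(html):
    """Return the book titles found in one page of a friend's collection."""
    return [title.strip() for title in TITLE_RE.findall(html)]


def count_titles(pages):
    """Count how many times each title appears across all pages."""
    counts = Counter()
    for html in pages:
        counts.update(extract_titles(html))
    return counts


# Hypothetical pages for two friends (illustrative data only).
sample_pages = [
    '<a href="http://book.douban.com/subject/1/">\n  Book A\n</a>\n'
    '<a href="http://book.douban.com/subject/2/">\n  Book B\n</a>\n',
    '<a href="http://book.douban.com/subject/1/">\n  Book A\n</a>\n',
]

counts = count_titles(sample_pages)
ranking = counts.most_common()  # most-read titles first
for title, times in ranking:
    print(f"{title}: {times}")
```

Unlike `sortDic`, which sorts in ascending order, `most_common()` returns the highest counts first, which may be more convenient when only the top few titles matter.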

Reposted from: https://www.cnblogs.com/rabby/archive/2010/07/18/1780260.html
