Python Scrapy: Tackling ASP.NET Pagination, Passing Parameters Between Request and Response, and Distinguishing Incoming Items in a Pipeline
The previous article outlined the general architecture of a Scrapy crawl. This article goes into detail on a few specific technical problems.
1. How to handle ASP.NET pagination
We again use the Shenzhen real-estate information system as the example.
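I used to write ASP.NET for a long time, so I know that many .NET controls are built by drag and drop. A lot of the code never has to be written by hand because it is generated automatically. The "next page" action here is driven by such generated JavaScript, and the Scrapy framework cannot execute JavaScript. We do know, however, that the link calls the __doPostBack function, so let's look at how __doPostBack is defined.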
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}

Now let's look at how Form1 is defined:
<form name="Form1" method="post" action="index.aspx?javascript%3a_doPostBack('AspNetPager1'%2c'2')" id="Form1"> <input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value=""> <input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value=""> <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dghA4AVaavTQEp48h3UsU0pxBkQIXyBt5k9xMnYc7lAPiW4jhB6DeTz+bLHqNiicT340W3E8oWRZP/u/oso9HuTTmRjM7qcpd2VbxKKgRY9CT0ZM3xlJEZLNtaTdBldZpfmozLsjBqdp/jzVFyHqrtybakbNR4CK2KXFmDJkIkynac0a5GIVwD0w2VYDmh40llArcntW86hbqVAUcLnh7aybgU5zdn0uuBbpxHJ/e8POBnkadJvDPV/zThfxPqsZs2wP9+NJL//WDUQQ6/exRnA3mdSsrOfeT1O6Tpl7Z0blgxzhBpOthPwgPdYBx8nLqeEoObijxvjFqQm8F8YaaElSJYJbsa8VRhd2dreBqDXItev6MlKHFy/TYA8CizrBIozDGzvVvZsBLNVxLvi8kVoYH7FvWF5Bf8WJ40ADdXwR6DSdbFFi8EdaQe2AVkZKo5pmNFSmKAJQgE6CnN3itTkSld5SHI3CXHlYluq6FONEnBdRnUH73OjURAWdcrAmK9gGicsfygadkCOG4QS5WVZ+9GEj0uFpzbk0G3nqtali/aEuZWmMz6Kz6bgMhlKzGGNzZs3RPHk5CtoZXlpnUCCVkqvxsg70xawRoqf/J//ETdGSwEtzUeDu7MS9k9fGkKJkIpgBqEdektdEu4ZOH1XaYjNk4/wX6KLASaBpd45JVaA4nShBtvinzME2vBtp4dJNlx9e0evmmmGaZ6g6rr84rgASw/QR3BR5BS5ND0QZ0lhhWLP521XcfTbtg8eWDDb4bFrvqkI5dYsBhtg4QjkQ+3DAyzWqO1nmpalAuzkLemqkSzWmBVrbx1pbhkLF/bR2dBnGdNWJPNKiKId2F+y8+QorPEkK7dykHUVw1wos5CQXYRsbXHJ7WMiP6Gm5Zz+llsbrfKz4pVJxxLjpKQbvq+pNseUV1prTxOOxPDhfl52c7ZNJNzsvjTpKu5Z7WiLUEVRK/71AAawaIMfzyaKKs+qCcn0l0b8F9FcHtHWgJgpNtmgB7Sv86b+cfF412X3puqMzzGdFYaQwDdT3M9AEOn7rTLbwAjTPA4mSmtfT7qyVnve4/MEZ15xjnvgUx36F6UnwSa0WKdmpNbdpTxquJE2/y8jLQtbJcHv0P0eCMl6NAJmZWI0s6wa8iowWRC2nuEWnTlK/RYpwwCfaiyh0iHKEKnCmWaeLMLsuXPO6k9w/V6xvtL/ndjE4q9YF0RV3oB5+2Af+4GCOWdELqgjH+/sf3uQNDlCOUmNlFDMZ4/7WXKGGwpbDK/KlSHyyRhY6RUB/ZvBSQ/xcYwUB5o2i8WUhcU6ZHMS21Tz5qkeHkp1nO0TMmpcyd/ehfZF7Mv8zI2A0u3kNZut7DRaARHiXXeZg25TeZrQIGe8TudSAV3k+vdg/ISEa8imP3tVQMV1OqHOGshgzpvrxY5mnSFJOeHxIeBFxCDIyuLKVbNQ19VPZLGqxgJSNQHdD0o6pETQor/IJhwVMhCXcV5JYmfx9vE+3SZy26GFxRck47ExVAgzSq4zFqkRoD8Q1jW5roRlwrxg4g46Wr1AbuLeSWmXTq2addliIVbI3W7oyGqQy9ox5ranMcupogFCQVgl8VshD7l9AqW7F8FvOVRHM9I4haQBm3sQz0GciNzPeoPEavJsMLJdn4zHTaZGhaMWCj6ADIXrfdpWFQ9
AWzmAJ/FQdcrXDh2wohvydjPBB+uhv2cUY++wx/yp1ylfa9DA/Dlrbi3We9wJ74BiBB2uNe0rUjwzGGZ418u5yHdCrLme57Af75aXiQkhLjeXgRcE1N+kQ5aJhqv6+TFsG0cW/Rwu3cUW9cdOUjyXDmPk//GCizaGAoHlu2AzSlLpKtWC+GdUe6WtEmpkPlxsC6kecYSbx0cZBVBhdNNcG16b6mpg+WnIJROMJAfv+VIyRmMochjwssPL+0uXh/UqQlasRv8u4POVJWls/OKASqxfUVae2k79/6SYFb82buISfV5AZlw8YnVDCNQ7bPubKUnDyMzcN9fxjZbnTFOdVaQBAzjd2DB/OTAR+/tIDvWEFK+YpNaiyFl+HrJFfWGQ+9+ryPGSwQJKa4uVsMrsF2/jQa8i2c0ce+S7ya9YX6P4PRhEkyswvLua51VkFJlnkLerbN/C7yLuvtAy5gXAhSUsbSuSYXDP+RJZIiiefWMKPQdAjIIXwt2YiHNgoMHy+2L2TROXkTR2pxx+m7s4n71NiMOU1rVsEh2vLq7JocQKNjWPj2ZEFZyOQ7FnfHl98LBKrCGT6Y2DFz9xAZ/VGzeJrVVE2vHu+qLmf9l3exE1sGFI9NlmNjIWGhYJMbQWbEBfqgvjXtPQ8W6RSrGm1sBuPc+vGOsDRwNXC4sPSVRvG5yXLEExMNdvdSCQaX5hUPXewDDXZi0COpaEofhP4esF4UQL4p+P+16QV9J37cI0rNgAHNmG6124VxnGFFO6vvZeAXW5Zo9ymyDCvM7fHoM3sLsAwTnvzgv2mGrtKGlAd8v+QGWkP1BkH6i9Lu3mZ3bi0aq1e1dts3ASWzwGJPP8eED1wlinCNAcZ+q9WiGujKADo4dhPbTBBEc4mn/yCdQlZ5/v2Dw27a6C8qnb3dSmFSKLrso0h2zPXgJSWTsz2ZH3MjTi+wFk+GOU7EdmJwxNYSqOFapOJfKZjd8QNEVmR27PEY1bigK0UgLWmBNdZ8xhiBwzQIP/WCvqVsS2fPQj3OvEHfB44c9hGLggKt3u5ybcGmcp2NCSDQQao3Xu2e4q2jKpRw2BlluCJoX7sIDZ9fpRZbQ9Mb5Ik0FSuxve2317J4R9nHQuvKXcacom8/vciaCoBlDBPeOmdNxa6JW7AyWjX4rBB6kDj+Jrj2+oo3BubmNtf4yl3HovYev+FZ6GhBOzze0EMJBIufW8AZ6yUHwOi4VeGg2qCXC9YtVjZ7PiZjwk+tv+t3BxWvJ6XvmyOfVt3FX+ufHueKTxY/HnCburCuU66A6I4rpZp1DOOuak5XY7pS9BLaHFu1KoZLuLUnynNP8pPK9dXtMMlbMzR2p+Xn53C8Jfjq+rMKq9zn/386cAEwVlhdQ1fFBkhJ5BK0whGaGAscoq7NCWkJn+SmyZTES8HgIea+QQnDCPIYt+ie/SWdZx+BrOqGnCMdhivhRgdi+3fdli2mBUfaioyCeU4YUwGX76Rfdp+OqNr8jqryrGJVl5z08YMqtNbKPHTPVxFAqhgda/c6iBOiXZhkS7TA6d17NfVQ1Oc8db1oEEaq5Kb2y7Y20/tAMYo5jnwgcO7ceSpkdbMQY1O6OL9tMb3OKl0T9J1F7FaB2iRQN7qZW9KMWbT4Dd2h2gIlQgpNxAplxJgmOxysLMwS4nzWHeSH6nG4Bt2Z4S5nOrF+pVCpLaVBpnezDha6l6Sw3AmncG+Wl6GFwe86EDRyYkxuvDX0rN+K6J5J+L81XHvusq+NhoN7duYdlIFuhMoIkHSmUKWJxwDBCd9Npf0XNpkC0Etbsj3GNvE3GrsQqtE5HJfwfp8Vc6ndGWjZ27/X19Xy6XC4augblHJZPf3A9jFL1pt+Uw0XnmBvf2L/Z0amKeoD/dCRLp7QMfY2W15U5b63FzVJ9VtVl5tZcfUn0duiH/Mu9QGFKfLELXtnZp72rfeau0rjd7mQVpxkOLfuKMeYIekZJwz24CaA+5ariA3RTxj/ei900
JfT168vn53wr3Wl3U5K5PJeDmMOF2fbAq4pUTgcUqNK9zwVGmFxefkF+NjtP+s7vBqqePLY7Ak6gzUbfptrzR3VUmZBCp3yNQkD1znBRk/JtI7RPhgNUbBZHDNBWGjNF3h/UDa86wWuiChVk7A9lm7Wh0X1zTnOA+98j9NJtPzPqHNaP8SvnGCYFZp7av5ORCCa/gP1UwmhZ642pMwD37qOkeldvVJ2Q6FcUmPWNIBdk/n3XexbbVXv+6YxHizWw8Lczw5WGPCw1nxnR6qa8eNdXG9+1FQqFkj4vg4rdfPdFEJzELjZivAJrp2LMS9CMm4ENdsv5iH2wHZWkQA7qRgcoPEtOoRpbI4OJFXX7qN+jGMGljosu8Ouc5hRo7iwWcqvuUtQ7QJjB9K4AcmcLM3JZUuehHWtSU+VFuzWhiApt5CNab/+QoyRKitAX/2nbQ0CVdkBULeTUtdlInjLN83LZphMU5XIlCZkAl8MJkQrWUKn+wQECjbg/QynXKrhYqqQ1PBdSMKNUcGjdtwJyhuXJKV1uWjJqvhkuQPYNN9WYykR+9vtUu8EsFUzuhNIn0bKiLtsBpmsby22Z8NJAXzOSnm+ofCrUzXY8cvZquK40ZV9IDN/DA/0n7a6uhxwxIDAAM9kJiRlWZTHTVMUln/xqWl2ZSO2rqycIwX6mtO/g19KRSiagfpo2QspLvhstpxDbeYWme6+bzSGNgGNQe5nc1HqD0zAqEOgOJq7AHP7Aut4HPlf/1VyhPJkyVH1XMmNBK/ys6qwEyBG3a6eicXwuA+ulH7sWoLC2TQ2RwbWIvQoAbKMrrqVKucDJAAccMKZKFHwmF1zbHQl6YOu6Js0cujCMn2v4PuOcoP5HfZTwP1sik1TVQ8ffhS+hO9XLruD4shuiPhNyEEGc2AiCoa/bm6HCupa+XKcq+MUV6J69G0abtGRmhsCaV0cTPjcPop73VpinFtHwKsxqiTAnFU44K5MbNOueH7Fg5iS6SaY/qBbFS1T8A0eg0eiKGZO/C7whMyUJufNnxeGcC6558Kn78Fdrl0xQxGwMUSrlxPKtdaEoWnSbbGmGjVE3MFJZwMtZ5t9xWA6TxmYpeGxgK2Kz9FO2VUqcu1EzOPKkqyXMqXEM02U8jz/QNbaR2w7f6O64ivT5PilQOi+fgUJUvMt0TAva94W3BWZEb6BPQxDcX3OvOR1KkNsChxOfLFVxrUlltx5sSPNhP/8eTBkNp+UJ0QPEhsjKyktgguazWGEk9nc5n7K24cXOIF+CVRW8O5UWwuyMl7P0S8z8MvtnHLtk0jfGFYpIHaOTW8v/4G5ohPH8JYETBAXTaaq6zIYCriq3/KRuRFZ1sJ0MRX43ryax7H51St07rQeDerdoiSqEnHnMxqKxMow8Bo37Cis/rtnyPyqsdaCIxKMnrEi0o/ak//I1sVmNovicpeAQguv9rs8Jqq/OC3hNa0ePqV87TjoICDBezoqjd0qGOkAZ8tCb63pklF+QlyEol25zWoRSwmm7ngLIV/DeMp6aVPj+5NCQdz0uyL50eT2BnEf0fx4kKKvqjMI736r3g82ap+VlCRpzyGwE9xD12JdAKI45PpKvG+mbLSbQZYsskfhWT9C0WU5u4NVhC2ETPqdGdzZGhY41J49O4MSIrA2k8XaJ4nb4jaOoT6KD1HFKS2p0N/HNvNNw0N+taMFXgtT7Gjaz20Yim5A0Ltqypt2dvHm/NEXrf3Or5N0HEvUgvroOHwq432ZLdssutaw/883bmGwla1fKmzI+w04jVt+fnw2XfEYBrZNPE5ckYah4Eo+qUdji6/wNaR5vFQNrfPDOBoME7TZEEYUdBsm83KAIvdHtpHp+eMJ2+Xx+j/oqOIrCdrMliQL5xjZ4nymztxBL/g3WFlv4jXiugNa8rItrSWh6x3thsdNY7XE1a4UPxvUsuih7CKi/ahD+6jvzKn8Wvm7uw2QOyupcstQBwwg2WeM+heJFEgkmVCzeWnc6
08mGfUjKPISs+f5ck1uwjh4ciNhu7rjG7RKDURAMHFrV99893NEM+V5eOItuvjgrzEWQiESlJiPQ4MhsWZGIZD72UWvNzricvkAcrf3SYjvVd+AKBFpEHyCR4r+VvuP6OlRZqbTHHG45J+ZB92BA6eO2LgY41uVe2exi7gWvjP0FlXoFa6EDozxNZrj+jrxJ9K6w4r8VWZ/i3HnDHjNykDG/v9E2uR5ydnM0QSxuFPqDOrDZhFdmPgLjj1xY7o6sLUxXU9zdY1W4in+2Xa2pNrHgzsFdoKfM6p2TNtCp2HdLhijUGkm5NYjepBHpvyRGSVYNffwX8F/F3XcsXk5y4KOvaGeSi2Dy2AJUKtgbiy5tFE9inzZcOh+OBFlYFAwA5EEIQU6gTSYZ8zaZ0Ybw+7s9EUooqam9Dp4rv13AhAsA5WODEPHt25eXhrl7/8bGUggMojGU8sApfi6L/lfbsqZAh1hI6QHgLfwHLRXgnDxfylHgCs3FZXYWY58iOscfxt0gK1mU9EHySaQFNRzw8vUQ+jT/wfFIrEafPsMREPc1/eEHbKAdZVG92G/MDelrrj">
The form contains many hidden controls. Their values are long and generated automatically, and I am not sure exactly what all of them mean.
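What we do know is the principle: pagination works by sending another POST request to the same page, which is why the URL never changes. Since a POST is involved, which parameters does it carry? Then it struck me: rather than studying how the JavaScript assembles them, it is easier to capture the traffic and see what is actually sent.
Here I use Wireshark, a tool I reach for often; it is simple to use and the output is clear.
After clicking "next page", we can see that an HTTP request is sent using the POST method.
Expanding the request shows every parameter listed one by one. The parameters sent are exactly the hidden fields on the page.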
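The idea above can be sketched without Scrapy, using only the Python standard library: read the hidden inputs from the page, set the two fields that __doPostBack fills in, and URL-encode the result as the POST body. This is a minimal sketch under stated assumptions, not the code used in this article; the names HiddenInputCollector, parse_postback and build_postback_form are hypothetical, and sample_html is a shortened, made-up page.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlencode

# Matches href values such as: javascript:__doPostBack('AspNetPager1','2')
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")


class HiddenInputCollector(HTMLParser):
    """Collects the name/value pair of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def parse_postback(href):
    """Return (event_target, event_argument), or None if href is not a postback link."""
    match = POSTBACK_RE.search(href or "")
    return match.groups() if match else None


def build_postback_form(html, href):
    """Build the form fields a browser would submit for this postback link."""
    postback = parse_postback(href)
    if postback is None:
        return None  # e.g. the last page, where "next page" is no longer a link
    collector = HiddenInputCollector()
    collector.feed(html)
    form = dict(collector.fields)
    # __doPostBack overwrites these two hidden fields before calling submit().
    form["__EVENTTARGET"], form["__EVENTARGUMENT"] = postback
    return form


# Hypothetical, shortened page used only for illustration.
sample_html = """
<form name="Form1" method="post" action="index.aspx" id="Form1">
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="">
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="">
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc+/=">
<input type="text" name="tep_name" value="">
</form>
"""
sample_href = "javascript:__doPostBack('AspNetPager1','2')"

form = build_postback_form(sample_html, sample_href)
body = urlencode(form)  # the string that would travel as the POST body
```

Note that this sketch copies only the hidden fields. Visible inputs such as the search boxes are not included and would have to be added by hand if the server expects them.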
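With that in mind, here is the pagination code I wrote. It walks the pages recursively, with parse scheduling itself as the callback for the next page.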
import scrapy
from rishome.pipelines import RishomePipeline  # module path assumed from the project name

# Method of the spider class
def parse(self, response):
    context = response.xpath('//tr[@bgcolor="#F5F9FC"]/td[3]')
    dbhelp = RishomePipeline()
    for item in context:
        title = item.xpath('a/text()').extract_first()
        idstr = item.xpath('a/@href').extract_first()
        idstr = idstr[idstr.find('=') + 1:]
        # Stop as soon as we reach a project that is already stored.
        if dbhelp.ispropertyexits(idstr):
            return
        request = scrapy.Request(url='http://ris.szpl.gov.cn/bol/projectdetail.aspx?id=' + idstr, method='GET', callback=self.showdetailpage)
        yield request

    # Pagination: assemble the post_data dict. A POST request has to be sent with
    # scrapy.FormRequest(url=response.url, formdata=post_data, callback=self.parse, dont_filter=True).
    next_page = response.xpath('//*[@id="AspNetPager1"]/div[2]/a[3]/@href')
    next_href = next_page.extract_first()
    # On the last page the "next page" button is no longer a link, so it has no
    # href. That marks the end of pagination, i.e. the end of the recursion.
    # Checking here also avoids calling split() on None.
    if not next_href:
        return
    pnum = next_href.split(',')[1].replace("'", "").replace(")", "")
    post_data = {
        "__EVENTTARGET": "AspNetPager1",
        "__EVENTARGUMENT": pnum,
        "__VIEWSTATEENCRYPTED": "",
        "tep_name": "",
        "organ_name": "",
        "site_address": "",
        "AspNetPager1_input": "1",
    }
    # Copy the generated hidden fields from the current page; default to ''
    # so a missing field does not put None into the form data.
    post_data['__VIEWSTATE'] = response.xpath('//*[@id="__VIEWSTATE"]/@value').extract_first(default='')
    post_data['__VIEWSTATEGENERATOR'] = response.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value').extract_first(default='')
    post_data['__EVENTVALIDATION'] = response.xpath('//*[@id="__EVENTVALIDATION"]/@value').extract_first(default='')
    if pnum:
        yield scrapy.FormRequest(url=response.url, formdata=post_data, callback=self.parse, dont_filter=True)

2. How to pass parameters between request and response
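Sometimes the contents of two pages have to be merged into a single item. In that case, when yielding a scrapy.Request, you also need to pass some values along to the callback that handles the next page. It can be done like this: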
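    # Inside the callback for the first page: attach the value to the request.
    request = scrapy.Request(houseurl, method='GET', callback=self.showhousedetail)
    request.meta['biid'] = biid
    yield request

def showhousedetail(self, response):
    house = HouseItem()
    # The value set on request.meta is available again on response.meta.
    house['bulidingid'] = response.meta['biid']

3. Distinguishing the items that arrive in a pipeline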
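Every page callback wraps its data in an item and hands it to the pipelines for processing, but a pipeline has only one entry point for receiving items: the function def process_item(self, item, spider). The following is one way to tell the items apart inside it.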
def process_item(self, item, spider):
    # Compare the item's type name to decide which save method to call.
    if str(type(item)) == "<class 'rishome.items.RishomeItem'>":
        self.saverishome(item)
    if str(type(item)) == "<class 'rishome.items.BulidingItem'>":
        self.savebuliding(item)
    if str(type(item)) == "<class 'rishome.items.HouseItem'>":
        self.savehouse(item)
    return item  # the item must be returned
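Comparing type names as strings works, but it breaks silently if a module is renamed. A possible alternative is to dispatch with isinstance through a class-to-handler mapping. The sketch below is not the code used in this article: the three item classes are hypothetical dict-based stand-ins for the ones in rishome.items, and the save methods only record what they receive instead of writing to a database.

```python
class RishomeItem(dict):
    """Stand-in for rishome.items.RishomeItem (hypothetical minimal version)."""


class BulidingItem(dict):
    """Stand-in for rishome.items.BulidingItem (hypothetical minimal version)."""


class HouseItem(dict):
    """Stand-in for rishome.items.HouseItem (hypothetical minimal version)."""


class RishomePipeline:
    def __init__(self):
        # Records (kind, item) pairs; a real pipeline would write to storage.
        self.saved = []
        # Maps each item class to the method that stores it.
        self.handlers = {
            RishomeItem: self.saverishome,
            BulidingItem: self.savebuliding,
            HouseItem: self.savehouse,
        }

    def saverishome(self, item):
        self.saved.append(("rishome", item))

    def savebuliding(self, item):
        self.saved.append(("buliding", item))

    def savehouse(self, item):
        self.saved.append(("house", item))

    def process_item(self, item, spider):
        # Call the first handler whose class matches; unknown items pass through.
        for item_class, handler in self.handlers.items():
            if isinstance(item, item_class):
                handler(item)
                break
        return item  # always hand the item on to the next pipeline


pipeline = RishomePipeline()
house = HouseItem(bulidingid="42")
result = pipeline.process_item(house, spider=None)
```

Adding a new item type then only requires one more entry in the handlers mapping.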
