The Ultimate Guide to Web Scraping with Node.js
So what’s web scraping anyway? It involves automating away the laborious task of collecting information from websites.
There are a lot of use cases for web scraping: you might want to collect prices from various e-commerce sites for a price comparison site. Or perhaps you need flight times and hotel/Airbnb listings for a travel site. Maybe you want to collect emails from various directories for sales leads, or use data from the internet to train machine learning/AI models. Or you might even want to build a search engine like Google!
Getting started with web scraping is easy, and the process can be broken down into two main parts:
- acquiring the data using an HTML request library or a headless browser,
- and parsing the data to get the exact information you want.
This guide will walk you through the process with the popular Node.js request-promise module, CheerioJS, and Puppeteer. Working through the examples in this guide, you will learn all the tips and tricks you need to become a pro at gathering any data you need with Node.js!
We will be gathering a list of all the names and birthdays of U.S. presidents from Wikipedia and the titles of all the posts on the front page of Reddit.
First things first: Let’s install the libraries we’ll be using in this guide (Puppeteer will take a while to install as it needs to download Chromium as well).
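With npm, all of them can be installed in one go, along these lines (request-promise also needs the underlying request library alongside it):

```
npm install --save request request-promise cheerio puppeteer
```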
Making your first request
Next, let’s open a new text file (name the file potusScraper.js), and write a quick function to get the HTML of the Wikipedia “List of Presidents” page.
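A minimal version of that function, using the request-promise module we just installed, might look like this:

```js
// potusScraper.js
const rp = require('request-promise');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';

rp(url)
  .then(html => {
    // success! log the raw HTML of the page
    console.log(html);
  })
  .catch(err => {
    // handle any request errors
    console.log(err);
  });
```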
Output: the raw HTML of the page.
Using Chrome DevTools
Cool, we got the raw HTML from the web page! But now we need to make sense of this giant blob of text. To do that, we’ll need to use Chrome DevTools to allow us to easily search through the HTML of a web page.
Using Chrome DevTools is easy: simply open Google Chrome, and right click on the element you would like to scrape (in this case I am right clicking on George Washington, because we want to get links to all of the individual presidents’ Wikipedia pages):
Now, simply click inspect, and Chrome will bring up its DevTools pane, allowing you to easily inspect the page’s source HTML.
Parsing HTML with Cheerio.js
Awesome, Chrome DevTools is now showing us the exact pattern we should be looking for in the code (a “big” tag with a hyperlink inside of it). Let’s use Cheerio.js to parse the HTML we received earlier to return a list of links to the individual Wikipedia pages of U.S. presidents.
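Here's a sketch of that step: cheerio.load() parses the HTML we fetched, and the selector 'big > a' matches every hyperlink nested inside a "big" tag:

```js
const rp = require('request-promise');
const cheerio = require('cheerio');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';

rp(url)
  .then(html => {
    const $ = cheerio.load(html);
    // count the matches to sanity-check the selector
    console.log($('big > a').length);
  })
  .catch(err => console.log(err));
```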
Output: 45
We check to make sure there are exactly 45 elements returned (the number of U.S. presidents), meaning there aren’t any extra hidden “big” tags elsewhere on the page. Now, we can go through and grab a list of links to all 45 presidential Wikipedia pages by getting them from the “attribs” section of each element.
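That loop might look something like the following, with Cheerio's attr() helper reading each link's href out of its attribs:

```js
const rp = require('request-promise');
const cheerio = require('cheerio');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';

rp(url)
  .then(html => {
    const $ = cheerio.load(html);
    const wikiUrls = [];
    $('big > a').each((i, el) => {
      // hrefs on this page are relative, so prepend the domain
      wikiUrls.push('https://en.wikipedia.org' + $(el).attr('href'));
    });
    console.log(wikiUrls);
  })
  .catch(err => console.log(err));
```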
Output: an array of 45 Wikipedia URLs, one per president.
Now we have a list of all 45 presidential Wikipedia pages. Let’s create a new file (named potusParse.js), which will contain a function to take a presidential Wikipedia page and return the president’s name and birthday. First things first, let’s get the raw HTML from George Washington’s Wikipedia page.
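One way that first fetch could look:

```js
// potusParse.js
const rp = require('request-promise');

rp('https://en.wikipedia.org/wiki/George_Washington')
  .then(html => {
    // the raw HTML of George Washington's Wikipedia page
    console.log(html);
  })
  .catch(err => console.log(err));
```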
Output: the raw HTML of George Washington's Wikipedia page.
Let’s once again use Chrome DevTools to find the syntax of the code we want to parse, so that we can extract the name and birthday with Cheerio.js.
So we see that the name is in a class called “firstHeading” and the birthday is in a class called “bday”. Let’s modify our code to use Cheerio.js to extract these two classes.
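A sketch of that extraction, selecting the two classes and reading out their text:

```js
const rp = require('request-promise');
const cheerio = require('cheerio');

rp('https://en.wikipedia.org/wiki/George_Washington')
  .then(html => {
    const $ = cheerio.load(html);
    console.log($('.firstHeading').text()); // the president's name
    console.log($('.bday').text());         // the birthday from the infobox
  })
  .catch(err => console.log(err));
```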
Output: the president's name and his birthday, e.g. "George Washington" and "1732-02-22".
Putting it all together
Perfect! Now let’s wrap this up into a function and export it from this module.
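The wrapped-up module might look like this:

```js
// potusParse.js
const rp = require('request-promise');
const cheerio = require('cheerio');

// take the URL of a president's Wikipedia page and
// resolve to an object with his name and birthday
module.exports = function potusParse(url) {
  return rp(url).then(html => {
    const $ = cheerio.load(html);
    return {
      name: $('.firstHeading').text(),
      birthday: $('.bday').text(),
    };
  });
};
```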
Now let’s return to our original file potusScraper.js and require the potusParse.js module. We’ll then apply it to the list of wikiUrls we gathered earlier.
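Put together, that could look like the following; Promise.all fires off all 45 page requests in parallel and resolves once every page has been parsed:

```js
// potusScraper.js
const rp = require('request-promise');
const cheerio = require('cheerio');
const potusParse = require('./potusParse');

const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';

rp(url)
  .then(html => {
    const $ = cheerio.load(html);
    const wikiUrls = [];
    $('big > a').each((i, el) => {
      wikiUrls.push('https://en.wikipedia.org' + $(el).attr('href'));
    });
    // fetch and parse every president's page in parallel
    return Promise.all(wikiUrls.map(potusParse));
  })
  .then(presidents => console.log(presidents))
  .catch(err => console.log(err));
```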
Output: an array of 45 objects, each holding a president's name and birthday.
Rendering JavaScript Pages
Voilà! A list of the names and birthdays of all 45 U.S. presidents. Using just the request-promise module and Cheerio.js should allow you to scrape the vast majority of sites on the internet.
Recently, however, many sites have begun using JavaScript to generate dynamic content on their websites. This causes a problem for request-promise and other similar HTTP request libraries (such as axios and fetch), because they only get the response from the initial request, but they cannot execute the JavaScript the way a web browser can.
Thus, to scrape sites that require JavaScript execution, we need another solution. In our next example, we will get the titles for all of the posts on the front page of Reddit. Let’s see what happens when we try to use request-promise as we did in the previous example.
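The attempt, using the same request-promise approach as before:

```js
const rp = require('request-promise');

rp('https://www.reddit.com')
  .then(html => console.log(html))
  .catch(err => console.log(err));
```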
Output: an HTML skeleton without any of the post titles we're after.
Hmmm…not quite what we want. That’s because getting the actual content requires you to run the JavaScript on the page! With Puppeteer, that’s no problem.
Puppeteer is an extremely popular new module brought to you by the Google Chrome team that allows you to control a headless browser. This is perfect for programmatically scraping pages that require JavaScript execution. Let’s get the HTML from the front page of Reddit using Puppeteer instead of request-promise.
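A sketch of that fetch with Puppeteer driving headless Chromium (the helper name getRedditHtml is just for illustration); page.content() returns the HTML after the page's JavaScript has run:

```js
const puppeteer = require('puppeteer');

async function getRedditHtml() {
  // launch a headless Chromium instance and open a new tab
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.reddit.com');
  // grab the rendered HTML, then clean up the browser
  const html = await page.content();
  await browser.close();
  return html;
}

getRedditHtml()
  .then(html => console.log(html))
  .catch(err => console.log(err));
```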
Output: the fully rendered HTML of Reddit's front page.
Nice! The page is filled with the correct content!
Now we can use Chrome DevTools like we did in the previous example.
It looks like Reddit is putting the titles inside “h2” tags. Let’s use Cheerio.js to extract the h2 tags from the page.
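Combining Puppeteer with Cheerio.js, the extraction might look like this (again, getRedditTitles is just an illustrative name):

```js
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function getRedditTitles() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.reddit.com');
  const html = await page.content();
  await browser.close();

  // pull the text out of every h2 on the rendered page
  const $ = cheerio.load(html);
  return $('h2').map((i, el) => $(el).text()).get();
}

getRedditTitles()
  .then(titles => console.log(titles))
  .catch(err => console.log(err));
```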
Output: a list of the titles of all the posts on Reddit's front page.
Additional Resources
And there’s the list! At this point you should feel comfortable writing your first web scraper to gather data from any website. Here are a few additional resources that you may find helpful during your web scraping journey:
- List of web scraping proxy services
- List of handy web scraping tools
- List of web scraping tips
- Comparison of web scraping proxies
- Cheerio Documentation
- Puppeteer Documentation
Translated from: https://www.freecodecamp.org/news/the-ultimate-guide-to-web-scraping-with-node-js-daa2027dcd3/