Java: Crawling Proxy IPs and Using Them to Drive UV
Preface
Many sites are not strict about how they count traffic: a page view is treated as a valid visit as soon as the page is opened. Anti-crawler measures, however, keep getting stricter, and some sites will blacklist an IP, so this post builds a small program that crawls proxy IPs and then uses them to drive page views. I originally planned to wrap each proxy IP in a bean class, but none of the crawled proxies turned out to need a username or password, so the IP and port are all that matter; I kept the implementation simple and track only those two.
Module Organization
FileUtil: provides the interfaces for writing crawled IPs and urls to files and reading them back.
CheckUtil: verifies whether a crawled IP is usable.
SpiderUtil: the main crawler module; crawls proxy IPs and the urls whose page views should be driven.
ClickUtil: visits the given urls through proxy IPs.
FileUtil
Provides a write method and a readFile method for appending lines to, and reading lines from, the files under the conf/ directory.
```java
package com.zixuan.add_uv.utils;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class FileUtil {
    // Write one line of data; selection must be "ip" or "url",
    // isAppend decides between appending and overwriting
    public void write(String selection, String data, boolean isAppend) {
        File file = new File(System.getProperty("user.dir") + "/conf");
        if (!file.exists()) {
            file.mkdir();
        }
        try {
            if (selection.equalsIgnoreCase("ip")) {
                file = new File(System.getProperty("user.dir") + "/conf/ip.txt");
            }
            if (selection.equalsIgnoreCase("url")) {
                file = new File(System.getProperty("user.dir") + "/conf/url.txt");
            }
            FileOutputStream fos = new FileOutputStream(file, isAppend);
            fos.write(data.getBytes());
            fos.write("\r\n".getBytes());
            fos.close();
        } catch (Exception e) {
            System.out.println("Failed to write file.");
        }
    }

    // Read a file line by line into a list and return it
    public List<String> readFile(String fileName) {
        List<String> listStr = new ArrayList<>();
        File file = new File(System.getProperty("user.dir") + "/conf/" + fileName);
        try {
            FileInputStream is = new FileInputStream(file);
            Scanner scanner = new Scanner(is);
            while (scanner.hasNextLine()) {
                listStr.add(scanner.nextLine());
            }
            scanner.close();
            is.close();
        } catch (Exception e) {
            System.out.println("Failed to read file.");
        }
        return listStr;
    }
}
```
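A quick usage sketch of the two methods (the proxy entry below is just an illustrative value):

```java
FileUtil fileUtil = new FileUtil();

// Append one "ip port" entry to conf/ip.txt (true = append, false = overwrite)
fileUtil.write("ip", "127.0.0.1 8080", true);

// Read conf/ip.txt back into a list, one entry per line
for (String line : fileUtil.readFile("ip.txt")) {
    System.out.println(line);
}
```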
CheckUtil
Visits Baidu through the proxy IP: returns true if the page can be fetched through the proxy, false otherwise. (The class itself is named CheckIPUtil.)
```java
package com.zixuan.add_uv.utils;

import org.jsoup.Jsoup;

public class CheckIPUtil {
    // Test whether a proxy IP is usable by fetching Baidu through it
    public static boolean checkProxy(String ip, Integer port) {
        System.out.println("Checking: " + ip);
        try {
            Jsoup.connect("http://www.baidu.com")
                    .timeout(1 * 1000)
                    .proxy(ip, port)
                    .get();
            System.out.println(ip + " is usable");
            return true;
        } catch (Exception e) {
            System.out.println("Failed, " + ip + " is not usable");
            return false;
        }
    }

    // Overload that accepts an "ip port" string as stored in ip.txt
    public static boolean checkProxy(String s) {
        String[] strings = s.split(" ");
        return checkProxy(strings[0], Integer.parseInt(strings[1]));
    }
}
```
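As an example of combining this with FileUtil, here is a sketch that rewrites ip.txt to keep only the proxies that still pass the check (the pruning itself is my addition, not part of the original program):

```java
FileUtil fileUtil = new FileUtil();
boolean first = true;
// readFile returns the whole list up front, so overwriting afterwards is safe
for (String proxy : fileUtil.readFile("ip.txt")) {
    if (CheckIPUtil.checkProxy(proxy)) {
        // Overwrite the file on the first hit, append afterwards
        fileUtil.write("ip", proxy, !first);
        first = false;
    }
}
```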
SpiderUtil
Implements the two crawling functions: proxy IPs and target urls.
Crawling proxy IPs: specify a proxy-list site and the number of pages to crawl. Each "ip port" pair extracted from a page is verified with CheckUtil, and the usable ones are appended to ip.txt. The crawler also implements Runnable, so several sites can be crawled on multiple threads for better throughput.
Crawling urls: edit the two placeholder strings in the code, the regex and the page to crawl. For example, to drive page views on QQ Zone you would pass in a user's homepage; the regex would match the urls of the user's posts, and the program would collect those urls and write them to url.txt.
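As a concrete illustration, if the posts of a hypothetical blog lived at https://blog.example.com/post/&lt;id&gt;, the two placeholder lines inside spiderUrl might be filled in like this (the host and pattern are made up for illustration and are not from the original code):

```java
// Hypothetical pattern: match post urls such as https://blog.example.com/post/12345
String urlReg = "https://blog\\.example\\.com/post/\\d+";

// Hypothetical target: crawl the user's homepage on that host
doc = getDocument("https://blog.example.com/" + username, "blog.example.com");
```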
```java
package com.zixuan.add_uv.utils;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.HashSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpiderUtil {
    static FileUtil fileUtil = new FileUtil();

    // Crawl proxy IPs: page through the proxy-list site, extract "ip port"
    // pairs, verify each one and append the usable ones to ip.txt
    public static void spiderIP(String url, int totalPage) {
        String ipReg = "\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3} \\d{1,6}";
        Pattern ipPtn = Pattern.compile(ipReg);
        for (int i = 1; i <= totalPage; i++) {
            System.out.println("Crawling page " + i + "/" + totalPage + "...");
            Document doc;
            try {
                doc = getDocument(url + i, "www.kuaidaili.com");
            } catch (IOException e) {
                System.out.println("Link unavailable, crawl failed: " + url + i);
                return;
            }
            Matcher m = ipPtn.matcher(doc.text());
            while (m.find()) {
                String s = m.group();
                if (CheckIPUtil.checkProxy(s)) {
                    fileUtil.write("IP", s, true);
                }
            }
        }
    }

    // Crawl target urls: extract matching links from a page and append them to url.txt
    public static void spiderUrl(String username) {
        HashSet<String> urlSet = new HashSet<String>();
        String urlReg = "regex matching the target urls goes here";
        Pattern urlPtn = Pattern.compile(urlReg);
        Document doc;
        try {
            doc = getDocument("page to crawl goes here", "host of the crawled site");
        } catch (IOException e) {
            e.printStackTrace();
            return;
        }
        Matcher m = urlPtn.matcher(doc.body().html());
        while (m.find()) {
            urlSet.add(m.group());
        }
        for (String s : urlSet) {
            System.out.println(s);
            fileUtil.write("URL", s, true);
        }
    }

    // Fetch a page with browser-like headers
    public static Document getDocument(String url, String host) throws IOException {
        return Jsoup.connect(url)
                .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
                .header("Accept-Encoding", "gzip, deflate, sdch")
                .header("Accept-Language", "zh-CN,zh;q=0.8,en;q=0.6")
                .header("Cache-Control", "max-age=0")
                .header("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36")
                .header("Cookie", "Hm_lvt_7ed65b1cc4b810e9fd37959c9bb51b31=1462812244; _gat=1; _ga=GA1.2.1061361785.1462812244")
                .header("Host", host)
                .header("Referer", "https://" + host + "/")
                .timeout(30 * 1000)
                .get();
    }

    // Build a Runnable that crawls proxy IPs, for use with a thread pool
    public static SpiderIpExecutor executorBuild(String url, int totalPage) {
        return new SpiderIpExecutor(url, totalPage);
    }

    // Runnable wrapper around spiderIP
    static class SpiderIpExecutor implements Runnable {
        String url;
        int totalPage;

        public SpiderIpExecutor(String url, int totalPage) {
            this.url = url;
            this.totalPage = totalPage;
        }

        @Override
        public void run() {
            if (url == null || url.isEmpty() || totalPage <= 0) {
                System.out.println("Invalid arguments");
            } else {
                spiderIP(url, totalPage);
            }
        }
    }
}
```
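Since SpiderIpExecutor implements Runnable, the crawl can also be started on a plain thread instead of the scheduled pool used later. A sketch, where the proxy-list url is illustrative (the code above assumes www.kuaidaili.com as the host):

```java
// Crawl 10 pages of a proxy list on a background thread
Thread spider = new Thread(SpiderUtil.executorBuild("https://www.kuaidaili.com/free/inha/", 10));
spider.start();
```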
ClickUtil
click: the core implementation of a single visit.
clickAll: takes one IP and visits every url with it, calling click for each one.
ClickExecutor: implements Runnable so that visits can run on multiple threads, speeding up the UV driving.
```java
package com.zixuan.add_uv.utils;

import org.jsoup.Jsoup;

import java.io.IOException;
import java.util.Iterator;

public class ClickUtil {
    // Visit a single url through the given "ip port" proxy
    public static void click(String url, String proxy) throws IOException {
        String proxyIP = proxy.split(" ")[0];
        int proxyPort = Integer.parseInt(proxy.split(" ")[1]);
        Jsoup.connect(url)
                .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
                .header("Accept-Encoding", "gzip, deflate, sdch")
                .header("Accept-Language", "zh-CN,zh;q=0.8,en;q=0.6")
                .header("Cache-Control", "max-age=0")
                .header("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36")
                .header("Cookie", "Hm_lvt_7ed65b1cc4b810e9fd37959c9bb51b31=1462812244; _gat=1; _ga=GA1.2.1061361785.1462812244")
                .header("Host", "target site host goes here")
                .header("Referer", "address of the previous page, i.e. where the visit supposedly came from")
                .timeout(10 * 1000)
                .proxy(proxyIP, proxyPort)
                .ignoreContentType(true)
                .get();
        try {
            // Pause between visits
            Thread.sleep(5 * 1000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    // Visit every url with each proxy IP in turn;
    // after three failures with the same IP, move on to the next one
    public static void clickAll() {
        FileUtil fileUtil = new FileUtil();
        Iterator<String> ips = fileUtil.readFile("ip.txt").iterator();
        while (ips.hasNext()) {
            String ip = ips.next();
            int exceptionFlag = 0;
            Iterator<String> urls = fileUtil.readFile("url.txt").iterator();
            while (urls.hasNext()) {
                String url = urls.next();
                System.out.println("Trying to visit: " + url + "\n using proxy: " + ip);
                try {
                    click(url, ip);
                } catch (IOException e) {
                    exceptionFlag++;
                }
                if (exceptionFlag >= 3) {
                    break;
                }
            }
        }
    }

    // Build a Runnable that runs clickAll the given number of times
    public static ClickExecutor executorBuild(int time) {
        return new ClickExecutor(time);
    }

    // Runnable wrapper around clickAll
    static class ClickExecutor implements Runnable {
        int time = 1;

        public ClickExecutor(int time) {
            if (time > 1) {
                this.time = time;
            } else {
                System.out.println("Invalid count, running once by default");
            }
        }

        @Override
        public void run() {
            for (int i = 0; i < time; i++) {
                clickAll();
            }
        }
    }
}
```
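For a one-off run without the scheduler, a minimal sketch:

```java
// One pass: visit every url in url.txt through each proxy in ip.txt
ClickUtil.clickAll();

// Or repeat the pass 100 times on a background thread
new Thread(ClickUtil.executorBuild(100)).start();
```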
Controler
The program entry point: a scheduled thread pool crawls proxy IPs from several sites in parallel and, after a 30-second delay, starts driving UV.
```java
package com.zixuan.add_uv.controler;

import com.zixuan.add_uv.utils.ClickUtil;
import com.zixuan.add_uv.utils.SpiderUtil;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class Controler {
    public static void main(String[] args) {
        ScheduledExecutorService scheduledThreadPool = Executors.newScheduledThreadPool(8);
        // Crawl several proxy-list sites in parallel; note that spiderIP
        // hardcodes the Host header to www.kuaidaili.com, so these hosts
        // may need adjusting for other sites
        scheduledThreadPool.schedule(SpiderUtil.executorBuild("https://www.xicidaili.com/nn/", 150), 1, TimeUnit.SECONDS);
        scheduledThreadPool.schedule(SpiderUtil.executorBuild("https://www.xicidaili.com/nt/", 150), 1, TimeUnit.SECONDS);
        scheduledThreadPool.schedule(SpiderUtil.executorBuild("https://www.xicidaili.com/wt/", 150), 1, TimeUnit.SECONDS);
        scheduledThreadPool.schedule(SpiderUtil.executorBuild("https://www.xicidaili.com/wn/", 150), 1, TimeUnit.SECONDS);
        scheduledThreadPool.schedule(SpiderUtil.executorBuild("https://ip.jiangxianli.com/?page=", 150), 1, TimeUnit.SECONDS);
        // Collect the target urls, then start clicking after a delay
        SpiderUtil.spiderUrl("xxxxx");
        scheduledThreadPool.schedule(ClickUtil.executorBuild(5000), 30, TimeUnit.SECONDS);
        scheduledThreadPool.schedule(ClickUtil.executorBuild(5000), 60, TimeUnit.SECONDS);
        scheduledThreadPool.schedule(ClickUtil.executorBuild(5000), 90, TimeUnit.SECONDS);
    }
}
```
Summary
That is the whole program: crawl proxy IPs from public proxy lists, verify each one against Baidu, and use the working proxies to visit the collected urls from a thread pool.