htmlcleaner download: htmlcleaner2_1.jar. Source download: htmlcleaner2_1-all.zip
Write an HTML file to test with: html-clean-demo.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-CN" dir="ltr">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=GBK" />
    <meta http-equiv="Content-Language" content="zh-CN" />
    <title>html clean demo</title>
</head>
<body>
<div class="d_1">
    <ul>
        <li>bar</li>
        <li>foo</li>
        <li>gzz</li>
    </ul>
</div>
<div>
    <ul>
        <li><a name="my_href" href="1.html">text-1</a></li>
        <li><a name="my_href" href="2.html">text-2</a></li>
        <li><a name="my_href" href="3.html">text-3</a></li>
        <li><a name="my_href" href="4.html">text-4</a></li>
    </ul>
</div>
</body>
</html>

Simulated requirement: extract the title, the links with name="my_href", and the content of every li under the div with class="d_1". The code below does this with htmlcleaner, in HtmlCleanerDemo.java:
package com.chenlb;

import java.io.File;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class HtmlCleanerDemo {

    public static void main(String[] args) throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();

        // Parse the test file as GBK, matching the charset declared in its meta tag
        TagNode node = cleaner.clean(new File("html/html-clean-demo.html"), "GBK");

        // Look up elements by tag name, searching the whole tree (recursive = true)
        Object[] ns = node.getElementsByName("title", true);
        if (ns.length > 0) {
            System.out.println("title=" + ((TagNode) ns[0]).getText());
        }

        // Look up elements with an XPath expression
        System.out.println("ul/li:");
        ns = node.evaluateXPath("//div[@class='d_1']//li");
        for (Object on : ns) {
            TagNode n = (TagNode) on;
            System.out.println("\ttext=" + n.getText());
        }

        // Look up elements by attribute value (recursive = true, case-sensitive = true)
        System.out.println("a:");
        ns = node.getElementsByAttValue("name", "my_href", true, true);
        for (Object on : ns) {
            TagNode n = (TagNode) on;
            System.out.println("\thref=" + n.getAttributeByName("href") + ", text=" + n.getText());
        }
    }
}

The argument to cleaner.clean() can be a file, a URL, or a string holding the HTML content. In my opinion, the methods you will use most often are evaluateXPath, getElementsByAttValue, and getElementsByName. One more note: htmlcleaner copes fairly well with HTML that is not well-formed.
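If you want to try the XPath expressions from the demo without the htmlcleaner jar on the classpath, here is a rough sketch that runs them with the JDK's built-in javax.xml.xpath instead. This is a stand-in, not htmlcleaner's own evaluateXPath: the class name XPathSketch, the helper texts(), and the trimmed DEMO string are all made up for illustration, and the DOCTYPE and xmlns are left out on purpose so the parser does not try to fetch the DTD or need namespace handling. It only works because the test page happens to be well-formed XML, which is exactly the restriction htmlcleaner removes for real-world pages.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; uses only the JDK, not htmlcleaner.
class XPathSketch {

    // Trimmed copy of html-clean-demo.html (assumption: DOCTYPE and xmlns dropped).
    static final String DEMO =
        "<html><head><title>html clean demo</title></head><body>"
      + "<div class=\"d_1\"><ul><li>bar</li><li>foo</li><li>gzz</li></ul></div>"
      + "<div><ul>"
      + "<li><a name=\"my_href\" href=\"1.html\">text-1</a></li>"
      + "<li><a name=\"my_href\" href=\"2.html\">text-2</a></li>"
      + "</ul></div></body></html>";

    // Parse well-formed markup and return the text of every node the XPath selects.
    static List<String> texts(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // For attribute nodes, getTextContent() returns the attribute value.
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (Exception e) {
            // Malformed input ends up here; htmlcleaner would repair it instead.
            throw new IllegalStateException("could not evaluate " + expr, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("title=" + texts(DEMO, "//title"));
        System.out.println("li=" + texts(DEMO, "//div[@class='d_1']//li"));
        System.out.println("href=" + texts(DEMO, "//a[@name='my_href']/@href"));
    }
}
```

The middle expression is the same one passed to evaluateXPath in HtmlCleanerDemo. Feed texts() an unclosed tag and it throws, whereas htmlcleaner is designed to clean such input up first.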
Reposted from: https://www.cnblogs.com/lchzls/p/6282704.html