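heritrix3.x--SURT / Restricting Heritrix's crawl scope
The attribute "surt" shows up often in Heritrix 3.x CXML configuration files. What is it? It is an abbreviation, and a fairly obscure one, so the name alone does not tell you much. Here is the full official explanation: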
Sort-friendly URI Reordering Transform. Converts URIs of the form: scheme://userinfo@domain.tld:port/path?query#fragment ...into... scheme://(tld,domain,:port@userinfo)/path?query#fragment The '(' ')' characters serve as an unambiguous notice that the so-called 'authority' portion of the URI ([userinfo@]host[:port] in http URIs) has been transformed; the commas prevent confusion with regular hostnames. This remedies the 'problem' with standard URIs that the host portion of a regular URI, with its dotted-domains, is actually in reverse order from the natural hierarchy that's usually helpful for grouping and sorting. The value of respecting URI case variance is considered negligible: it is vanishingly rare for case-variance to be meaningful, while URI case-variance often arises from people's confusion or sloppiness, and they only correct it insofar as necessary to avoid blatant problems. Thus the usual SURT form is considered to be flattened to all lowercase, and not completely reversible.
Source: http://crawler.archive.org/apidocs/org/archive/util/SURT.html
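Put simply, SURT converts a conventional dotted hostname into a reordered, unambiguous form (most general part first), and this is the form used in the configuration file.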
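To make the transformation concrete, here is a minimal sketch of the idea in plain Java. It is not Heritrix's own org.archive.util.SURT implementation; the class name SurtSketch and the method toSurt are hypothetical, and the sketch assumes a well-formed URL with a host (no handling of IP addresses or malformed input).

```java
import java.net.URI;
import java.util.Locale;

/** Simplified illustration of the SURT idea; NOT Heritrix's org.archive.util.SURT. */
final class SurtSketch {

    /** Rewrites scheme://[userinfo@]host[:port]/path?query#fragment into a SURT-like form. */
    static String toSurt(String url) {
        URI u = URI.create(url);
        StringBuilder sb = new StringBuilder();
        sb.append(u.getScheme()).append("://(");

        // Reverse the dotted host labels: test.blogs.com -> com,blogs,test,
        String[] labels = u.getHost().split("\\.");
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]).append(',');
        }

        // Port and userinfo follow the reversed host inside the parentheses.
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        if (u.getRawUserInfo() != null) {
            sb.append('@').append(u.getRawUserInfo());
        }
        sb.append(')');

        // Path, query and fragment keep their original order.
        if (u.getRawPath() != null) {
            sb.append(u.getRawPath());
        }
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        if (u.getRawFragment() != null) {
            sb.append('#').append(u.getRawFragment());
        }

        // The usual SURT form is flattened to lowercase.
        return sb.toString().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // prints http://(com,blogs,test,)/between_the_lines/
        System.out.println(toSurt("http://test.blogs.com/between_the_lines/"));
        // prints http://(com,example,:8080@user)/path?q=1#frag
        System.out.println(toSurt("http://user@Example.COM:8080/Path?q=1#Frag"));
    }
}
```

Because the most general label comes first, all URIs of one domain and its subdomains share a common string prefix, which is what makes SURTs convenient for sorting and prefix matching.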
Configuration example: http://tech.groups.yahoo.com/group/archive-crawler/message/7375
<bean class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <bean class="org.archive.modules.deciderules.RejectDecideRule" />
      <bean
        class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
        <property name="seedsAsSurtPrefixes" value="false" />
        <property name="surtsSource">
          <bean class="org.archive.spring.ConfigString">
            <property name="value">
              <value>
                +http://(com,blogs,test,)/between_the_lines/page
                +http://(com,blogs,test,)/between_the_lines/archive
              </value>
            </property>
          </bean>
        </property>
      </bean>
      <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
        <property name="regexList">
          <list>
            <value>^http://test\.blogs\.com/between_the_lines/$</value>
            <value>^.*index.html*$</value>
          </list>
        </property>
      </bean>
    </list>
  </property>
</bean>
The effect of the configuration above is to crawl pages containing index.html under the following directories:
http://test.blogs.com/between_the_lines/
http://test.blogs.com/between_the_lines/page*
http://test.blogs.com/between_the_lines/archives*
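The following sketch illustrates how a SURT prefix list of this kind can be read: a URI counts as in scope when its SURT form starts with one of the listed prefixes. This is a simplified model written for illustration, not the code of SurtPrefixedDecideRule; the names SurtPrefixSketch, toSurt and inScope are hypothetical, the leading "+" of the configuration lines is dropped, and port and userinfo are ignored.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Illustrative prefix check; a simplified model, not Heritrix's implementation. */
final class SurtPrefixSketch {

    /** The two prefixes from the configuration, without the leading '+'. */
    static final List<String> PREFIXES = Arrays.asList(
            "http://(com,blogs,test,)/between_the_lines/page",
            "http://(com,blogs,test,)/between_the_lines/archive");

    /** Simplified SURT: reversed host labels plus path (port and userinfo ignored). */
    static String toSurt(String url) {
        URI u = URI.create(url);
        String[] labels = u.getHost().split("\\.");
        StringBuilder sb = new StringBuilder(u.getScheme()).append("://(");
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]).append(',');
        }
        return sb.append(')').append(u.getRawPath())
                .toString().toLowerCase(Locale.ROOT);
    }

    /** True when the URL's SURT form starts with any of the given prefixes. */
    static boolean inScope(String url, List<String> prefixes) {
        String surt = toSurt(url);
        for (String prefix : prefixes) {
            if (surt.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] candidates = {
            "http://test.blogs.com/between_the_lines/page2.html",
            "http://test.blogs.com/between_the_lines/archives/2010/",
            "http://test.blogs.com/between_the_lines/",
            "http://other.example.com/between_the_lines/page"
        };
        for (String c : candidates) {
            System.out.println(inScope(c, PREFIXES) + "  " + c);
        }
    }
}
```

Under this model the prefix ending in "archive" also covers ".../archives...", while the directory root http://test.blogs.com/between_the_lines/ itself matches neither prefix, which may be why the example lists the root separately in its regex rule.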
----------------------------------------
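In my tests, the crawl scope restricted under surtsSource governed which pages were parsed, but the crawler still reached external links (this needs further investigation).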
The concrete steps are as follows:
1. In the package org.archive.crawler.frontier, create a new class named ELFHashQueueAssignmentPolicy. Note that it must extend QueueAssignmentPolicy.
2. Write the following code in that class:
package org.archive.crawler.frontier;

import java.util.logging.Logger;

// These imports assume the Heritrix 1.x package layout, which the
// CrawlController/CandidateURI signature below suggests.
import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.framework.CrawlController;

public class ELFHashQueueAssignmentPolicy extends QueueAssignmentPolicy {
    private static final Logger logger = Logger
            .getLogger(ELFHashQueueAssignmentPolicy.class.getName());

    // The queue key is the ELF hash of the full URI modulo 100, i.e. "0" to "99".
    public String getClassKey(CrawlController controller,
            CandidateURI cauri) {
        String uri = cauri.getUURI().toString();
        long hash = ELFHash(uri);
        String a = Long.toString(hash % 100);
        return a;
    }

    // ELF string hash; the returned value is always non-negative.
    public long ELFHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return (hash & 0x7FFFFFFF);
    }
}
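Since the class above only compiles against Heritrix, the hashing step can be tried out in isolation with the standalone sketch below. It copies the same ELF hash and the same modulo-100 bucketing; the class name ElfHashDemo, the bucket helper and the example.com URLs are made up for the demonstration.

```java
/** Standalone demo of the ELF hash bucketing; no Heritrix dependency. */
final class ElfHashDemo {

    /** Same ELF hash as in the policy class above. */
    static long elfHash(String str) {
        long hash = 0;
        long x;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return hash & 0x7FFFFFFF;
    }

    /** Maps a URI string to a bucket index in the range [0, buckets). */
    static int bucket(String uri, int buckets) {
        return (int) (elfHash(uri) % buckets);
    }

    public static void main(String[] args) {
        int buckets = 100;
        int[] counts = new int[buckets];

        // Hypothetical URLs, all on the same host.
        for (int i = 0; i < 1000; i++) {
            counts[bucket("http://www.example.com/article/" + i + ".html", buckets)]++;
        }

        // Count how many of the 100 buckets received at least one URL.
        int used = 0;
        for (int c : counts) {
            if (c > 0) {
                used++;
            }
        }
        System.out.println("buckets used: " + used + " of " + buckets);
    }
}
```

Because the key is derived from the whole URI rather than from the hostname, URIs from a single host end up spread across many queues instead of sharing one.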