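Two Crawler-Based Projects: Kiwix and ArchiveBox
In my earlier post "On the Yesterday, Today, and Tomorrow of Web Crawlers", I argued that crawling technology was a cornerstone of the early internet and remains an important part of internet technology today. In the future, PC services and mobile services will diverge in function: mobile will focus on features tied to ordinary people's daily lives (shopping, social networking, entertainment, and so on), while the PC will return to its essence as a tool.
Recently, while reading forum posts, I came across two open-source projects built on crawlers. They still carry traces of the PC era, but I think both are meaningful because, to some extent, they focus on the PC as a tool. There are also very few Chinese-language introductions to them, so I would like to share them here.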
1. Kiwix
Kiwix's official tagline is: "Store Wikipedia or any website on your mobile phone or computer, easily."
The project was originally built for offline access to Wikipedia and was later extended to other mainstream sites, such as Project Gutenberg, Stack Exchange, YouTube, and TED Talks.
The project's core technical idea is simple and direct:
Server: periodically crawls the latest content of a site and converts it into a specific format (zimfile)
→ User, while online: downloads the offline content
→ User, while offline: can still view the offline content
The zimfile-related code is written in C++, while the crawler part is implemented in Python.
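The crawl, package, and read-offline flow can be sketched in a few lines. This is only an illustration under stated assumptions: Python's standard `zipfile` module stands in for the zimfile format (it is not the real format), the `pages` dictionary is a hypothetical crawl result, and the function names `pack_pages` and `read_page` are made up for this example.

```python
import io
import zipfile


def pack_pages(pages: dict[str, str]) -> bytes:
    """Pack already-crawled pages (path -> HTML) into one compressed archive.

    A zip archive is used here purely as a stand-in for the zimfile format.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path, html in pages.items():
            archive.writestr(path, html)  # str data is stored as UTF-8
    return buffer.getvalue()


def read_page(archive_bytes: bytes, path: str) -> str:
    """Read one page back out of the archive; no network access is needed."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.read(path).decode("utf-8")


# Hypothetical crawl result; a real crawler would fetch these over HTTP.
pages = {
    "index.html": "<h1>Home</h1>",
    "wiki/Kiwi.html": "<h1>Kiwi</h1><p>A flightless bird.</p>",
}

blob = pack_pages(pages)                  # server side: crawl output -> one file
print(read_page(blob, "wiki/Kiwi.html"))  # client side: offline lookup
```

The point of the single-file design is that the whole site can be copied, downloaded, or handed over as one object and then read without any server.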
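It looks simple, but there are many technical difficulties. For example, searching Wikipedia online returns results quickly thanks to tools such as ES (Elasticsearch), but we can hardly install ES on every user's computer: it would be slow, and the data would take up a lot of space. If you look through Kiwix's repositories on GitHub, you will find a large number of related projects; this is a genuinely large software project!
As a programmer, my first reaction was that this is a godsend for the many developers who can only work offline. (Some projects must be developed in closed environments for security reasons and cannot connect to the internet!)
The project's actual goal, however, is far more ambitious: to make it easy to spread knowledge and culture to the parts of the world that have no internet access.
No internet access? That is no joke. The figure below shows the share of the population with internet access in each part of the world in 2017: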
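To make the search problem concrete, here is a minimal sketch of the kind of lightweight, in-process index an offline reader could ship instead of a search server. It is not Kiwix's actual implementation; the function names and the three sample documents are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Build an inverted index: token -> set of titles containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for title, body in docs.items():
        for token in tokenize(title + " " + body):
            index[token].add(title)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Return the titles that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(hits)


# Hypothetical offline articles.
docs = {
    "Kiwi": "flightless bird native to New Zealand",
    "Emu": "large flightless bird native to Australia",
    "Zealandia": "submerged continent around New Zealand",
}

index = build_index(docs)
print(search(index, "flightless bird"))  # ['Emu', 'Kiwi']
print(search(index, "new zealand"))      # ['Kiwi', 'Zealandia']
```

Because the index is built once on the server and is just a lookup table, it can be stored next to the content and queried on the user's device without running a separate search service.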
 
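In Africa, many places have no internet access because the infrastructure is missing; in some other regions, the network is controlled for political reasons; and in still others, high prices keep the general public from accessing knowledge.
With Kiwix, ideas can be spread with nothing more than a USB stick, and I think that deserves a thumbs-up.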
2. ArchiveBox
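ArchiveBox is a tool for taking point-in-time snapshots of web pages and sites, which makes it similar in spirit to Kiwix. It is more general-purpose and lightweight, though: it can turn any site you want into a static copy, including text, images, PDFs, and even video.
Technically, ArchiveBox draws on many more kinds of tools than Kiwix, including wget, headless Chrome, youtube-dl, pywb, and readability, but these are all commonly used in crawling, so it does not seem complicated.
After trying the official Docker image, I found its crawling features fairly complete, and I may dig into it further when I have time (I'm not sure why it takes so long to process a simple page...).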
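As an illustration of how one of these tools can produce a static copy of a page, the sketch below builds a wget command line from Python. It does not show how ArchiveBox itself invokes wget; the function names, the 60-second timeout, and the output directory are assumptions chosen for this example, and `snapshot` requires wget to be installed.

```python
import subprocess
from pathlib import Path


def build_wget_command(url: str, out_dir: str) -> list[str]:
    """Build a wget command that saves one page plus what it needs to render."""
    return [
        "wget",
        "--page-requisites",   # also fetch the images, CSS, and scripts the page uses
        "--convert-links",     # rewrite links so they work from the local copy
        "--adjust-extension",  # save HTML documents with an .html suffix
        "--no-parent",         # never ascend above the starting directory
        "--timeout=60",        # assumed value; tune for slow sites
        f"--directory-prefix={out_dir}",
        url,
    ]


def snapshot(url: str, out_dir: str) -> int:
    """Run wget for one URL and return its exit code (0 means success)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    return subprocess.run(build_wget_command(url, out_dir), check=False).returncode


# Show the command without running it; call snapshot(...) to actually fetch.
print(" ".join(build_wget_command("https://example.com/", "snapshots/example")))
```

Pages that build their content with JavaScript are not fully captured this way, which is presumably why a headless browser sits alongside wget in the tool list.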
A screenshot of the software is shown below:
In the screenshot:
- Example Domain: a sample page; it renders well.
 
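- Bilibili: "劉備斗舞謝廣坤" (a Liu Bei vs. Xie Guangkun dance-battle video). The page's images and video are not processed correctly.
- The title is recognized incorrectly (because the software reads the title from the page's meta information), and there is no way to edit it.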
 
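- CSDN: "How to embed a 'non-downloadable' PDF document in a page". The page content is preserved fairly well.
- The title has the same problem as the Bilibili page.
- A few seconds after opening, the snapshot redirects to the CSDN home page, which suggests the JavaScript was not handled properly.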
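The title problem noted above is easy to reproduce in miniature. The sketch below uses Python's standard `html.parser` to read both a meta title and the `<title>` element from a page. It assumes, purely for illustration, that the meta title lives in an `og:title` property; the post does not say which meta field the software actually reads, and the sample HTML is invented.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the og:title meta value and the <title> text from a page."""

    def __init__(self) -> None:
        super().__init__()
        self.meta_title = None
        self.page_title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attributes.get("property") == "og:title":
            self.meta_title = attributes.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.page_title += data


def extract_titles(html: str):
    """Return (meta title or None, <title> text) for an HTML document."""
    collector = TitleCollector()
    collector.feed(html)
    collector.close()
    return collector.meta_title, collector.page_title.strip()


# Invented page: the meta title names the site, not the individual post.
sample = """<html><head>
<meta property="og:title" content="Example Video Site">
<title>Dance battle clip - Example Video Site</title>
</head><body></body></html>"""

meta_title, page_title = extract_titles(sample)
print(meta_title)  # Example Video Site
print(page_title)  # Dance battle clip - Example Video Site
```

When the two values differ, a tool that trusts only one source can end up labeling every snapshot from a site with the same generic title, so an option to edit the title by hand would help.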
 
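Compared with an ordinary browser bookmark, saving a site's state at the moment you viewed it copes well with special cases such as a post being deleted or even the whole site shutting down. Overall, though, the software has quite a few bugs and feels half-finished.
3. Summary
Both projects introduced here are meaningful projects built on crawler technology. If I come across other eye-catching projects in the future, I will keep sharing them.
Further reading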
- How to Fit All Human Knowledge in a Box
- An Introduction to Kiwix, an Offline Wiki Reader (in Chinese)
