A Summary of Essential Knowledge for Information Retrieval and DM: Lucene
Original post: http://blog.csdn.net/htw2012/article/details/17734529
Lightly edited; if anything is unclear, please refer to the original author.
Part 1: The information retrieval field
A summary of the models and techniques commonly used in papers in the information retrieval and web data fields (WWW, SIGIR, CIKM, WSDM, ACL, EMNLP, etc.). (Why is probability reliable? It hides most of the facts, yet gives us the part we can see.)
Preamble: for a PhD student in this area, being able to read the papers is the entry point for understanding what everyone is working on, and the usual route is to read a textbook. Reading a book is certainly worthwhile, but it has one big drawback: a book is a self-contained system and therefore covers a great deal, much of which you read but never actually use. That is not exactly a waste, but it does mean your limited effort is not spent where it matters most.
My own area is the processing of web data (international conferences such as WWW, SIGIR, CIKM, WSDM, ACL, EMNLP).
Below is a list of the models and techniques that, in my view, come up again and again in our field; I hope it saves you some time.

1. Basic probability theory
The concepts used most often: the three axioms defining elementary probability, the law of total probability, Bayes' rule, the chain rule, and the common probability distributions (Dirichlet, Gaussian, multinomial, Poisson).
Probability theory covers a lot of ground, but in practice these few concepts are what you mainly need. Measure-theoretic advanced probability essentially never appears in the papers at the major conferences (WWW, SIGIR, and so on).
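For quick reference, a sketch of the identities named above, in generic notation:

```latex
% Law of total probability, for a partition B_1, \dots, B_n of the sample space
P(A) = \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)

% Bayes' rule
P(B_k \mid A) = \frac{P(A \mid B_k)\, P(B_k)}{\sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)}

% Chain rule
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \dots, X_{i-1})
```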
2. Basics of information theory
The concepts used most often: entropy, conditional entropy, KL divergence and the relationships among the three, the maximum entropy principle, and information gain.
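A compact way to remember how these quantities relate (for discrete X and Y; information gain here is the mutual information I(X;Y)):

```latex
H(X) = -\sum_{x} p(x) \log p(x)
H(X \mid Y) = -\sum_{x,y} p(x,y) \log p(x \mid y)
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}

% Information gain / mutual information ties all three together:
I(X;Y) = H(X) - H(X \mid Y) = D_{\mathrm{KL}}\bigl(p(x,y) \,\|\, p(x)\,p(y)\bigr)
```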
3. Classification
Naive Bayes, KNN, support vector machines, the maximum entropy model, and decision trees: know their basic principles and their strengths and weaknesses, and know the commonly used software packages.
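As one concrete example from this list, the naive Bayes decision rule, which follows from Bayes' rule plus the assumption that the features x_1, ..., x_d are conditionally independent given the class y:

```latex
\hat{y} = \arg\max_{y} \; P(y) \prod_{j=1}^{d} P(x_j \mid y)
```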
4. Clustering
The K-means algorithm for non-hierarchical clustering; the types of hierarchical clustering and the differences between them, including the ways of measuring inter-cluster distance (e.g., the difference between single and complete linkage); know the commonly used software packages.
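The single vs. complete linkage distinction mentioned above is simply a choice of inter-cluster distance:

```latex
d_{\text{single}}(A, B)   = \min_{a \in A,\, b \in B} d(a, b)
d_{\text{complete}}(A, B) = \max_{a \in A,\, b \in B} d(a, b)
```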
5. The EM algorithm
Understand why inference with incomplete data is difficult, and understand the principle behind EM and its derivation.
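The two alternating steps of EM, for observed data X, latent variables Z, and parameters theta:

```latex
% E-step: expected complete-data log-likelihood under the current posterior
Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \sim p(Z \mid X, \theta^{(t)})}\bigl[\log p(X, Z \mid \theta)\bigr]

% M-step: re-estimate the parameters
\theta^{(t+1)} = \arg\max_{\theta} \; Q(\theta \mid \theta^{(t)})
```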
6. Monte Carlo methods (especially Gibbs sampling)
Know the basic principles of Monte Carlo methods and, in particular, understand the Gibbs sampling procedure; also know Markov processes and Markov chains.
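One Gibbs sweep for a target distribution p(x_1, ..., x_n): each variable is resampled from its conditional given the current values of all the others, and the resulting sequence is a Markov chain whose stationary distribution is the target:

```latex
x_i^{(t+1)} \sim p\bigl(x_i \mid x_1^{(t+1)}, \dots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \dots, x_n^{(t)}\bigr),
\qquad i = 1, \dots, n
```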
7. Graphical models
Graphical models have been extremely popular in recent years, and they are important because they subsume much earlier work while having an intuitive interpretation. CRFs, HMMs, and topic models are all applications or special cases of graphical models.
a. Know the general representations (directed and undirected graphical models) and the generic learning and inference algorithms, such as the sum-product algorithm and message passing;
b. Be familiar with the HMM, including its assumptions and the forward and backward algorithms (a forward-recursion sketch follows this list);
c. Be familiar with LDA, including its graphical-model representation and its Gibbs sampling inference; variational inference is not required;
d. Know the CRF, mainly its graphical-model representation; if you have the time and interest, look into its inference algorithms as well;
e. Understand the connections and differences among HMMs, LDA, CRFs, and the general graphical-model representation, and among the generic learning and inference algorithms;
f. Know about Markov logic networks (MLN), a language built on top of graphical models and first-order logic that can describe many real-world problems; even a preliminary acquaintance helps in understanding graphical models.
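The forward recursion for the HMM item above, in the usual notation (transition probabilities a_ij, emission probabilities b_j(o_t), initial distribution pi):

```latex
\alpha_1(i) = \pi_i \, b_i(o_1)
\alpha_{t+1}(j) = \Bigl[\sum_{i} \alpha_t(i)\, a_{ij}\Bigr] b_j(o_{t+1})
P(O \mid \lambda) = \sum_{i} \alpha_T(i)
```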
8. Topic models
The ideas behind topic models are applied very widely. There is neither the need nor the time to read everything; I recommend the following:
a. Understand pLSA and LDA in depth, along with the connections and differences between them. Once these two models are understood, most topic-model papers become readable, especially the topic models applied to NLP, and you can also design the non-hierarchical topic models you need yourself.
b. If you want to go deeper, move on to the hLDA model, and in particular the Dirichlet process behind it; then you can design hierarchical topic models yourself;
c. For supervised topic models, be sure to understand s-LDA and LLDA. The two embody completely different design philosophies; study them carefully and then design the topic models you need;
d. For understanding all of these models, Gibbs sampling is a hurdle you cannot avoid (the collapsed update for LDA is sketched after this list).
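The collapsed Gibbs update for LDA referred to in item d, with symmetric priors alpha and beta and vocabulary size V; n_{d,k} counts tokens in document d assigned to topic k, n_{k,w} counts assignments of word w to topic k, n_k sums the latter over words, and the superscript -i excludes the current token:

```latex
p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\bigl(n_{d_i,k}^{-i} + \alpha\bigr) \cdot
\frac{n_{k,w_i}^{-i} + \beta}{n_{k}^{-i} + V\beta}
```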
9. Optimization and stochastic processes
a. Understand optimization problems with equality constraints and how to solve them with Lagrange multipliers (sketched after this list);
b. Understand convex optimization problems with inequality constraints, and understand the simplex method;
c. Understand gradient descent and simulated annealing;
d. Understand the ideas behind hill climbing and similar optimization heuristics;
e. For stochastic processes, know the basics such as random walks and queueing theory (these appear occasionally in papers, but not often), and understand Markov processes (very important; they are used all the time in sampling theory).
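For item a, the Lagrangian construction for an equality-constrained problem, minimize f(x) subject to g_i(x) = 0:

```latex
\mathcal{L}(x, \lambda) = f(x) + \sum_{i} \lambda_i \, g_i(x)

% A stationary point satisfies both conditions simultaneously:
\nabla_x \mathcal{L}(x, \lambda) = 0, \qquad g_i(x) = 0 \;\; \text{for all } i
```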
10. Bayesian learning
More and more methods and models handle data from the Bayesian point of view, so knowing the relevant material is essential.
a. Understand the differences and connections, in both philosophy and principle, between the Bayesian and frequentist schools;
b. Understand loss functions and their role in Bayesian learning; memorize the commonly used loss functions;
c. Understand the concept of a Bayesian prior and the four commonly used ways of choosing one (a conjugate-prior example follows this list);
d. Understand the concepts of parameters and hyperparameters, and the difference between them;
e. Use the choice of priors in LDA (or in another model) to understand the Bayesian way of handling data.
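A small Beta-Bernoulli example for items c and d: the hyperparameters a and b index the prior over the parameter theta, and conjugacy is one standard way of choosing a prior because the posterior stays in the same family:

```latex
\theta \sim \mathrm{Beta}(a, b), \qquad x_1, \dots, x_n \mid \theta \sim \mathrm{Bernoulli}(\theta)

\theta \mid x_1, \dots, x_n \;\sim\; \mathrm{Beta}\Bigl(a + \textstyle\sum_i x_i,\;\; b + n - \textstyle\sum_i x_i\Bigr)
```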
11. Information retrieval models and tools
a. Understand the commonly used retrieval models;
b. Know the commonly used open-source toolkits (Lemur, Lucene, etc.).

12. Model selection and feature selection
a. Understand the common feature selection methods, so that you can pick effective features for training a model;
b. Work through a few examples of model selection to get a feel for how to choose an appropriate model (this really can only be absorbed through examples).

13. Tricks for paper writing
There are plenty of such tricks; they are omitted here.
Part 2: Speeding up Lucene search

Here are some things to try to speed up the searching speed of your Lucene application. Please see ImproveIndexingSpeed for how to speed up indexing.
Be sure you really need to speed things up. Many of the ideas here are simple to try, but others will necessarily add some complexity to your application. So be sure your searching speed is indeed too slow and the slowness is indeed within Lucene.
Make sure you are using the latest version of Lucene.
Use a local filesystem. Remote filesystems are typically quite a bit slower for searching. If the index must be remote, try to mount the remote filesystem as a "readonly" mount. In some cases this could improve performance.
Get faster hardware, especially a faster IO system. Flash-based solid state drives work very well for Lucene searches. As seek times for SSDs are about 100 times faster than for traditional platter-based hard drives, the usual penalty for seeking is virtually eliminated. This means that SSD-equipped machines need less RAM for file caching and that searchers require less warm-up time before they respond quickly.
Tune the OS
One tunable that stands out on Linux is swappiness (http://kerneltrap.org/node/3000), which controls how aggressively the OS will swap out RAM used by processes in favor of the IO Cache. Most Linux distros default this to a highish number (meaning, aggressive) but this can easily cause horrible search latency, especially if you are searching a large index with a low query rate. Experiment by turning swappiness down or off entirely (by setting it to 0). Windows also has a checkbox, under My Computer -> Properties -> Advanced -> Performance Settings -> Advanced -> Memory Usage, that lets you favor Programs or System Cache, that's likely doing something similar.
Open the IndexReader with readOnly=true. This makes a big difference when multiple threads are sharing the same reader, as it removes certain sources of thread contention.
On non-Windows platforms, use NIOFSDirectory instead of FSDirectory.
This also removes sources of contention when accessing the underlying files. Unfortunately, due to a longstanding bug on Windows in Sun's JRE (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734 -- feel particularly free to go vote for it), NIOFSDirectory gets poor performance on Windows.
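A minimal sketch combining this tip with the read-only-reader tip above, assuming a Lucene 3.x-era API (IndexReader.open with a readOnly flag, and the NIOFSDirectory(File) constructor); the index path is a placeholder:

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.NIOFSDirectory;

public class OpenReadOnlySearcher {
    public static void main(String[] args) throws Exception {
        // NIOFSDirectory avoids file-access contention on non-Windows platforms.
        Directory dir = new NIOFSDirectory(new File("/path/to/index"));
        // readOnly=true removes sources of thread contention when the reader is shared.
        IndexReader reader = IndexReader.open(dir, true);
        IndexSearcher searcher = new IndexSearcher(reader);
        System.out.println("Documents in index: " + reader.numDocs());
        // ... run queries against the shared searcher ...
        searcher.close();
        reader.close();
    }
}
```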
Add RAM to your hardware and/or increase the heap size for the JVM. For a large index, searching can use a lot of RAM. If you don't have enough RAM, or your JVM is not running with a large enough heap size, the JVM can hit swapping and thrashing, at which point everything will run slowly.
Use one instance of IndexSearcher.
Share a single IndexSearcher across queries and across threads in your application.
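One way to do this, sketched under the same Lucene 3.x-era assumptions: a single application-wide holder that all query threads read from (IndexSearcher is thread-safe for concurrent searches).

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

// Holds one IndexSearcher for the whole application; every query thread calls get().
// Sketch only: a real application also needs a refresh policy (see the re-open tip below).
public final class SearcherHolder {
    private static volatile IndexSearcher searcher;

    public static void init(Directory dir) throws Exception {
        searcher = new IndexSearcher(IndexReader.open(dir, true)); // shared, read-only reader
    }

    public static IndexSearcher get() {
        return searcher;
    }
}
```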
When measuring performance, disregard the first query.
The first query to a searcher pays the price of initializing caches (especially when sorting by fields) and thus will skew your results (assuming you re-use the searcher for many queries). On the other hand, if you re-run the same query again and again, results won't be realistic either, because the operating system will use its cache to speed up IO operations. On Linux (kernel 2.6.16 and later) you can clean the disk cache using sync; echo 3 > /proc/sys/vm/drop_caches. See http://linux-mm.org/Drop_Caches for details.
Re-open the IndexSearcher only when necessary.
You must re-open the IndexSearcher in order to make newly committed changes visible to searching. However, re-opening the searcher has a certain overhead (noticeable mostly with large indexes and with sorting turned on) and should thus be minimized. Consider using a so-called warming technique, which allows the searcher to warm up its caches before the first query hits.
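A sketch of re-opening plus warming, again assuming the Lucene 3.x-era IndexReader.reopen(); the warming query's field names ("body", "date") are made up, and the hand-off is simplified — real code must make sure no in-flight searches still use the old reader before closing it.

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;

public class SearcherRefresher {
    private IndexSearcher current;

    public SearcherRefresher(IndexSearcher initial) {
        this.current = initial;
    }

    public synchronized IndexSearcher maybeRefresh() throws Exception {
        IndexReader oldReader = current.getIndexReader();
        IndexReader newReader = oldReader.reopen(); // same reader returned if nothing was committed
        if (newReader != oldReader) {
            IndexSearcher warmed = new IndexSearcher(newReader);
            // Warming: run a representative sorted query so field caches are filled
            // before real traffic hits the new searcher.
            warmed.search(new TermQuery(new Term("body", "lucene")), null, 10,
                          new Sort(new SortField("date", SortField.LONG)));
            IndexSearcher old = current;
            current = warmed;
            old.close();       // does not close the reader when constructed from an IndexReader
            oldReader.close(); // simplified; see the note above about in-flight searches
        }
        return current;
    }
}
```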
Decrease mergeFactor. Smaller mergeFactors mean fewer segments and searching will be faster. However, this will slow down indexing speed, so you should test values to strike an appropriate balance for your application.
Limit usage of stored fields and term vectors. Retrieving these from the index is quite costly. Typically you should only retrieve these for the current "page" the user will see, not for all documents in the full result set. For each document retrieved, Lucene must seek to a different location in various files. Try sorting the documents you need to retrieve by docID order first.
Use FieldSelector to carefully pick which fields are loaded, and how they are loaded, when you retrieve a document.
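A sketch under the same Lucene 3.x-era assumptions (FieldSelector/MapFieldSelector and IndexReader.document(int, FieldSelector)); the field names "title" and "url" are hypothetical.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.MapFieldSelector;
import org.apache.lucene.index.IndexReader;

public class LoadOnlyDisplayFields {
    // Load just the fields needed to render a hit, skipping large stored fields such as the body.
    public static String title(IndexReader reader, int docId) throws Exception {
        FieldSelector selector = new MapFieldSelector(new String[] { "title", "url" });
        Document doc = reader.document(docId, selector);
        return doc.get("title");
    }
}
```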
Don't iterate over more hits than needed.
Iterating over all hits is slow for two reasons. Firstly, the search() method that returns a Hits object re-executes the search internally when you need more than 100 hits. Solution: use the search method that takes a HitCollector instead. Secondly, the hits will probably be spread over the disk so accessing them all requires much I/O activity. This cannot easily be avoided unless the index is small enough to be loaded into RAM. If you don't need the complete documents but only one (small) field you could also use the FieldCache class to cache that one field and have fast access to it.
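A sketch of fetching only the page you will show; it uses the TopDocs-returning search(Query, int) variant rather than a HitCollector, and the field name is hypothetical.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class FirstPageOnly {
    // Ask Lucene for exactly the ten hits to be displayed instead of iterating all hits.
    public static void printFirstPage(IndexSearcher searcher) throws Exception {
        TopDocs top = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
        System.out.println("Total matches: " + top.totalHits);
        for (ScoreDoc sd : top.scoreDocs) {
            System.out.println("doc=" + sd.doc + " score=" + sd.score);
        }
    }
}
```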
When using fuzzy queries, use a minimum prefix length.
Fuzzy queries perform CPU-intensive string comparisons - avoid comparing all unique terms with the user input by only examining terms starting with the first "N" characters. This prefix length is a property on both QueryParser and FuzzyQuery - default is zero so ALL terms are compared.
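A sketch assuming the Lucene 3.x-era FuzzyQuery constructor that takes a minimum similarity and a prefix length; with QueryParser, the corresponding knob is setFuzzyPrefixLength.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;

public class FuzzyWithPrefix {
    // Require the first two characters to match exactly, so only terms sharing
    // that prefix are compared instead of every unique term in the field.
    public static FuzzyQuery build(String field, String text) {
        float minimumSimilarity = 0.5f;
        int prefixLength = 2;
        return new FuzzyQuery(new Term(field, text), minimumSimilarity, prefixLength);
    }
}
```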
Consider using filters. It can be much more efficient to restrict results to a part of the index using a cached bit set filter rather than using a query clause. This is especially true for restrictions that match a great number of documents of a large index. Filters are typically used to restrict the results to a category but could in many cases be used to replace any query clause. One difference between using a Query and a Filter is that the Query has an impact on the score while a Filter does not.
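A sketch of a cached category restriction under the same assumptions; the "category" field and its value are invented.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class CategoryRestrictedSearch {
    // Built once and reused: the cached bit set replaces a query clause for the restriction.
    private static final Filter CATEGORY_FILTER = new CachingWrapperFilter(
            new QueryWrapperFilter(new TermQuery(new Term("category", "news"))));

    public static TopDocs search(IndexSearcher searcher, Query query) throws Exception {
        // The filter narrows the candidate set but, unlike a query clause, does not affect scoring.
        return searcher.search(query, CATEGORY_FILTER, 10);
    }
}
```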
Find the bottleneck.
Complex query analysis or heavy post-processing of results are examples of hidden bottlenecks for searches. Profiling with a tool such as VisualVM helps locate the problem.