Latent Semantic Analysis (LSA) Tutorial
Reposted from: http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html
LSA, also known as Latent Semantic Indexing (LSI), can be used to analyze the hidden meaning, or concepts, behind documents.
If each word expressed only one concept, and each concept were described by only one word, then LSI would be easy: there would simply be a one-to-one mapping between words and concepts.
Unfortunately, in practice the relationship between words and concepts is not one-to-one but many-to-many.
How Does LSI Work?
LSI grew out of the following problem: how to use search words to find relevant documents. When we look for relevant documents by comparing words, what we really want to compare is the meaning behind the words, not merely their surface form. LSI solves this by mapping both words and documents into a common "concept" space and doing the comparison in that space.
Because authors have a wide choice of words when they write, different authors may choose different words for the same concept, which can leave the concepts blurred. This somewhat arbitrary choice of words introduces noise into the concept-word relationship. LSI filters out some of that noise and also tries to find the smallest set of concepts that spans all of the documents.
To make the problem tractable, LSI makes the following simplifications:
1. Documents are represented as "bags of words": the order of the words in a document does not matter, only how many times each word appears (a minimal bag-of-words sketch follows this list).
2. A concept is represented as a pattern of words that tend to appear together in documents. For example, "leash", "treat", and "obey" might usually appear in documents about dog training.
3. Each word is assumed to have only one meaning.
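To make the bag-of-words representation concrete, here is a minimal sketch (not part of the original article's code) that counts the words in one of the titles; stop-word filtering is handled later by the tutorial's LSA class.

```python
# Minimal bag-of-words sketch: word order is discarded, only counts remain.
from collections import Counter

title = "Stock Investing For Dummies"
bag = Counter(word.lower() for word in title.split())
print(bag)  # Counter({'stock': 1, 'investing': 1, 'for': 1, 'dummies': 1})
```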
A Small Example
For this example, I searched Amazon.com for books on "investing" and took the top 10 results as test data. One of the titles was dropped because it shared only one index word with the other titles, leaving 9 titles. An index word is a word that:
1. appears in two or more book titles, and
2. is not a stop word such as "and" or "the".
In this example, the following stop words are removed: "and", "edition", "for", "in", "little", "of", "the", "to".
Here are the 9 remaining titles (in the original article the index words are underlined):
1. The Neatest Little Guide to Stock Market Investing
2. Investing For Dummies, 4th Edition
3. The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns
4. The Little Book of Value Investing
5. Value Investing: From Graham to Buffett and Beyond
6. Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!
7. Investing in Real Estate, 5th Edition
8. Stock Investing For Dummies
9. Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss
After running LSI on these titles, we can plot the index words and the titles in an XY plane and pick out clusters. The 9 titles are plotted as blue circles and the 11 index words as red squares. Not only can we draw the clusters of titles, we can also label them, because the index words are plotted alongside the titles. In the figure from the original article, one cluster, containing titles T7 and T9, is about real estate; another, containing T2, T4, T5, and T8, is about value investing; a third, containing T1 and T3, is about the stock market; and title T6 is an outlier.
The next sections walk through the steps of LSI.
Part 1 - Create the Count Matrix
The first step is to create a word-by-title matrix: each index word is a row and each title is a column. Each cell holds the number of times the corresponding word appears in the corresponding title. In general this matrix is large but very sparse, with most cells equal to 0; zeros are left blank in the table below.
| Index Words | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 |
|-------------|----|----|----|----|----|----|----|----|----|
| book        |    |    | 1  | 1  |    |    |    |    |    |
| dads        |    |    |    |    |    | 1  |    |    | 1  |
| dummies     |    | 1  |    |    |    |    |    | 1  |    |
| estate      |    |    |    |    |    |    | 1  |    | 1  |
| guide       | 1  |    |    |    |    | 1  |    |    |    |
| investing   | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  |
| market      | 1  |    | 1  |    |    |    |    |    |    |
| real        |    |    |    |    |    |    | 1  |    | 1  |
| rich        |    |    |    |    |    | 2  |    |    | 1  |
| stock       | 1  |    | 1  |    |    |    |    | 1  |    |
| value       |    |    |    | 1  | 1  |    |    |    |    |
Python Implementation and Walkthrough
Python - Getting Started
Download the python code here.
Throughout this article, we'll give Python code that implements all the steps necessary for doing Latent Semantic Analysis. We'll go through the code section by section and explain everything. The Python code used in this article can be downloaded here and then run in Python. You need to have already installed the Python NumPy and SciPy libraries.
Python - Import Functions
First we need to import a few functions from Python libraries to handle some of the math we need to do. NumPy is the Python numerical library, and we'll import zeros, a function that creates a matrix of zeros that we use when building our words by titles matrix. From the linear algebra part of the scientific package (scipy.linalg) we import the svd function that actually does the singular value decomposition, which is the heart of LSA.
from numpy import zeros
from scipy.linalg import svd
Python - Define Data
Next, we define the data that we are using. Titles holds the 9 book titles that we have gathered, stopwords holds the 8 common words that we are going to ignore when we count the words in each title, and ignorechars has all the punctuation characters that we will remove from words. We use Python's triple quoted strings, so there are actually only 4 punctuation symbols we are removing: comma (,), colon (:), apostrophe ('), and exclamation point (!).
titles =["The Neatest Little Guide to Stock Market Investing",
"Investing For Dummies, 4th Edition",
"The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns",
"The Little Book of Value Investing",
"Value Investing: From Graham to Buffett and Beyond",
"Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
"Investing in Real Estate, 5th Edition",
"Stock Investing For Dummies",
"Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss"
]
stopwords = ['and','edition','for','in','little','of','the','to']
ignorechars = ''',:'!'''
Python - Define LSA Class
The LSA class has methods for initialization, parsing documents, building the matrix of word counts, and calculating the SVD. The first method is the __init__ method, which is called whenever an instance of the LSA class is created. It stores the stopwords and ignorechars so they can be used later, and then initializes the word dictionary and the document count variables.
class LSA(object):
    def __init__(self, stopwords, ignorechars):
        self.stopwords = stopwords
        self.ignorechars = ignorechars
        self.wdict = {}
        self.dcount = 0
Python - Parse Documents
The parse method takes a document, splits it into words, removes the ignored characters and turns everything into lowercase so the words can be compared to the stop words. If the word is a stop word, it is ignored and we move on to the next word. If it is not a stop word, we put the word in the dictionary, and also append the current document number to keep track of which documents the word appears in.
The documents that each word appears in are kept in a list associated with that word in the dictionary. For example, since the word book appears in the third and fourth titles, we would have self.wdict['book'] = [2, 3] after all titles are parsed (document numbering starts at 0).
After processing all words from the current document, we increase the document count in preparation for the next document to be parsed.
    def parse(self, doc):
        words = doc.split()
        for w in words:
            w = w.lower().translate(None, self.ignorechars)  # Python 2 str.translate signature
            if w in self.stopwords:
                continue
            elif w in self.wdict:
                self.wdict[w].append(self.dcount)
            else:
                self.wdict[w] = [self.dcount]
        self.dcount += 1
Python - Build the Count Matrix
Once all documents are parsed, all the words (dictionary keys) that are in more than 1 document are extracted and sorted, and a matrix is built with the number of rows equal to the number of words (keys), and the number of columns equal to the document count. Finally, for each word (key) and document pair the corresponding matrix cell is incremented.
    def build(self):
        self.keys = [k for k in self.wdict.keys() if len(self.wdict[k]) > 1]
        self.keys.sort()
        self.A = zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i,d] += 1
Python - Print the Count Matrix
The printA() method is very simple, it just prints out the matrix that we have built so it can be checked.
    def printA(self):
        print self.A

Python - Test the LSA Class
After defining the LSA class, it's time to try it out on our 9 book titles. First we create an instance of LSA, called mylsa, and pass it the stopwords and ignorechars that we defined. During creation, the __init__ method is called which stores the stopwords and ignorechars and initializes the word dictionary and document count.
Next, we call the parse method on each title. This method extracts the words in each title, strips out punctuation characters, converts each word to lower case, throws out stop words, and stores remaining words in a dictionary along with what title number they came from.
Finally we call the build() method to create the matrix of word by title counts. This extracts all the words we have seen so far, throws out words that occur in less than 2 titles, sorts them, builds a zero matrix of the right size, and then increments the proper cell whenever a word appears in a title.
mylsa = LSA(stopwords, ignorechars)
for t in titles:
    mylsa.parse(t)
mylsa.build()
mylsa.printA()
Here is the raw output produced by printA(). As you can see, it's the same as the matrix that we showed earlier.
[[ 0. 0. 1. 1. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 1. 0. 0. 1.]
[ 0. 1. 0. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 0. 0. 1. 0. 1.]
[ 1. 0. 0. 0. 0. 1. 0. 0. 0.]
[ 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[ 1. 0. 1. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 1. 0. 1.]
[ 0. 0. 0. 0. 0. 2. 0. 0. 1.]
[ 1. 0. 1. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 0. 1. 1. 0. 0. 0. 0.]]
Part 2 - Modify the Counts with TFIDF
In sophisticated Latent Semantic Analysis systems, the raw matrix counts are usually modified so that rare words are weighted more heavily than common words. For example, a word that occurs in only 5% of the documents should probably be weighted more heavily than a word that occurs in 90% of the documents. The most popular weighting is TFIDF (Term Frequency - Inverse Document Frequency). Under this method, the count in each cell is replaced by the following formula.
$$\mathrm{TFIDF}_{i,j} = \left( \frac{N_{i,j}}{N_{*,j}} \right) \cdot \log\left( \frac{D}{D_i} \right)$$

where

$N_{i,j}$ = the number of times word $i$ appears in document $j$ (the original cell count).
$N_{*,j}$ = the total number of index words in document $j$ (just add the counts in column $j$).
$D$ = the number of documents (the number of columns).
$D_i$ = the number of documents in which word $i$ appears (the number of non-zero cells in row $i$).
In this formula, words that are concentrated in certain documents are emphasized (by the $N_{i,j} / N_{*,j}$ ratio), and words that appear in only a few documents are also emphasized (by the $\log( D / D_i )$ term).
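As a quick worked example (using the count matrix above; this calculation is not in the original article): for the word "book" in title T3 we have $N_{i,j} = 1$, the index-word total of column T3 is $N_{*,j} = 4$ (book, investing, market, stock), $D = 9$, and $D_i = 2$, so $\mathrm{TFIDF} = (1/4)\cdot\log(9/2) \approx 0.38$ using the natural logarithm.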
Since we have such a small example, we will skip this step and move on to the heart of LSA, doing the singular value decomposition of our matrix of counts. However, if we did want to add TFIDF to our LSA class we could add the following two lines at the beginning of our python file to import the log, asarray, and sum functions.
from math import log
from numpy import asarray, sum
Then we would add the following TFIDF method to our LSA class. WordsPerDoc ($N_{*,j}$) just holds the sum of each column, which is the total number of index words in each document. DocsPerWord ($D_i$) uses asarray to create an array of what would be True and False values, depending on whether the cell value is greater than 0 or not, but the 'i' argument turns it into 1's and 0's instead. Then each row is summed up which tells us how many documents each word appears in. Finally, we just step through each cell and apply the formula. We do have to change cols (which is the number of documents) into a float to prevent integer division.
    def TFIDF(self):
        WordsPerDoc = sum(self.A, axis=0)
        DocsPerWord = sum(asarray(self.A > 0, 'i'), axis=1)
        rows, cols = self.A.shape
        for i in range(rows):
            for j in range(cols):
                self.A[i,j] = (self.A[i,j] / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])
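If we did want to apply the weighting, the natural place would presumably be right after building the counts, e.g. calling mylsa.TFIDF() after mylsa.build() and before the SVD step described next.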
Part 3 - Using the Singular Value Decomposition
Once we have built our (words by titles) matrix, we call upon a powerful but little known technique called Singular Value Decomposition or SVD to analyze the matrix for us. The "Singular Value Decomposition Tutorial" is a gentle introduction for readers that want to learn more about this powerful and useful algorithm.
The reason SVD is useful, is that it finds a reduced dimensional representation of our matrix that emphasizes the strongest relationships and throws away the noise. In other words, it makes the best possible reconstruction of the matrix with the least possible information. To do this, it throws out noise, which does not help, and emphasizes strong patterns and trends, which do help. The trick in using SVD is in figuring out how many dimensions or "concepts" to use when approximating the matrix. Too few dimensions and important patterns are left out, too many and noise caused by random word choices will creep back in.
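To make the idea of a reduced-dimensional reconstruction concrete, here is a small sketch (not part of the original article's code; it assumes the count matrix A produced by build()) that keeps only the k strongest singular values and rebuilds an approximation of the matrix:

```python
import numpy as np

def rank_k_approximation(A, k):
    """Rebuild A from only its k strongest singular values ("concepts").

    The smaller singular values, which mostly carry noise, are thrown away.
    """
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return np.dot(np.dot(U[:, :k], np.diag(S[:k])), Vt[:k, :])

# Hypothetical usage: a 3-concept approximation of the 11x9 count matrix.
# A3 = rank_k_approximation(mylsa.A, 3)
```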
The SVD algorithm is a little involved, but fortunately Python has a library function that makes it simple to use. By adding the one line method below to our LSA class, we can factor our matrix into 3 other matrices. The U matrix gives us the coordinates of each word on our “concept” space, the Vt matrix gives us the coordinates of each document in our “concept” space, and the S matrix of singular values gives us a clue as to how many dimensions or “concepts” we need to include.
    def calc(self):
        self.U, self.S, self.Vt = svd(self.A)

In order to choose the right number of dimensions to use, we can make a histogram of the square of the singular values. This graphs the importance each singular value contributes to approximating our matrix. Here is the histogram in our example.
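The original article shows this histogram as a figure. A minimal sketch of how such a plot could be produced (not part of the original code; it assumes matplotlib is available and that calc() has been run) might look like this:

```python
import matplotlib.pyplot as plt

def plot_singular_value_importance(S):
    """Bar chart of squared singular values: each bar shows how much that
    dimension contributes to approximating the count matrix."""
    plt.bar(range(1, len(S) + 1), S ** 2)
    plt.xlabel("Singular value")
    plt.ylabel("Squared magnitude")
    plt.show()

# plot_singular_value_importance(mylsa.S)
```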
For a large collection of documents, somewhere between 100 and 500 dimensions is typically kept. In our small example, because we want to be able to draw a readable picture, we use 3 dimensions, discard the first dimension, and plot the second and third dimensions.
Why discard the first dimension? For documents, the first dimension correlates with the length of the document; for words, it correlates with the number of times the word is used across all documents. If we had centered the matrix by subtracting each column's average from that column, then the first dimension could be used.
However, the matrix is usually not centered for LSI, because centering turns a sparse matrix into a dense one, which dramatically increases memory and computation requirements. It is therefore more efficient to leave the matrix uncentered and simply discard the first dimension.
Here is the complete three-dimensional singular value decomposition of our matrix. Each word has three numbers associated with it, one for each dimension. The first number tends to correspond to the number of times that word appears in all the titles, so it is not as useful as the second and third dimensions. Similarly, each title has three numbers associated with it, one for each dimension; the first of these tends to correspond to the number of index words in the title, and it is likewise discarded.
(Here the original article shows the three decomposition matrices: U, the word coordinates in concept space; S, the singular values; and Vt, the title coordinates in concept space.)
Part 4 - Clustering by Color
The numbers are converted into colors: blue indicates negative values, red indicates positive values, and white indicates values close to zero (shown as a colored table in the original article).
We can use these colors to cluster the titles. We ignore the first dimension for clustering because all titles are red. In the second dimension, we have the following result.
| Dim2 | Titles |
|------|--------|
| red  | 6-7, 9 |
| blue | 1-5, 8 |
Using the third dimension, we can split each of these groups again the same way. For example, looking at the third dimension, title 6 is blue, but title 7 and title 9 are still red. Doing this for both groups, we end up with these 4 groups.
| Dim2 | Dim3 | Titles    |
|------|------|-----------|
| red  | red  | 7, 9      |
| red  | blue | 6         |
| blue | red  | 2, 4-5, 8 |
| blue | blue | 1, 3      |
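A small sketch of this sign-based grouping (not from the original article; it assumes calc() has been run so that mylsa.Vt holds one column of coordinates per title, and note that the SVD's sign convention may flip which group reads as red and which as blue):

```python
from collections import defaultdict

def cluster_by_sign(Vt):
    """Group titles by the signs of their 2nd and 3rd dimension values,
    mirroring the red/blue tables above (positive -> 'red', negative -> 'blue')."""
    groups = defaultdict(list)
    for t in range(Vt.shape[1]):
        key = ("red" if Vt[1, t] >= 0 else "blue",
               "red" if Vt[2, t] >= 0 else "blue")
        groups[key].append(t + 1)  # report 1-based title numbers
    return dict(groups)

# cluster_by_sign(mylsa.Vt)
```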
It’s interesting to compare this table with what we get when we graph the results in the next section.
Part 5 - Clustering by Value
Leaving out the first dimension, as we discussed, let's graph the second and third dimensions using an XY graph. We'll put the second dimension on the X axis and the third dimension on the Y axis and graph each word and title. It's interesting to compare the XY graph with the table we just created that clusters the documents.
In the graph below, words are represented by red squares and titles are represented by blue circles. For example the word "book" has dimension values (0.15, -0.27, 0.04). We ignore the first dimension value 0.15 and graph "book" to position (x = -0.27, y = 0.04) as can be seen in the graph. Titles are similarly graphed.
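A sketch of how such a graph could be drawn (not part of the original code; it assumes matplotlib and that calc() has been run, so mylsa.U holds word coordinates as rows and mylsa.Vt holds title coordinates as columns):

```python
import matplotlib.pyplot as plt

def plot_concept_space(lsa):
    """Plot words (red squares) and titles (blue circles) using the
    2nd dimension as x and the 3rd as y, ignoring the 1st dimension."""
    for i, word in enumerate(lsa.keys):
        x, y = lsa.U[i, 1], lsa.U[i, 2]
        plt.scatter(x, y, marker='s', color='red')
        plt.annotate(word, (x, y))
    for t in range(lsa.Vt.shape[1]):
        x, y = lsa.Vt[1, t], lsa.Vt[2, t]
        plt.scatter(x, y, marker='o', color='blue')
        plt.annotate("T%d" % (t + 1), (x, y))
    plt.xlabel("Dimension 2")
    plt.ylabel("Dimension 3")
    plt.show()

# plot_concept_space(mylsa)
```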
One advantage of this technique is that both words and titles are placed on the same graph. Not only can we identify clusters of titles, but we can also label the clusters by looking at what words are also in the cluster. For example, the lower left cluster has titles 1 and 3 which are both about stock market investing. The words "stock" and "market" are conveniently located in the cluster, making it easy to see what the cluster is about. Another example is the middle cluster which has titles 2, 4, 5, and, to a somewhat lesser extent, title 8. Titles 2, 4, and 5 are close to the words "value" and "investing" which summarizes those titles quite well.
Advantages, Disadvantages, and Applications of LSI
Advantages:
1. Documents and words are mapped into the same concept space. In this space we can cluster documents, cluster words, and, more importantly, search for documents given a set of words and vice versa (a small similarity sketch follows this list).
2. The concept space has far fewer dimensions than the original matrix, and those dimensions carry the most important information with the least noise, so the concept space is a good input for other algorithms, for example when trying out different clustering algorithms.
3. LSI is a global algorithm: it looks for trends and patterns across all words and all documents, so it can find information that local algorithms miss. It can also be combined with local algorithms, such as nearest neighbours, to become even more useful.
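As a minimal illustration of point 1 (not from the original article; it assumes calc() has been run on mylsa and, as discussed earlier, it skips the first dimension), the sketch below ranks titles by cosine similarity to a single query word in concept space:

```python
import numpy as np

def titles_for_word(lsa, word, dims=(1, 2)):
    """Rank titles by cosine similarity to a word in the chosen
    concept-space dimensions (by default the 2nd and 3rd)."""
    dims = list(dims)
    w = lsa.U[lsa.keys.index(word), dims]      # word coordinates
    ranked = []
    for t in range(lsa.Vt.shape[1]):
        d = lsa.Vt[dims, t]                    # title coordinates
        sim = np.dot(w, d) / (np.linalg.norm(w) * np.linalg.norm(d) + 1e-12)
        ranked.append((sim, t + 1))            # 1-based title numbers
    return sorted(ranked, reverse=True)

# e.g. titles_for_word(mylsa, 'stock')
```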
Disadvantages:
1. LSI assumes a Gaussian distribution and the Frobenius norm, which may not fit all problems; for example, word counts in documents appear to follow a Poisson distribution rather than a Gaussian one.
2. LSI assumes that each word has a single meaning, so it cannot handle polysemy (words with multiple meanings).
3. LSI depends on the SVD, which is computationally expensive and hard to update as new documents arrive.
Despite these drawbacks, LSI is widely used, for example for finding and organizing search results, clustering documents, spam filtering, speech recognition, patent searching, automated essay evaluation, and more.