Word Embeddings vs Bag-of-Words: The Curious Case of Recommender Systems
Are word embeddings always the best choice?
If you can challenge a well-accepted view in data science with data, that’s pretty cool, right? After all, “in data we trust”, or so we profess! Word embeddings have caused a revolution in the world of natural language processing, as a result of which we are much closer to understanding the meaning and context of text and transcribed speech today. It is a world apart from the good old bag-of-words (BoW) models, which rely on frequencies of words under the unrealistic assumption that each word occurs independently of all others. The results with word embeddings, which create a vector for every word, have been nothing short of spectacular. One of the oft-cited success stories of word embeddings involves subtracting the man vector from the king vector and adding the woman vector, which returns the queen vector (king - man + woman ≈ queen).
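For instance, here is a minimal sketch of that analogy using gensim's pretrained GloVe vectors (the model name and the exact neighbor returned are assumptions on my part; results vary with the vectors you load):
# A minimal sketch of the king - man + woman ≈ queen analogy with gensim's downloader
import gensim.downloader as api
# Loads a small set of pretrained GloVe vectors (downloaded on first use)
vectors = api.load('glove-wiki-gigaword-50')
# most_similar adds the 'positive' vectors and subtracts the 'negative' ones
print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# With these vectors, the top hit is expected to be something like ('queen', 0.85)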
Very smart indeed! However, I raise the question of whether word embeddings should always be preferred to bag-of-words. In building a review-based recommender system, it dawned on me that although word embeddings are incredible, they may not be the most suitable technique for my purpose. As crazy as it may sound, I got better results with the BoW approach. In this article, I show that the uber-smart ability of word embeddings to understand related words actually turns out to be a shortcoming when it comes to making better product recommendations.
Word embeddings in a jiffy
Simply stated, word embeddings consider each word in its context; for example, in the word2vec approach, a popular technique developed by Tomas Mikolov and colleagues at Google, we generate for each word a numeric vector with a large number of dimensions. Using neural networks, the vectors are created by predicting, for each word, what its neighboring words may be. Multiple Python libraries like spaCy and gensim ship with built-in word vectors; so, while word embeddings have been criticized in the past on grounds of complexity, we don’t have to write the code from scratch. Unless you want to dig into the math of one-hot encoding, neural nets and other complex stuff, using word vectors today is as simple as using BoW. After all, you don’t need to know the theory of internal combustion engines to drive a car!
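As a quick illustration, here is a minimal sketch of pulling a word's vector out of spaCy's medium English model (en_core_web_md ships 300-dimensional vectors; other models and versions differ):
# Every word in spaCy's md/lg models comes with a dense numeric vector
import spacy
nlp = spacy.load('en_core_web_md')
token = nlp('king')[0]
print(token.has_vector)     # True if the word is in the model's vector table
print(token.vector.shape)   # (300,) for en_core_web_md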
Throwback to the bag-of-words (BoW)
In BoW models, similarity between two documents, whether measured with cosine or Jaccard similarity, literally checks which, and how many, words are exactly the same across the two documents. To appreciate how much smarter the word embeddings approach is, let me use an example shared by user srce code on stackoverflow.com. Consider two sentences: (i) “How can I help end violence in the world?” (ii) “What should we do to bring global peace?” Let us see how word vectors versus BoW-based cosine similarity treat these two sentences.
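Cosine similarity is computed in the snippet below; for completeness, here is a rough sketch of the Jaccard version on the same two sentences (simple whitespace tokenization, no stop-word removal):
# Rough Jaccard similarity: shared tokens divided by all distinct tokens
def jaccard_similarity(a, b):
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
print(jaccard_similarity('How can I help end violence in the world?',
                         'What should we do to bring global peace?'))
# Prints 0.0 here, since the two questions share no words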
# Code to calculate spaCy and BoW similarity scores for two texts
import spacy
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# spaCy ships word vectors with its medium (md) and large (lg) models
nlp = spacy.load('en_core_web_md')

text1 = 'How can I help end violence in the world?'
text2 = 'What should we do to bring global peace?'

# Calculate spaCy similarity between texts 1 and 2
doc1 = nlp(text1)
doc2 = nlp(text2)
print("spaCy similarity:", doc1.similarity(doc2))

# Calculate cosine similarity using bag-of-words
documents = [text1, text2]
count_vectorizer = CountVectorizer(stop_words='english')
sparse_matrix = count_vectorizer.fit_transform(documents)
doc_term_matrix = sparse_matrix.toarray()   # plain ndarray that pandas handles cleanly
# get_feature_names() was renamed to get_feature_names_out() in newer scikit-learn
df = pd.DataFrame(doc_term_matrix, columns=count_vectorizer.get_feature_names_out(), index=['x', 'y'])
print("Cosine similarity:", cosine_similarity(df, df)[0, 1])
With its word vectors, spaCy assigns a similarity of 91% to these two sentences, while the BoW cosine similarity score, not surprisingly, turns out to be 0 (since the two sentences do not share a single word). In most applications, the word vector approach will probably win hands down.
But not all may be well in paradise, and “Houston, we (may) have a problem.” Let us consider a pair of hi-fi speakers or headphones, and focus on two short reviews: (i) “These speakers have an excellent soundstage.” (ii) “These speakers boast outstanding imaging.” The spaCy model gives a similarity score of 86% between these two reviews, while BoW cosine similarity returns a value of 29%. The spaCy similarity score is very high because people often mention the features soundstage and imaging in close proximity to each other; however, these features mean very different things. Obviously, the ability to distinguish between soundstage and imaging in user or expert reviews will be critical in building a recommender system for high-end speakers or headphones. So what is considered a strength of word embeddings may turn out to be a shortcoming in this context. Now onto the recommender system that I built for running shoes, and how and why I got better results with BoW.
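That comparison can be reproduced with essentially the same code as before; here is a compact sketch (the exact scores depend on the spaCy model and library versions, so treat 86% and 29% as approximate):
# Comparing the two speaker reviews with spaCy vectors and with BoW cosine similarity
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
nlp = spacy.load('en_core_web_md')
review1 = 'These speakers have an excellent soundstage.'
review2 = 'These speakers boast outstanding imaging.'
# Word vectors treat soundstage and imaging as closely related, so the score is high
print('spaCy similarity:', nlp(review1).similarity(nlp(review2)))   # roughly 0.86
# BoW only credits the literally shared word ('speakers'), so the score is much lower
bow = CountVectorizer(stop_words='english').fit_transform([review1, review2])
print('BoW cosine similarity:', cosine_similarity(bow)[0, 1])       # roughly 0.29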
(Image: Soundstage vs Imaging explained)
A recommender system using word embeddings vs BoW
My system takes features of a pair of running shoes (e.g., cushion, comfort, durability, support, etc.) as inputs from a user, and matches those preferred features to products using lots of customer reviews. I naturally started with spaCy for obtaining similarity scores between the features a shopper wants in running shoes and the product reviews I scraped from the Amazon website. After all, with spaCy, I could avoid having to change different parts of speech of a feature word into its noun form (e.g., replace comfortably or comfortable by comfort) or replace synonyms of a feature word (e.g., replace indestructible with durability). SpaCy would be smart enough to figure these things out on its own! With a given set of features, I calculate the similarity score for each review, and then the average similarity for every product. The three products with highest average similarity scores are shown below with three preferred features: cushion, comfort and durability (while I also use feature-level sentiment analysis to provide recommendations, it is not relevant in this discussion):
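Here is a hedged sketch of that matching step (the product names and reviews below are made up for illustration; the real system uses scraped Amazon reviews and also layers in feature-level sentiment, which is omitted here):
# Sketch: rank products by average spaCy similarity between preferred features and their reviews
import spacy
nlp = spacy.load('en_core_web_md')
preferred_features = nlp('cushion comfort durability')
# Hypothetical reviews standing in for the scraped Amazon data
reviews_by_product = {
    'Shoe A': ['Great cushion and very comfortable on long runs.',
               'Held up well after 300 miles.'],
    'Shoe B': ['Stylish, but the laces fray quickly.'],
}
def average_similarity(query, reviews):
    # Mean similarity between the feature query and each review of one product
    return sum(query.similarity(nlp(review)) for review in reviews) / len(reviews)
scores = {product: average_similarity(preferred_features, reviews)
          for product, reviews in reviews_by_product.items()}
for product, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(product, round(score, 3))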
Table 1: Three products with top similarity scores using spaCy word vectors
For the top three matching scores in Table 1, you will note that some of the product features were mentioned only in a small percentage of reviews (columns 2, 3, and 4). That’s not good, for if only 2–6% of reviews of a product mention a feature that a shopper considers important, I wouldn’t feel comfortable recommending that product to him/her. Curiously enough, when I used the plain vanilla BoW cosine similarity, I did not face this problem (Table 2).
Table 2: Three products with top similarity scores using BoW cosine similarity
Because it looks for an exact match of words, BoW does not consider product feature words like cushion and comfort as similar or related, and produces a significantly lower similarity score if one of these features is missing in a review. But given that the word vectors approach considers cushion and comfort to be related, it barely increases the similarity score for reviews that mention both features over those that discuss only one. To check my hunch further, I made up two reviews, one mentioning cushion and durability, and the other cushion, comfort and durability (Table 3).
Table 3: SpaCy and BoW similarity scores with three preferred product features
Indeed, I get a much bigger increase of 45% in the similarity score with the BoW approach when the feature comfort is also mentioned, while the corresponding increase with spaCy word vectors is a not-so-impressive 5.3%.
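That check can be reproduced along the following lines (a sketch with made-up review sentences; the 45% and 5.3% figures come from my actual data, so the numbers here will only be directionally similar):
# Sketch of the Table 3 check: how much does adding 'comfort' move each similarity score?
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
nlp = spacy.load('en_core_web_md')
query = 'cushion comfort durability'
review_two = 'These shoes have great cushion and durability.'
review_three = 'These shoes have great cushion, comfort and durability.'
for review in (review_two, review_three):
    # spaCy barely rewards the extra mention, since cushion and comfort are 'related' anyway
    print('spaCy:', round(nlp(query).similarity(nlp(review)), 3))
    # BoW cosine similarity jumps much more once 'comfort' is literally present
    matrix = CountVectorizer(stop_words='english').fit_transform([query, review])
    print('BoW:  ', round(cosine_similarity(matrix)[0, 1], 3))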
Would the problem I encountered with word embeddings go away if I created my own word vectors just from shoe reviews? Not really, for just as one may write about a sofa set, “The firm cushion is comfortable for my back,” a cross-country runner, for whom cushion and comfort are two different features of shoes, may write, “These shoes have excellent cushion and comfort.” When word vectors are created with our own data, cushion and comfort will still come across as being close to each other, and we will continue to face the same challenge.
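For completeness, here is a rough sketch of training custom vectors with gensim's Word2Vec (gensim 4.x parameter names; the tiny corpus below only demonstrates the API, and with a realistic corpus of reviews cushion and comfort would, per the argument above, still land close together):
# Sketch: training word vectors on your own review corpus with gensim
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
# Hypothetical stand-in for a large corpus of scraped shoe reviews
raw_reviews = [
    'These shoes have excellent cushion and comfort.',
    'Great cushion, very comfortable and durable.',
    'The cushion makes them comfortable for long runs.',
]
tokenized = [simple_preprocess(review) for review in raw_reviews]
model = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1, epochs=50)
# With a real corpus, 'comfort' would likely rank among the nearest neighbors of 'cushion'
print(model.wv.most_similar('cushion', topn=3))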
(Image: Cushion vs Comfort explained)
Final Thoughts
Word embeddings have done wonders, bringing much needed semantics and context to words, which were just treated as frequency counts without any sequence or meaning in the BoW models. However, the example of recommender systems I provided suggests that we should not automatically assume that the newer methods will perform better in every application. Especially in settings where two or more words, which have distinct meanings, are likely to be mentioned close to each other, word embeddings may fail to adequately distinguish between them. In applications where such distinctions matter, the results are likely to be unsatisfactory, as I found out the hard way. The current state-of-the-art in word embeddings appears to be ill-equipped to handle the situation I described above.
Translated from: https://medium.com/swlh/word-embeddings-versus-bag-of-words-the-curious-case-of-recommender-systems-6ac1604d4424