Build a Search Engine Based on Stack Overflow Questions
Let's take a quick overview of Stack Overflow before we dive deep into the project itself. Stack Overflow is one of the largest Q&A platforms for computer programmers. People post questions and queries associated with a wide range of topics (mostly related to computer programming), and fellow users try to resolve those queries in the most helpful manner.
INDEX : Step by step approach
SECTION 1 : Brief overview
1. Business problem : Need of a search engine
2. Dataset
2.1. Dataset
2.2. The process flow
2.3. High level Overview
3. Exploratory data analysis and Data pre-processing
-----------------------------------------------------
SECTION 2 : The attack plan
4. Modelling : The tag predictor
4.1. A TAG Predictor model
4.2. TRAIN_TEST_SPLIT
4.3. Time based splitting
4.4. Modelling : GRU based Encoder-decoder seq2seq model
4.5. Model embedding
4.6. Word2Vec embedding
4.7. Multi-label target problem
5. LDA (Latent Dirichlet allocation) : Topic Modelling
6. Okapi BM25 Score : simplest searching technique
7. Sentence embedding : BERT
8. Sentence embedding : Universal sentence encoder
-----------------------------------------------------
SECTION 3 : Productionizing the solution
9. Entire pipeline deployment on a remote server
9.1. A Cloud platform
9.2. Web App using Flask
-----------------------------------------------------
SECTION 4 : Results and conclusion
10. Results and conclusion
10.1 Final Results : BERT
10.2. Final Results : USE
10.3. Final Inferences
1. Business problem : Need of a search engine
'StackOverflow' is sitting on a huge amount of information contributed by the public, for the public. Nonetheless, this large amount of data also makes it very difficult to retrieve the exact solution one is looking for. It therefore becomes the primary duty of 'StackOverflow' (or any such facility provider) to serve the exact solution users are querying for in the search bar on its website. Otherwise, the worst-case scenario is this: even if 'StackOverflow' has the exact solution users are looking for, it will not be able to serve it, simply because of a weak search-engine mechanism or no searching mechanism at all. In simple words, out of terribly huge amounts of data, a search engine helps in finding exactly what users are looking for. One can understand the need for a 'searching mechanism' by imagining if Google was not there, or if there were no product search bar on Amazon.
Actual query search results on StackOverflow.com
One line problem statement : "To build a ranking mechanism on the basis of 'StackOverflow' Questions-Answers data, which results in a set of related Questions for a provided search-query."
2. Dataset :
'StackOverflow' has a giant number of questions asked by various specialized and general communities, along with the answers and comments associated with each question. Usually, the answers and comments are contributed on the platform by experts or community enthusiasts. Basically, it is a public Q-A forum. 'StackOverflow' has made a sample of its large 'Question-Answer' data publicly available for research and development in the data community, which is available at: https://archive.org/details/stackexchange
Data from the following StackExchange properties is used :
1. cs.stackexchange.com.7z
2. datascience.stackexchange.com.7z
3. ai.stackexchange.com.7z
4. stats.stackexchange.com.7z
pandas dataframe
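Each archive from archive.org extracts to a set of XML files, of which Posts.xml holds the questions. A minimal loading sketch is shown below; it assumes the archives are already extracted, and the helper name and selected columns are illustrative rather than the project's exact code.

# Minimal sketch: parse a StackExchange Posts.xml dump into a pandas DataFrame.
import xml.etree.ElementTree as ET
import pandas as pd

def load_posts(path):
    rows = []
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "row" and elem.get("PostTypeId") == "1":   # PostTypeId 1 = question
            rows.append({
                "Id": elem.get("Id"),
                "CreationDate": elem.get("CreationDate"),
                "Title": elem.get("Title"),
                "Body": elem.get("Body"),
                "Tags": elem.get("Tags"),
            })
        elem.clear()                                              # keep memory usage flat
    return pd.DataFrame(rows)

df_ai = load_posts("ai.stackexchange.com/Posts.xml")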
The process flow :
The flow of the solution is mainly broken into two parts :
High level Overview :
- Acquisition of appropriate data from Archive.org
- Exploratory data analysis and Data pre-processing.
- Training a tag-predictor model | gathering the chunk of questions associated with the predicted 'tag'.
- LDA topic modelling | gathering the chunk of questions associated with the predicted 'topic'.
- BM25 results | gathering the chunk of questions associated with the BM25 results.
- Combining all three chunks of questions.
- Sentence embedding : BERT/Universal sentence encoder (Vectorization)
- Ranking the results based on the metric : 'cosine_distance'
3. Exploratory data analysis and Data pre-processing :
Let's take a look at all the available features and check for missing values.
# Merging all dataFrames into one dataframe
import pandas as pd

df = pd.concat([df_ai, df_cs, df_ds, df_stats])
df.info()

Our dataset has three prominent features on which we are building the searching mechanism: Question-Title, Question-Body and Question-Tags. Each question can have one to five tags along with the question title and its descriptive text. For the purpose of this case study I chose to go with 216k unique questions only.
Now, we are ready to move on to the data cleaning and further data processing steps…
All Stack Exchange properties are essentially web pages, and web-based data usually contains HTML tags and other markup. These tags and special symbols do not contribute any information to the end task, hence I decided to remove them. If you observe closely, some data points are closely related to the mathematics domain; for this experiment I chose to remove LaTeX and formulae too. It is an empirically proven result from previous state-of-the-art NLP experiments that applying 'decontractions' (expanding contractions such as "can't" to "can not") tends to give better results.
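A minimal cleaning sketch under those choices is given below; the regular expressions and function names are illustrative assumptions, not the project's exact code.

# Sketch: strip HTML, drop LaTeX/formulae, expand contractions, remove special symbols.
import re
from bs4 import BeautifulSoup   # assumes BeautifulSoup is available for HTML stripping

def decontract(text):
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"'ll", " will", text)
    text = re.sub(r"'ve", " have", text)
    return text

def clean_text(html):
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")  # remove HTML tags
    text = re.sub(r"\$.*?\$", " ", text)                               # remove inline LaTeX
    text = decontract(text.lower())
    text = re.sub(r"[^a-z0-9 ]", " ", text)                            # remove special symbols
    return re.sub(r"\s+", " ", text).strip()

df["clean_body"] = df["Body"].apply(clean_text)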
wordcloud : tags | wordcloud : title | wordcloud : body
Since StackExchange has questions and answers across a range of technical subjects, the community also uses its own set of words, i.e., its own vocabulary. To avoid losing that subject-specific crucial information, I avoided using Lemmatization/Stemming.
The full EDA with inferences can be found here on my GitHub profile.
4. A TAG Predictor model :
A tag can be considered a kind of topic/subject for a posted question, given by the (publishing) user himself. Sometimes tags are system-generated too. It is important to note that our problem is not only a multi-class problem, it is a multi-label problem, i.e., each datapoint can have one or more tags.
Some tags in the dataset are far more frequent than others; since highly frequent target tags would dominate the model, I chose an upper threshold of 7000 and removed tags occurring more than 7000 times in the dataset. Some tags are also very infrequent, so the model could not have learnt anything from such target labels; hence, tags occurring fewer than 500 times in the dataset were removed as well.
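A small sketch of that frequency filtering is shown below; the column names and the space-separated tag format are assumptions for illustration.

# Sketch: keep only tags occurring between 500 and 7000 times in the dataset.
from collections import Counter

tag_counts = Counter(t for tags in df["clean_tags"] for t in tags.split())
keep = {t for t, c in tag_counts.items() if 500 <= c <= 7000}

df["filtered_tags"] = df["clean_tags"].apply(
    lambda tags: [t for t in tags.split() if t in keep])
df = df[df["filtered_tags"].map(len) > 0]   # drop questions left without any tag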
TRAIN_TEST_SPLIT : Time based splitting
The questions and answers on public Q&A platforms like Stack Overflow usually evolve over time. If we did not consider the dependency of the data on time, the generalization accuracy (accuracy on unseen data) would no longer stay close to the cross-validation accuracy in the production environment. Whenever a timestamp is available and the data is time dependent, it is often a better choice to split the dataset 'based on time' than to use a 'random_split'.
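As a minimal sketch (assuming a 'CreationDate' column and an 80/20 split, both of which are illustrative choices):

# Time-based split: train on the older 80% of questions, test on the newest 20%.
df = df.sort_values("CreationDate")
split_idx = int(len(df) * 0.8)
train_df, test_df = df.iloc[:split_idx], df.iloc[split_idx:]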
Modelling : GRU based Encoder-decoder seq2seq model
# Constructing a seq2seq model
import tensorflow as tf
from tensorflow.keras.layers import Input, Bidirectional, GRU, RepeatVector, Dense
from tensorflow.keras.models import Model

tf.keras.backend.clear_session()

enc_inputs = Input(name='text_seq', shape=(250,))
enc_embed = text_embedding_layer(enc_inputs)        # embedding layer built from the custom W2V vectors
encoder = Bidirectional(GRU(name='ENCODER', units=128, dropout=0.2))
enc_out = encoder(enc_embed)

dec_gru = GRU(name='DECODER', units=256, dropout=0.2,
              return_sequences=True, return_state=True)
repeat = RepeatVector(5)(enc_out)                   # one decoder step per possible tag (up to 5 tags)
dec_out, dec_hidden = dec_gru(repeat)

dec_dense = Dense(units=len(tar_vocab) + 1, activation='softmax')
out = dec_dense(dec_out)

model = Model(inputs=enc_inputs, outputs=out)
model.summary()

GRU based Encoder-decoder model
Model embedding : Word2Vec embedding
For the encoder-decoder model I trained custom W2V embeddings on the StackExchange data. The Word2Vec embedding technique needs a huge amount of textual data to learn good-quality word vectors. To gather as much text as possible, I decided to merge 'clean_title + clean_body + clean_comments' together and train the w2v embeddings on that. I also used skip-gram W2V instead of CBOW, since skip-gram tends to provide better results on smaller amounts of data. The manually trained w2v embedding has given good results :
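A minimal training sketch with gensim is shown below (assuming a gensim 3.x style API, which the original post appears to use; the hyper-parameters are illustrative, not the exact values from the project):

# Sketch: training a custom skip-gram Word2Vec model on the merged text.
from gensim.models import Word2Vec

corpus = (df["clean_title"] + " " + df["clean_body"] + " " + df["clean_comments"])
sentences = [doc.split() for doc in corpus]

w2v_model_sg = Word2Vec(sentences, size=300, window=5,
                        min_count=5, sg=1, workers=4)   # sg=1 selects skip-gram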
w2v_model_sg.most_similar('sigmoid')

Manual W2V results.
Multi-label target problem :
In deep learning modelling there are many ways to handle multi-label target features; a few of them are listed below:
For the current project work I tried both methods. In the end the 2nd method was employed, as it gave better results.
Note : For the complete data handling and tweaking with code — visit my github_profile.
5. LDA (Latent Dirichlet allocation) : Topic Modelling
There are many approaches to topic modelling; Latent Dirichlet Allocation is the most popular topic modelling technique amongst them all. Basically, LDA applies a mathematical matrix-factorization technique in order to generate labels or topics for each document in the corpus. LDA takes a word-frequency matrix, as below, and tries to factorize the given matrix into two further matrices.
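A minimal topic-modelling sketch with gensim is given below; the number of topics and the column names are illustrative assumptions, not the project's exact settings.

# Sketch: LDA topic modelling, then attaching the dominant topic to each question.
from gensim import corpora
from gensim.models import LdaModel

docs = [doc.split() for doc in (df["clean_title"] + " " + df["clean_body"])]
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(bow_corpus, num_topics=20, id2word=dictionary, passes=5)

def dominant_topic(bow):
    topics = lda.get_document_topics(bow, minimum_probability=0.0)
    return max(topics, key=lambda x: x[1])[0] if topics else -1

df["topic"] = [dominant_topic(bow) for bow in bow_corpus]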
6. Okapi BM25 Score :
Okapi BM25 is one of the simplest search/ranking functions, introduced in the 1990s for information retrieval. The formulation of BM25 is very similar to TF-IDF. BM25 is such an effective searching algorithm that it is also used in the very popular Elasticsearch software.
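A small sketch of this step using the rank_bm25 package is shown below; the package choice and the use of question titles as the indexed field are assumptions for illustration, and the project may implement the scoring differently.

# Sketch: score all questions against a search query with Okapi BM25.
from rank_bm25 import BM25Okapi

tokenized_corpus = [title.split() for title in df["clean_title"]]
bm25 = BM25Okapi(tokenized_corpus)

query_tokens = "gradient descent not converging".split()
scores = bm25.get_scores(query_tokens)          # one BM25 score per question
top_idx = scores.argsort()[::-1][:100]          # candidate chunk passed on for re-ranking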
BM25 formulation (source: Wikipedia)
7. Sentence embedding : BERT
BERT (Bidirectional Encoder Representations from Transformers) is a language-representation technique based on deep neural networks. The biggest advantages of BERT over most previously available embedding techniques (except w2v, but that is a word embedding) are 'transfer learning' and 'sentence embeddings'. In 2014, VGG16 earned its place in the computer vision community because of its 'transfer learning' advantage, something that was not possible in the natural language processing space at early stages. There are two existing strategies for applying pre-trained language representations: 'feature-based' and 'fine-tuning'. Since the dataset has mathematics- and computer-science-specific documents, 'fine-tuning' would most likely help for semantic searching, but because of computational expense I am limiting myself to 'feature-based' embeddings only. This model has been pre-trained for English on the Wikipedia and BooksCorpus databases. For the BERT pretrained model architecture we get two options:
I. BASE : (L=12, H=768, A=12, Total Parameters=110M)
II. LARGE : (L=24, H=1024, A=16, Total Parameters=340M)
where L = number of transformer blocks, H = hidden layer size, A = number of self-attention heads
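A minimal feature-based embedding sketch is shown below, using the HuggingFace transformers library with the BASE model; this is an assumption for illustration, and the original pipeline may instead use TF Hub or bert-as-service.

# Sketch: mean-pooled BERT sentence embeddings (feature-based, no fine-tuning).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # BASE: L=12, H=768, A=12
bert = AutoModel.from_pretrained("bert-base-uncased")

def bert_embed(texts, max_len=128):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**batch).last_hidden_state        # (batch, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)        # zero out padding positions
    return (hidden * mask).sum(1) / mask.sum(1)         # (batch, 768)

query_vec = bert_embed(["how to reverse a linked list in python"])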
8. Sentence embedding : Universal sentence encoder
As the name (UNIVERSAL SENTENCE ENCODER) itself suggests, 'USE' takes sentences as input and gives high-dimensional vector representations of the input text. The input can be variable-length English text and the output is a 512-dimensional vector. The universal-sentence-encoder model has two variants: one is trained with a deep averaging network (DAN) encoder, and the other is trained with a Transformer. In the paper the authors mention that the Transformer-based USE tends to give better results than the DAN, but it comes at a price in terms of computational resources and run-time complexity. The USE model is trained on sources such as Wikipedia, web news, the Stanford Natural Language Inference (SNLI) corpus, web question-answer pages and discussion forums. The authors explain that the USE model was created with unsupervised NLP tasks in mind, such as transfer learning using sentence embeddings; hence it makes complete sense to use USE in our semantic search task.
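A minimal sketch of this step is given below, loading USE from TF Hub and ranking a precomputed matrix of question embeddings by cosine similarity; the variable names and the helper function are illustrative assumptions.

# Sketch: embed the query with USE and rank stored questions by cosine similarity.
import numpy as np
import tensorflow_hub as hub

use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def rank_by_use(query, question_vecs, titles, n=10):
    q = use([query]).numpy()                                           # (1, 512)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    m = question_vecs / np.linalg.norm(question_vecs, axis=1, keepdims=True)
    scores = (m @ q.T).ravel()                                         # cosine similarity
    top = np.argsort(-scores)[:n]
    return [(titles[i], float(scores[i])) for i in top]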
Sentence similarity scores using embeddings from the universal sentence encoder.
9. Entire pipeline deployment on a remote server :
Running the model after deployment on a remote server
A Cloud platform :
To serve the purpose of the current project, I deployed the entire pipeline on Google Cloud Platform (GCP). Launching a virtual machine with Google Cloud Platform or Amazon Web Services (AWS) is quite easy. I used a very basic virtual machine instance with the following specs :
Web App using Flask :
# Flask app
import flask
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/index')
def index():
    return flask.render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    lst = request.form.to_dict()
    print(lst)
    results = USE_results(query=lst['query'], n=10).tolist()
    return jsonify({'Search_results': results})

To develop a web app with Python and HTML, Flask is an easy and lightweight open-source framework. I've pasted the web-app deployment code required for any virtual machine on my GitHub, feel free to visit my profile.
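Once the Flask app is running on the VM, the endpoint can be exercised with a simple POST request; the host, port and example query below are placeholders, not the project's actual deployment details.

# Sketch: querying the deployed /predict endpoint.
import requests

resp = requests.post("http://<vm-external-ip>:5000/predict",
                     data={"query": "how to merge two dataframes in pandas"})
print(resp.json()["Search_results"])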
10.1 Final Results : BERT
BERT Results : 1 | BERT Results : 2 | BERT Results : 3
10.2. Final Results : USE
USE Results : 1 | USE Results : 2 | USE Results : 3
10.3. Final Inferences :
Please find complete code with documentation : https://github.com/vispute/StackOverflow_semantic_search_engine
Feel free to connect with me on LinkedIn : https://www.linkedin.com/in/akshay-vispute-a34bb5136/
Future work :
1. Scale up to a few million datapoints.
2. Pretrain BERT and USE sentence embeddings on the StackExchange data or the downstream task.
3. A tutorial for 'entire pipeline deployment on a remote web server like GCP, AWS or Azure'.
References :
1. BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding — https://arxiv.org/pdf/1810.04805.pdf
2. USE : UNIVERSAL SENTENCE ENCODER https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46808.pdf
3. Topic Modelling — LDA (Latent Dirichlet Allocation) — LDA paper : http://link.springer.com/chapter/10.1007%2F978-3-642-13657-3_43
4. Improvements to BM25 and Language Models Examined : http://www.cs.otago.ac.nz/homepages/andrew/papers/2014-2.pdf
Originally published at: https://medium.com/@vispute.ak/build-a-search-engine-based-on-stack-overflow-questions-88a4bc0c195c