Twitter Dataset Processing: Twitter Data Cleaning and Preprocessing for Data Science
In the past decade, new forms of communication, such as microblogging and text messaging, have emerged and become ubiquitous. While there is no limit to the range of information conveyed by tweets and texts, these short messages are often used to share opinions and sentiments that people have about what is going on in the world around them.
Opinion mining (known as sentiment analysis or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.
Both lexicon-based and machine learning-based approaches will be used for emoticon-based sentiment analysis. First, we start with machine learning-based clustering. In the machine learning-based approach we use supervised and unsupervised learning methods. The Twitter data is collected and given as input to the system. The system classifies each tweet as positive, negative or neutral, and also outputs the number of positive, negative and neutral tweets for each emoticon separately. In addition, the overall polarity of each tweet is determined.
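For illustration, here is a minimal sketch of that polarity step. It assumes TextBlob (which the pre-processing code later imports) as the scoring backend; the thresholds and example tweets are ours, not part of the original pipeline.

# Hypothetical sketch: label a tweet as Positive, Negative or Neutral by polarity.
from textblob import TextBlob

def label_tweet(text):
    polarity = TextBlob(text).sentiment.polarity  # value in [-1.0, 1.0]
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"

print(label_tweet("I love this phone :)"))   # Positive
print(label_tweet("Worst service ever :("))  # Negative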
Collection of Data
To collect the Twitter data, we have to perform some data mining. In that process, we created our own application with the help of the Twitter API and used it to collect a large number of tweets. To do this, we have to create a developer account and register our app. We then receive a consumer key and a consumer secret, which are used in the application settings; from the app's configuration page we also need an access token and an access token secret, which give the application access to Twitter on behalf of the account. The process is divided into two sub-processes, discussed in the next subsection.
Accessing Twitter Data and Streaming
To build the application and interact with Twitter services, we use the REST API provided by Twitter through Tweepy, a Python-based client. The api object is then our entry point for most of the operations we can perform with Twitter. The API provides features to access different types of data, so we can easily collect tweets (and more) and store them in the system. By default, the data is in JSON format; we convert it to plain text for easier handling.
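As a rough sketch of that conversion (the helper name and file paths are placeholders, and results is assumed to be a list of Tweepy Status objects such as the one returned by api.search in the script below):

import json

def save_tweets(results, json_path="tweets.json", text_path="tweets.txt"):
    # Keep the raw JSON (one object per line) and a plain-text copy
    # (one tweet text per line) for easier downstream handling.
    with open(json_path, "w", encoding="utf-8") as jf, \
         open(text_path, "w", encoding="utf-8") as tf:
        for status in results:
            jf.write(json.dumps(status._json) + "\n")        # raw payload from Twitter
            tf.write(status.text.replace("\n", " ") + "\n")  # flattened text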
In case we want to keep the connection open and gather all the upcoming tweets about a particular event, the Streaming API is what we need. By extending and customizing the stream listener, we process the incoming data. This way we can gather a large number of tweets, which is especially useful for live events with worldwide coverage.
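A minimal sketch of such a customized listener, assuming the Tweepy 3.x streaming interface and the auth handler set up in the script below; the tracked keywords and output file are placeholders:

import tweepy

class EventListener(tweepy.StreamListener):
    # Append every incoming tweet about the tracked event to a text file.
    def on_status(self, status):
        with open("event_tweets.txt", "a", encoding="utf-8") as out:
            out.write(status.text.replace("\n", " ") + "\n")

    def on_error(self, status_code):
        # Returning False on HTTP 420 disconnects the stream when rate-limited.
        return status_code != 420

stream = tweepy.Stream(auth=auth, listener=EventListener())
stream.filter(track=["#WorldCup"], languages=["en"])  # keeps the connection open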
# Twitter Sentiment Analysis
import sys
import csv
import tweepy
import matplotlib.pyplot as plt
from collections import Counter
from aylienapiclient import textapi  # AYLIEN Text API client (missing from the original listing)

if sys.version_info[0] < 3:
    input = raw_input

## Twitter credentials
consumer_key = "------------"
consumer_secret = "------------"
access_token = "----------"
access_token_secret = "-----------"

## AYLIEN credentials (placeholders, like the keys above)
application_id = "------------"
application_key = "------------"

## set up an instance of Tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

## set up an instance of the AYLIEN Text API
client = textapi.Client(application_id, application_key)

## search Twitter for something that interests you
query = input("What subject do you want to analyze for this example? \n")
number = input("How many Tweets do you want to analyze? \n")

results = api.search(
    lang="en",
    q=query + " -rt",
    count=number,
    result_type="recent"
)
print("--- Gathered Tweets \n")

## open a csv file to store the Tweets and their sentiment
file_name = 'Sentiment_Analysis_of_{}_Tweets_About_{}.csv'.format(number, query)
with open(file_name, 'w', newline='') as csvfile:
    csv_writer = csv.DictWriter(f=csvfile, fieldnames=["Tweet", "Sentiment"])
    csv_writer.writeheader()
    print("--- Opened a CSV file to store the results of your sentiment analysis... \n")

    ## tidy up the Tweets and send each to the AYLIEN Text API
    for c, result in enumerate(results, start=1):
        tweet = result.text
        tidy_tweet = tweet.strip().encode('ascii', 'ignore')
        if len(tweet) == 0:
            print('Empty Tweet')
            continue
        response = client.Sentiment({'text': tidy_tweet})
        csv_writer.writerow({'Tweet': response['text'], 'Sentiment': response['polarity']})
        print("Analyzed Tweet {}".format(c))
Data Pre-Processing and Cleaning
The data pre-processing step performs the necessary cleaning on the collected dataset. The previously collected dataset has some key attributes: text (the text of the tweet itself), created_at (the date of creation), favorite_count and retweet_count (the number of favourites and retweets), and favourited and retweeted (booleans stating whether the authenticated user has favourited or retweeted this tweet), among others. We applied an extensive set of pre-processing steps to reduce the size of the feature set and make it suitable for learning algorithms. The cleaning method is based on dictionary methods.
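As a small sketch, those key attributes can be pulled out of a tweet's raw JSON dict before cleaning; the field names follow Twitter's standard tweet object, while the helper name is ours:

def extract_fields(tweet):
    # tweet is a dict such as status._json from Tweepy.
    return {
        "text": tweet.get("text"),
        "created_at": tweet.get("created_at"),
        "favorite_count": tweet.get("favorite_count", 0),
        "retweet_count": tweet.get("retweet_count", 0),
        "favorited": tweet.get("favorited", False),
        "retweeted": tweet.get("retweeted", False),
    }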
Data obtained from Twitter usually contains a lot of HTML entities such as &lt;, &gt; and &amp; that get embedded in the original data. It is thus necessary to get rid of these entities. One approach is to remove them directly with specific regular expressions. Here, we use Python's HTML parser module, which can convert these entities back to standard characters. For example, &lt; is converted to "<" and &amp; is converted to "&". After this, we remove the remaining special HTML characters and links. Decoding data is the process of transforming information from complex symbols into simple, easier-to-understand characters. The collected data uses different encodings such as Latin-1 and UTF-8.
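A sketch of this decoding step, using Python 3's html.unescape (Python 2 exposed the same behaviour via the HTMLParser module) together with a simple URL-stripping regular expression; the patterns are illustrative:

import html
import re

def decode_tweet(text):
    text = html.unescape(text)                          # "&lt;3 &amp; more" -> "<3 & more"
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # drop links
    return text.encode("utf-8", "ignore").decode("utf-8")

print(decode_tweet("I &lt;3 this &amp; that http://t.co/xyz"))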
In the Twitter datasets there is also other information such as retweets, hashtags, usernames and modified tweets. All of this is ignored and removed from the dataset.
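One possible way to strip those markers is shown below; the regular expressions are of our own choosing and only illustrate the idea:

import re

def strip_twitter_markup(text):
    text = re.sub(r"^RT\s+", "", text)        # leading retweet marker
    text = re.sub(r"@\w+:?", "", text)        # usernames / mentions (and a trailing colon)
    text = re.sub(r"#\w+", "", text)          # hashtags
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

print(strip_twitter_markup("RT @user: Loving the new update! #android"))
# -> "Loving the new update!"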
from nltk import word_tokenize
from nltk.corpus import wordnet
from nltk.corpus import words
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag, pos_tag_sents
# import for bag of words
import numpy as np
# for regular expressions
import re
# TextBlob dependency
from textblob import TextBlob
from textblob import Word
# set to string
from ast import literal_eval
# local src dependency
from sentencecounter import no_sentences, getline, gettempwords
import os

def getsysets(word):
    syns = wordnet.synsets(word)  # wordnet from nltk.corpus will not work with textblob
    # print(syns[0].name())
    # print(syns[0].lemmas()[0].name())  # get synset names
    # print(syns[0].definition())        # definition
    # print(syns[0].examples())          # example

# getsysets("good")

def getsynonyms(word):
    synonyms = []
    # antonyms = []
    for syn in wordnet.synsets(word):
        for l in syn.lemmas():
            synonyms.append(l.name())
            # if l.antonyms():
            #     antonyms.append(l.antonyms()[0].name())
    # print(set(synonyms))
    return set(synonyms)
    # print(set(antonyms))

# getsynonyms_and_antonyms("good")

def extract_words(sentence):
    ignore_words = ['a']
    words = re.sub(r"[^\w]", " ", sentence).split()  # nltk.word_tokenize(sentence)
    words_cleaned = [w.lower() for w in words if w not in ignore_words]
    return words_cleaned

def tokenize_sentences(sentences):
    words = []
    for sentence in sentences:
        w = extract_words(sentence)
        words.extend(w)
    words = sorted(list(set(words)))
    return words

def bagofwords(sentence, words):
    sentence_words = extract_words(sentence)
    # frequency word count
    bag = np.zeros(len(words))
    for sw in sentence_words:
        for i, word in enumerate(words):
            if word == sw:
                bag[i] += 1
    return np.array(bag)

def tokenizer(sentences):
    token = word_tokenize(sentences)
    print("#" * 100)
    print(sent_tokenize(sentences))
    print(token)
    print("#" * 100)
    return token

# sentences = "Machine learning is great", "Natural Language Processing is a complex field", "Natural Language Processing is used in machine learning"
# vocabulary = tokenize_sentences(sentences)
# print(vocabulary)
# tokenizer(sentences)

def createposfile(filename, word):
    # filename = input("Enter destination file name in string format :")
    f = open(filename, 'w')
    f.writelines(word + '\n')

def createnegfile(filename, word):
    # filename = input("Enter destination file name in string format :")
    f = open(filename, 'w')
    f.writelines(word)

def getsortedsynonyms(word):
    sortedsynonyms = sorted(getsynonyms(word))
    return sortedsynonyms

def getlengthofarray(word):
    return getsortedsynonyms(word).__len__()

def readposfile():
    f = open('list of positive words.txt')
    return f

# An earlier recursive variant also consulted the negative word list:
# def searchword(word, sourcename):
#     if word in open('list of negative words.txt').read():
#         createnegfile('destinationposfile.txt', word)
#     elif word in open('list of positive words.txt').read():
#         createposfile('destinationnegfile.txt', word)
#     else:
#         for i in range(0, getlengthofarray(word)):
#             searchword(getsortedsynonyms(word)[i], sourcename)

def searchword(word, srcfile):
    # if word in open('list of negative words.txt').read():
    #     createnegfile('destinationposfile.txt', word)
    if word in open('list of positive words.txt').read():
        createposfile('destinationnegfile.txt', word)
    else:
        for i in range(0, getlengthofarray(word)):
            searchword(sorted(getsynonyms(word))[i], srcfile)
        f = open(srcfile, 'w')
        f.writelines(word)

print('#' * 50)
# searchword('lol', 'a.txt')
print(readposfile())
# tokenizer(sentences)
# getsynonyms('good')
# print(sorted(getsynonyms('good'))[2])  # pick one element of the synonym list (here the 3rd)
print('#' * 50)
# print(getsortedsynonyms('bad').__len__())
# createposfile('created.txt', 'lol')
# for word in word_tokenize(getline()):
#     searchword(word, 'a.txt')

Stop words are generally thought of as a "single set of words" that carry little meaning, and we do not want them taking up space in our database, so we remove them using NLTK and a stop-word dictionary. All punctuation marks should be dealt with according to their priority: for example, ".", "," and "?" are important punctuation marks that should be retained, while others need to be removed. In the Twitter datasets there is also other information such as retweets, hashtags, usernames and modified tweets; all of this is ignored and removed from the dataset. We should also remove duplicates, which we already did. Sometimes it is better to remove duplicate data based on a set of unique identifiers: for example, the chances of two transactions happening at the same time, with the same square footage, the same price, and the same build year are close to zero.
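A short sketch of the stop-word and punctuation filtering described above, assuming NLTK's English stop-word list (the "stopwords" and "punkt" corpora must be downloaded once via nltk.download):

import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
keep_punct = {".", ",", "?"}                       # punctuation worth retaining
drop_punct = set(string.punctuation) - keep_punct  # everything else goes

def remove_stopwords(text):
    tokens = word_tokenize(text.lower())
    return [t for t in tokens if t not in stop_words and t not in drop_punct]

print(remove_stopwords("This is not the worst phone, is it?"))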
Thank you for reading.
I hope you found this data cleaning guide helpful. Please leave any comments to let us know your thoughts.
To read the previous part of the series:
https://medium.com/@sayanmondal2098/sentimental-analysis-of-twitter-emoji-64432793b76f
Translated from: https://medium.com/swlh/twitter-data-cleaning-and-preprocessing-for-data-science-3ca0ea80e5cd