
ML H-Clusters: Document Clustering of the Top 100 Movies with the H-Clusters (Hierarchical Clustering) Algorithm on a Movie Dataset

Published: 2025/3/21


Output

Implementation code

Output

First, a look at the output (the code below saves its dendrogram as ward_clusters.png).

Implementation code

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import nltk
from bs4 import BeautifulSoup
import re
import os
import codecs
from sklearn import feature_extraction

# import three lists: titles, links and wikipedia synopses
titles = open('document_cluster_master/title_list.txt').read().split('\n')
# ensures that only the first 100 are read in
titles = titles[:100]

links = open('document_cluster_master/link_list_imdb.txt').read().split('\n')
links = links[:100]

synopses_wiki = open('document_cluster_master/synopses_list_wiki.txt').read().split('\n BREAKS HERE')
synopses_wiki = synopses_wiki[:100]

synopses_clean_wiki = []
for text in synopses_wiki:
    # strips html formatting and converts to unicode
    text = BeautifulSoup(text, 'html.parser').getText()
    synopses_clean_wiki.append(text)

synopses_wiki = synopses_clean_wiki

genres = open('document_cluster_master/genres_list.txt').read().split('\n')
genres = genres[:100]

print(str(len(titles)) + ' titles')
print(str(len(links)) + ' links')
print(str(len(synopses_wiki)) + ' synopses')
print(str(len(genres)) + ' genres')

synopses_imdb = open('document_cluster_master/synopses_list_imdb.txt').read().split('\n BREAKS HERE')
synopses_imdb = synopses_imdb[:100]

synopses_clean_imdb = []
for text in synopses_imdb:
    # strips html formatting and converts to unicode
    text = BeautifulSoup(text, 'html.parser').getText()
    synopses_clean_imdb.append(text)

synopses_imdb = synopses_clean_imdb

# concatenate the Wikipedia and IMDb synopsis of each movie
synopses = []
for i in range(len(synopses_wiki)):
    item = synopses_wiki[i] + synopses_imdb[i]
    synopses.append(item)

# generates index for each item in the corpora (in this case it's just rank)
# and I'll use this for scoring later
ranks = []
for i in range(0, len(titles)):
    ranks.append(i)

# Define some functions for processing the synopses. First, load NLTK's
# English stop word list. Stop words are words like "a", "the", or "in"
# that carry little meaning. I'm sure there are better explanations.
# (Assumes the NLTK stopwords and tokenizer data have already been downloaded.)
# load nltk's English stopwords as variable called 'stopwords'
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords[:10])  # take a look

# Next, import the Snowball stemmer from NLTK. Stemming reduces a word to
# its root, which in effect groups together English words that look alike.
# load nltk's SnowballStemmer as variable 'stemmer'
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

# tokenize_and_stem: tokenizes (splits the synopsis into a list of
#   individual words, or tokens) and also stems each token
# tokenize_only: tokenizes only

# Here I define a tokenizer and stemmer that return the stems of the given text
def tokenize_and_stem(text):
    # first tokenize by sentence, then by word, so punctuation becomes its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

def tokenize_only(text):
    # first tokenize by sentence, then by word, so punctuation becomes its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

# Use the two functions above to iterate over the list of synopses and build
# two vocabularies: one stemmed and one only tokenized.
# Not pythonic at all!
# Extending the lists yields one very large flat (one-dimensional) vocabulary.
totalvocab_stemmed = []
totalvocab_tokenized = []
for i in synopses:
    allwords_stemmed = tokenize_and_stem(i)  # tokenize and stem each movie's synopsis
    totalvocab_stemmed.extend(allwords_stemmed)  # extend the 'totalvocab_stemmed' list
    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)

# A lookup table from stem to original token. Going from a stem back to
# tokens is one to many: the stem 'run' can be associated with 'run',
# 'runs', 'running', etc.
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index=totalvocab_stemmed)
print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')
print(vocab_frame.head())

# Compute text similarity with tf-idf. With the tf-idf matrix you can run a
# range of clustering algorithms to better understand the hidden structure
# within the synopses.
from sklearn.feature_extraction.text import TfidfVectorizer

# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(synopses)  # vectorize the synopses
print(tfidf_matrix.shape)  # (100, 563) in the original run: 100 movies, 563 features each

# 'terms' is just the list of features in the tf-idf matrix, i.e. a vocabulary.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
terms = tfidf_vectorizer.get_feature_names_out()

# dist is defined as 1 minus the cosine similarity of each pair of documents.
# Cosine similarity is computed on the tf-idf matrix and measures how similar
# two documents (synopses) in the corpus are. Subtracting it from 1 gives a
# cosine distance that can later be plotted on a Euclidean (2-D) plane.
# dist can be used to evaluate the similarity of any two or more synopses.
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)

# 2. Hierarchical document clustering with the H-Clustering algorithm
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import ward, dendrogram

# Run Ward clustering on the cosine distance matrix computed above and store
# the result as linkage_matrix. Note: SciPy reads a 2-D array as observation
# vectors, so each row of dist is treated as a feature vector here.
linkage_matrix = ward(dist)

fig, ax = plt.subplots(figsize=(15, 20))  # set size
ax = dendrogram(linkage_matrix, orientation="right", labels=titles)

plt.tick_params(
    axis='x',           # changes apply to the x-axis
    which='both',       # both major and minor ticks are affected
    bottom=False,       # ticks along the bottom edge are off (the string 'off' no longer works)
    top=False,          # ticks along the top edge are off
    labelbottom=False)  # labels along the bottom edge are off

plt.tight_layout()  # show plot with tight layout

# save the figure as ward_clusters.png
plt.savefig('ward_clusters.png', dpi=200)
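The pipeline above (tf-idf weighting, cosine distance, then bottom-up merging of the closest groups) can be sketched on a toy corpus using only the standard library. This is a minimal illustration under stated assumptions, not the article's implementation: the helper names `tfidf_vectors`, `cosine_distance`, and `agglomerate` and the four toy synopses are hypothetical, tokenization is a plain whitespace split with no stemming, the weight is a simple tf × (ln(N/df) + 1) rather than the exact `TfidfVectorizer` formula, and average linkage is swapped in for Ward linkage to keep the code short.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * (ln(N/df) + 1)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: the number of documents each term appears in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * (math.log(n_docs / df[t]) + 1.0) for t, c in tf.items()})
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def agglomerate(dist, n_clusters):
    """Repeatedly merge the two closest clusters (average linkage, NOT Ward)."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average distance over all cross-cluster document pairs
                pairs = [dist[p][q] for p in clusters[i] for q in clusters[j]]
                avg = sum(pairs) / len(pairs)
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# hypothetical toy corpus: two crime synopses, two science-fiction synopses
toy_synopses = [
    "mafia family crime boss",
    "crime boss mafia war",
    "space ship alien planet",
    "alien planet space battle",
]
vectors = tfidf_vectors(toy_synopses)
dist = [[cosine_distance(a, b) for b in vectors] for a in vectors]
clusters = agglomerate(dist, n_clusters=2)
print(clusters)
```

Documents that share no terms end up at distance 1, so the two crime synopses and the two science-fiction synopses are merged within their own groups first. Ward linkage, as used in the article, instead picks the merge that least increases within-cluster variance, so its merge order can differ on real data.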

Related article
Document Clustering with Python
