Python Web Scraping Tutorial: Collecting QQ Group Member Information with Python
This tutorial uses Python and Selenium to load and save the members of a QQ group, leaving out the group owner and the administrators.
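Basic approach
- Simulating the login page
  - Page analysis
  - Code implementation
- Selecting the group to load
  - Page analysis
  - Code implementation
- Saving the required information
  - Page analysis
  - Code implementation
- Complete code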
Simulating the login page
Page analysis
Approach:
- Click the login button
- Select the account to log in with
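The second step, picking an account by name, can be sketched without a browser. This is a minimal sketch, not the author's code: `choose_by_name`, `ask`, and `max_tries` are hypothetical names introduced here, and the re-prompting on a mistyped name is an addition (a plain dictionary lookup would raise `KeyError` instead).

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def choose_by_name(
    options: Dict[str, T],
    ask: Callable[[str], str] = input,
    max_tries: int = 3,
) -> T:
    """Return the value whose key the user types, re-prompting on bad input.

    `options` maps a display name to any object; `ask` is the prompt
    function (hypothetical parameter, defaults to the built-in input).
    """
    for _ in range(max_tries):
        name = ask("Account to log in with: ").strip()
        if name in options:
            return options[name]
        # Show the valid keys so the user can retype one of them
        print(f"{name!r} is not in the list, choose one of: {', '.join(options)}")
    raise ValueError(f"no valid choice after {max_tries} attempts")
```

In a Selenium script, `options` would presumably map each account's displayed nickname to its page element, and the returned element would then be clicked.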
Code implementation

    # Author:smart_num_1
    # Blog:https://blog.csdn.net/smart_num_1
    # WeChat:Be_a_lucky_dog
    import time

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait


    def login(driver=None):
        # Maps each QQ account already signed in on this computer to its page element
        already_dic = {}
        login_button = WebDriverWait(driver=driver, timeout=100).until(
            EC.presence_of_element_located((By.XPATH, '//p[@class="user-info"]/a')))
        # Click "log in" to list the QQ accounts signed in on this computer
        login_button.click()
        already_login_number = WebDriverWait(driver=driver, timeout=100).until(
            EC.presence_of_element_located((By.XPATH, '//div[@id="loginWin"]/iframe')))
        # The login box is a sub-page (iframe) of the parent page, so open its src directly
        driver.get(url=already_login_number.get_attribute('src'))
        # Collect the accounts signed in on this computer
        already_login_numbers = WebDriverWait(driver=driver, timeout=100).until(
            EC.presence_of_all_elements_located((By.XPATH, '//span[contains(@class,"nick")]')))
        print('Choose one of the following accounts')
        for already_login_number in already_login_numbers:
            already_dic[already_login_number.get_attribute('innerText')] = already_login_number
            print(already_login_number.get_attribute('innerText'))
        QQ_NeedToLogin = input('Account to log in with: ')
        # Look up the element by its key in already_dic and click it to log in
        already_dic[QQ_NeedToLogin].click()
        time.sleep(1)


    if __name__ == '__main__':
        start_url = 'https://qun.qq.com/index.html#click'  # group home page, used for logging in
        # Selenium needs a browser and a browser driver. Pointing at the driver file
        # directly avoids having to configure environment variables.
        # Selenium 4 style: the original passed executable_path straight to webdriver.Chrome,
        # which newer Selenium 4 releases no longer accept.
        driver = webdriver.Chrome(service=Service(executable_path='./chromedriver.exe'))
        driver.get(url=start_url)
        login(driver=driver)

Selecting the group to load
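Page analysis
Opening the group management page shows the list of groups the account has joined. The goal is to scrape the member information of one of those groups.
Code implementation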

    # Author:smart_num_1
    # Blog:https://blog.csdn.net/smart_num_1
    # WeChat:Be_a_lucky_dog
    def get_group_number(driver=None):
        # As before, use a dictionary to store the information
        group_number_dic = {}
        # Get the list node of every group the account has joined
        my_group_list = WebDriverWait(driver=driver, timeout=100).until(
            EC.presence_of_all_elements_located((By.XPATH, '//ul[@class="my-group-list"]/li')))
        print('Choose one of the following groups:')
        i = 1
        for my_group in my_group_list:
            try:
                group_number_dic[str(i)] = my_group
                print('No. %s --- ' % str(i) + my_group.get_attribute('title')
                      + ' ' + my_group.get_attribute('data-id'))
                i += 1
            except Exception:
                continue
        # All groups have been printed; ask which one to open
        group = input('Group number: ')
        # Look up the element by its key and click the target group
        group_number_dic[group].click()
        return driver


    if __name__ == '__main__':
        # `driver` is the logged-in driver created in the previous step
        member_url_test = 'https://qun.qq.com/member.html'
        driver.get(url=member_url_test)
        driver = get_group_number(driver=driver)

Saving the required information
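Page analysis
The member list is loaded dynamically. Because we are driving a real browser with Selenium, there is no need to work out which URL the page requests to fetch the data; simulating scrolling is enough to make the page load more members.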
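The scrolling idea can be sketched independently of the browser. The tutorial's own function scrolls a fixed ten times; the variant below is a swapped-in alternative that keeps scrolling until the row count stops growing. `scroll_until_stable`, `scroll`, `count_rows`, `pause`, and `max_rounds` are hypothetical names introduced for illustration, and a slow network could make the list look complete too early, so treat the stopping rule as an assumption.

```python
import time
from typing import Callable


def scroll_until_stable(
    scroll: Callable[[], None],
    count_rows: Callable[[], int],
    pause: float = 0.5,
    max_rounds: int = 50,
) -> int:
    """Scroll repeatedly until the row count stops growing; return the final count."""
    previous = count_rows()
    for _ in range(max_rounds):
        scroll()
        time.sleep(pause)  # give the page time to append the next batch
        current = count_rows()
        if current == previous:  # nothing new arrived, assume the list is complete
            break
        previous = current
    return previous
```

With Selenium, `scroll` could wrap a `driver.execute_script(...)` call that scrolls the page down, and `count_rows` could return `len(driver.find_elements(By.XPATH, '//tr[contains(@class,"mb")]'))`. The `max_rounds` cap keeps the loop bounded for very large groups.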
Code implementation
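
    # Author:smart_num_1
    # Blog:https://blog.csdn.net/smart_num_1
    # WeChat:Be_a_lucky_dog
    def get_group_member(driver=None):
        # Refresh in case the page did not update after the click in the previous step
        driver.refresh()
        # Wait for an element to appear so we know the page has loaded;
        # which element to wait for is an arbitrary choice
        elem_end = WebDriverWait(driver=driver, timeout=100).until(
            EC.presence_of_element_located((By.XPATH, '//td[@class="td-user-nick"]/img')))
        # The number of scrolls is arbitrary. Each scroll loads 21 more members, and in
        # the groups I have joined, members beyond ten scrolls are mostly inactive,
        # so it does not matter much whether they are collected.
        for i in range(10):
            time.sleep(0.5)
            driver.execute_script("var action=document.documentElement.scrollTop=10000")
            print('Loading...')
        # find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which Selenium 4 removed
        group_members = driver.find_elements(By.XPATH, '//tr[contains(@class,"mb")]')
        for group_member in group_members:
            try:
                # The row text holds, in order: member, group nickname, QQ number, gender,
                # QQ age, join date, level (points), last active. Only the QQ number is
                # needed here; add code for the other fields as required.
                data = group_member.text.split('\n')[2].split(' ')[0]
                if data.isdigit():
                    with open('./record.txt', 'a', encoding='utf-8') as record:
                        record.write(data + '@qq.com')
                        record.write('\n')
            except Exception:
                continue
        print('Loaded')

Complete code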

    # Author:smart_num_1
    # Blog:https://blog.csdn.net/smart_num_1
    # WeChat:Be_a_lucky_dog
    import os
    import random
    import time

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait


    def get_group_member(driver=None):
        driver.refresh()
        elem_end = WebDriverWait(driver=driver, timeout=100).until(
            EC.presence_of_element_located((By.XPATH, '//td[@class="td-user-nick"]/img')))
        for i in range(10):
            time.sleep(0.5)
            driver.execute_script("var action=document.documentElement.scrollTop=10000")
            print('Loading...')
        group_members = driver.find_elements(By.XPATH, '//tr[contains(@class,"mb")]')
        for group_member in group_members:
            try:
                data = group_member.text.split('\n')[2].split(' ')[0]
                if data.isdigit():
                    with open('./record.txt', 'a', encoding='utf-8') as record:
                        record.write(data + '@qq.com')
                        record.write('\n')
            except Exception:
                continue
        print('Loaded')


    def get_group_number(driver=None):
        group_number_dic = {}
        my_group_list = WebDriverWait(driver=driver, timeout=100).until(
            EC.presence_of_all_elements_located((By.XPATH, '//ul[@class="my-group-list"]/li')))
        print('Choose one of the following groups:')
        i = 1
        for my_group in my_group_list:
            try:
                group_number_dic[str(i)] = my_group
                print('No. %s --- ' % str(i) + my_group.get_attribute('title')
                      + ' ' + my_group.get_attribute('data-id'))
                i += 1
            except Exception:
                continue
        group = input('Group number: ')
        group_number_dic[group].click()
        return driver


    def login(driver=None):
        already_dic = {}
        login_button = WebDriverWait(driver=driver, timeout=100).until(
            EC.presence_of_element_located((By.XPATH, '//p[@class="user-info"]/a')))
        login_button.click()
        already_login_number = WebDriverWait(driver=driver, timeout=100).until(
            EC.presence_of_element_located((By.XPATH, '//div[@id="loginWin"]/iframe')))
        driver.get(url=already_login_number.get_attribute('src'))
        already_login_numbers = WebDriverWait(driver=driver, timeout=100).until(
            EC.presence_of_all_elements_located((By.XPATH, '//span[contains(@class,"nick")]')))
        print('Choose one of the following accounts')
        for already_login_number in already_login_numbers:
            already_dic[already_login_number.get_attribute('innerText')] = already_login_number
            print(already_login_number.get_attribute('innerText'))
        QQ_NeedToLogin = input('Account to log in with: ')
        already_dic[QQ_NeedToLogin].click()
        time.sleep(1)


    def start(driver=None, url=None):
        print('Please wait for loading\n')
        driver.get(url=url)
        driver = get_group_number(driver=driver)
        print('Please wait for loading\n')
        get_group_member(driver=driver)


    if __name__ == '__main__':
        print('Please wait for loading')
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        try:
            random.seed(time.time())
            QQ_number = '738334209'
            start_url = 'https://qun.qq.com/index.html#click'
            # member_url is defined in the original but not used below
            member_url = 'https://qun.qq.com/member.html#gid=%s' % QQ_number
            member_url_test = 'https://qun.qq.com/member.html'
            # Selenium 4 style; the original passed executable_path and chrome_options directly
            driver = webdriver.Chrome(
                service=Service(executable_path='./chromedriver.exe'),
                options=chrome_options)
            try:
                driver.get(url=start_url)
                login(driver=driver)
                while True:
                    start(driver=driver, url=member_url_test)
                    flag = input('Continue scraping? yes or no : ')
                    if flag == 'no':
                        break
                    os.system('cls')  # clear the Windows console before the next round
                driver.quit()
            except Exception:
                print('Something wrong')
                driver.quit()
        except Exception:
            print('Something wrong!!!!!!')
        os.system('pause')  # keep the Windows console open until a key is pressed

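The row-parsing step inside `get_group_member` can also be isolated as a pure function, which makes it easy to check without a browser. This is a sketch under stated assumptions: `extract_qq_number` and `to_unique_emails` are hypothetical helpers, the indexing (third line of the row text, first space-separated token) simply mirrors the expression used in the code above, and the duplicate filtering is an addition that the original does not perform.

```python
from typing import Iterable, List, Optional


def extract_qq_number(row_text: str) -> Optional[str]:
    """Return the QQ number from one table row's text, or None if it is absent."""
    lines = row_text.split("\n")
    if len(lines) < 3:
        return None  # row is too short to contain the expected third line
    candidate = lines[2].split(" ")[0]
    return candidate if candidate.isdigit() else None


def to_unique_emails(row_texts: Iterable[str]) -> List[str]:
    """Map rows to '<number>@qq.com', skipping unparsable rows and duplicates."""
    seen = set()
    emails = []
    for text in row_texts:
        number = extract_qq_number(text)
        if number is None or number in seen:
            continue
        seen.add(number)
        emails.append(f"{number}@qq.com")
    return emails
```

Because the script opens `record.txt` in append mode, running it twice on the same group writes every address twice; collecting the addresses through a helper like `to_unique_emails` before writing is one way to avoid that within a single run.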