A step-by-step example: using Python regular expressions to extract specific content and write it to CSV
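Background and goal
This time I want to process a txt file containing the output of one machine pinging another at regular intervals. I want to extract the timestamp and the rtt values, and then write the result to a CSV file.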
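1. Identify the content to extract and write the regular expression
The text to extract from is the ping output; a sample of it is included in the test code in this step.
The first step is to write the regular expression. There is no need to read the data file yet: copy part of the data into a string first, which makes testing easier.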
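Writing the regular expression uses the re module. Everyone's text is different, so you will need to learn the basic usage yourself. For details on using re, see this article:
https://zhuanlan.zhihu.com/p/139596371
The key is to understand what .*? and {} do, and that re.S lets . match newline characters as well. Once that is clear, writing a correct expression is fairly easy.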
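Here is a small sketch of what those three pieces do. The sample strings and variable names are made up for illustration and are not part of the ping log.

```python
import re

# Hypothetical two-line sample text, used only for illustration.
text = "start A end\nstart B end"

# {n} repeats the preceding item exactly n times.
print(re.findall(r"\d{4}-\d{2}", "2022-03-11"))  # ['2022-03']

# .* is greedy and runs to the last " end";
# .*? is non-greedy and stops at the first " end".
one_line = "start A end start B end"
greedy = re.findall(r"start (.*) end", one_line)
lazy = re.findall(r"start (.*?) end", one_line)
print(greedy)  # ['A end start B']
print(lazy)    # ['A', 'B']

# Without re.S, "." does not match "\n", so a match cannot span lines.
no_flag = re.findall(r"A(.*?)B", text)
with_flag = re.findall(r"A(.*?)B", text, re.S)
print(no_flag)    # []
print(with_flag)  # [' end\nstart ']
```

In the ping log the timestamp and the rtt line sit on different lines, which is why the combined pattern needs both the non-greedy .*? and the re.S flag.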
import re

# For easier testing, part of the text is placed in a string first
data = '''
2022-03-11 15:21:48 1
PING 81.71.51.181 (81.71.51.181) 56(84) bytes of data.
64 bytes from 81.71.51.181: icmp_seq=1 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=2 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=3 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=4 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=5 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=6 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=7 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=8 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=9 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=10 ttl=45 time=253 ms

--- 81.71.51.181 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9000ms
rtt min/avg/max/mdev = 250.203/250.563/253.202/0.961 ms
2022-03-11 15:22:40 2
PING 81.71.51.181 (81.71.51.181) 56(84) bytes of data.
64 bytes from 81.71.51.181: icmp_seq=1 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=2 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=3 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=4 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=5 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=6 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=7 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=8 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=9 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=10 ttl=45 time=250 ms

--- 81.71.51.181 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9009ms
rtt min/avg/max/mdev = 250.181/250.256/250.434/0.636 ms
2022-03-11 15:23:44 3
PING 81.71.51.181 (81.71.51.181) 56(84) bytes of data.
64 bytes from 81.71.51.181: icmp_seq=1 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=2 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=3 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=4 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=5 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=6 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=7 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=8 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=9 ttl=45 time=250 ms
64 bytes from 81.71.51.181: icmp_seq=10 ttl=45 time=250 ms

--- 81.71.51.181 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9009ms
rtt min/avg/max/mdev = 250.209/250.320/250.658/0.563 ms
'''

# Extract the time
print(re.findall(r'(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2})', data))
# Extract the rtt
print(re.findall(r'mdev = (.*?) ms', data))
# Extract the time and the rtt together; re.S lets .*? cross newlines
print(re.findall(r'(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}).*?mdev = (.*?) ms', data, re.S))

Output:
D:\python37\python.exe D:/test/data_process.py
['2022-03-11 15:21', '2022-03-11 15:22', '2022-03-11 15:23']
['250.203/250.563/253.202/0.961', '250.181/250.256/250.434/0.636', '250.209/250.320/250.658/0.563']
[('2022-03-11 15:21', '250.203/250.563/253.202/0.961'), ('2022-03-11 15:22', '250.181/250.256/250.434/0.636'), ('2022-03-11 15:23', '250.209/250.320/250.658/0.563')]

Process finished with exit code 0

2. Read the data from the file
Once the regular expression is correct, the data can be read from the file.
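import re

# Read the file; the with block closes it automatically, so no close() call is needed
with open("ping/ping_flkf_gz.txt", "r") as input_file:
    data = input_file.read()

# Extract the time and the latency; re.S lets .*? cross newlines
print(re.findall(r'(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}).*?mdev = (.*?) ms', data, re.S))

The output is long, so only part of it is shown: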
D:\python37\python.exe D:/test/data_process.py
[('2022-03-11 15:21', '250.203/250.563/253.202/0.961'), ('2022-03-11 15:22', '250.181/250.256/250.434/0.636'), ('2022-03-11 15:23', '250.209/250.320/250.658/0.563'), ('2022-03-11 15:25', '250.183/250.240/250.275/0.225'), ('2022-03-11 15:26', '250.217/250.240/250.300/0.592'), ('2022-03-11 15:27', '250.166/250.362/250.956/0.683'), ('2022-03-11 15:28', '250.186/250.256/250.343/0.319'), ('2022-03-11 15:29', '250.181/250.435/252.077/0.776'), ('2022-03-11 15:30', '250.177/250.249/250.401/0.673'), ('2022-03-11 15:31', '250.210/250.436/251.498/0.376'), ('2022-03-11 15:32', '250.207/250.280/250.588/0.401'), ('2022-03-11 15:33', '250.237/250.336/250.747/0.568'), ('2022-03-11 15:34', '250.217/250.283/250.437/0.675'), ('2022-03-11 15:35', '250.254/250.456/251.092/0.623'), ('2022-03-11 15:36', '250.167/250.236/250.308/0.226'), ('2022-03-11 15:37', '250.162/250.399/251.032/0.667'), ('2022-03-11 15:38', '250.207/250.261/250.406/0.053'), ('2022-03-11 15:39', '250.219/250.657/252.056/0.878')]

This is a list, and each tuple in it holds one time and rtt pair that I extracted.
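3. Write to CSV
With the input file read correctly and the data extracted, the next step is to write the result to a CSV file, which is where the csv module comes in.
A for loop iterates over the list and calls csv_writer.writerow to write it to the CSV file one row at a time.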
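A minimal sketch of this step follows. The list name `results` and the output file name `ping_result.csv` are placeholders of mine (the code for this step is not shown in the post); the header row uses the time and latency column names that appear in the resulting CSV.

```python
import csv

# Placeholder data in the same shape as the re.findall result from step 2.
results = [
    ("2022-03-11 15:21", "250.203/250.563/253.202/0.961"),
    ("2022-03-11 15:22", "250.181/250.256/250.434/0.636"),
]

# newline="" keeps the csv module from emitting blank lines between rows on Windows.
with open("ping_result.csv", "w", newline="") as output_file:
    csv_writer = csv.writer(output_file)
    csv_writer.writerow(["time", "latency"])  # header row
    for row in results:
        csv_writer.writerow(row)  # one (time, rtt) tuple per CSV row
```

Because each tuple already has two elements, it can be passed to writerow directly without unpacking.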
The results are now in the CSV file:
time,latency
2022-03-11 15:21,250.203/250.563/253.202/0.961
2022-03-11 15:22,250.181/250.256/250.434/0.636
2022-03-11 15:23,250.209/250.320/250.658/0.563
2022-03-11 15:25,250.183/250.240/250.275/0.225
2022-03-11 15:26,250.217/250.240/250.300/0.592
2022-03-11 15:27,250.166/250.362/250.956/0.683
2022-03-11 15:28,250.186/250.256/250.343/0.319
2022-03-11 15:29,250.181/250.435/252.077/0.776
2022-03-11 15:30,250.177/250.249/250.401/0.673
2022-03-11 15:31,250.210/250.436/251.498/0.376

4. Storing each value in its own column
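At this point the latency column looks like this: 250.203/250.563/253.202/0.961
To make later processing easier, each value should get its own column, so the regular expression has to be modified.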
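One way to do this, as a sketch: give each of the four numbers its own capture group. The pattern below is my assumption about how the expression could be changed, since the modified version is not shown in the post. The shortened sample in `data` follows the log format from step 1, and the names min, avg, max and mdev come from the `rtt min/avg/max/mdev` line of the ping output.

```python
import re

# Shortened sample in the same format as the ping log.
data = """2022-03-11 15:21:48
rtt min/avg/max/mdev = 250.203/250.563/253.202/0.961 ms
2022-03-11 15:22:40
rtt min/avg/max/mdev = 250.181/250.256/250.434/0.636 ms
"""

# Five groups: the timestamp, then min, avg, max and mdev separated by "/".
pattern = (
    r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}).*?"
    r"mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms"
)
rows = re.findall(pattern, data, re.S)
for row in rows:
    print(row)
# ('2022-03-11 15:21', '250.203', '250.563', '253.202', '0.961')
# ('2022-03-11 15:22', '250.181', '250.256', '250.434', '0.636')
```

Each tuple now has five fields, so the header row written to the CSV needs five names as well, for example time, min, avg, max, mdev.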
Written to the CSV file, each value then sits in its own column.
And that completes the task.