A Detailed Guide to Basic Python Regular Expression Operations

Published: 2023/12/31

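What is a regular expression? It sounds deep, but it is really no big deal. Textbooks make it sound mysterious: they hand you a table, walk through it one character at a time, and finish with a block of sample code, which is a real chore. I only got it after watching a few videos on Bilibili. Regular expressions get messy once several pieces are strung together, and if so many web scrapers did not rely on them I would not bother learning them. Here is my take.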

Reference video: https://www.bilibili.com/video/BV1xs411x71b?from=search&seid=11095771910372981077
You can watch the video directly. It is not my own work; the link is above.

Python Regular Expressions Explained

Author: xiuci

Basics

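The basic use of a regular expression is to find certain pieces of text inside a long document. The full name is Regular Expression, RegEx for short, so searching Baidu for "Regex" will also turn up regular expressions.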

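To start, here is a poem, poem.txt, roughly a hundred lines long.

A quick introduction: the poem is The Man from Snowy River by Banjo Paterson, a well-known Australian writer. Now, suppose I want to count how many times "to" appears in it. How?

It is not much trouble; one line does it.
Let's get started.
First, read the file into the string text: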

import re

text = ''
file = open("poem.txt")
for line in file:  # append the file to text line by line
    text = text + line
file.close()
print(text)
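For comparison, here is a minimal sketch of the same loading step written with a with-statement. The sample text and the temporary file are hypothetical stand-ins so the snippet runs without the real poem.txt, and UTF-8 encoding is an assumption.

```python
import os
import tempfile

# Hypothetical stand-in for poem.txt so the sketch runs on its own.
sample = "There was movement at the station,\nfor the word had passed around\n"
path = os.path.join(tempfile.mkdtemp(), "poem.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# The with-statement closes the file automatically, even if reading fails.
with open(path, encoding="utf-8") as f:
    text = f.read()

print(len(text.splitlines()))  # number of lines read
```

Reading the whole file with read() gives the same string as appending line by line, with less code.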

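The text is loaded. Now, how do we search it? The module for regular expressions is named re, so just import it, as the code above already does.

result = re.findall(" to ", text)  # the spaces on both sides mark a whole word

This line searches text for " to ". The space on each side means only the standalone word matches.
The complete code: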

import re

text = ''
file = open("poem.txt")
for line in file:
    text = text + line
file.close()

result = re.findall(" to ", text)  # the spaces on both sides mark a whole word
print(len(result))

Output:

[Figure: result of the regex search for "to" with surrounding spaces]
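Requiring a literal space on both sides misses a "to" at the start or end of a line, or next to punctuation. As a rough sketch of one alternative, the word-boundary escape \b can stand in for the spaces. The sample text below is hypothetical, not the poem.

```python
import re

# Hypothetical sample; the article's poem.txt is not reproduced here.
text = "He went to town.\nTo stop was hard, so on to the top he rode to"

spaced = re.findall(" to ", text)      # needs a literal space on both sides
bounded = re.findall(r"\bto\b", text)  # \b matches at a word boundary instead
any_case = re.findall(r"\bto\b", text, flags=re.IGNORECASE)

print(len(spaced), len(bounded), len(any_case))  # 2 3 4
```

The "to" inside "town", "stop" and "top" is never counted, because there is no word boundary on both sides of it.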

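Next I want to find three-character strings that start with a. (VS Code can also search with regular expressions: press Ctrl+H, then click the small regex button.)

The regular expression looks like this:

re.findall("a..", text)

Each . can be any character except a newline. Swap this in for the previous pattern, drop the len(), and run it: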

PS E:\ProgramThomas\Coding-Notes\Python-Notes\Regex> python regex.py ['an ', 'anj', 'ate', 'as ', 'at ', 'ati', 'ad ', 'ass', 'aro', 'at ', 'ad ', 'awa', 'ad ', 'as ', 'a t', 'and', 'all', 'ack', 'ad ', 'ath', 'ay.', 'and', 'ati', 'ar ', 'and', 'ad ', 'at ', 'ad ', 'ard', 'are', 'att', 'as ', 'arr', 'ade', 'ard', 'an ', 'air', 'as ', 'as ', 'as ', 'air', 'and', 'an ', 'anc', 'ame', 'a h', 'and', 'an ', 'add', 'and', 'arn', 'ain', 'as ', 'a s', 'a s', 'all', 'and', 'ast', 'as ', 'a r', 'ace', 'a t', 'art', 'at ', 'ast', 'as ', 'are', 'ain', 'as ', 'ard', 'and', 'and', 'at ', 'ay ', 'as ', 'age', 'ati', 'ad;', 'adg', 'ame', 'and', 'and', 'arr', 'age', 'ad.', 'and', 'ay,', 'an ', 'aid', 'at ', 'a l', 'and', 'all', 'ad,', 'awa', 'are', 'ar ', 'as ', 'ait', 'ad ', 'and', 'anc', 'aid', 'arr', 'ant', 'ant', 'at ', 'and', 'are', 'ain', 'ail', 'are', 'as ', 'and', 'as ', 'a h', 'an ', 'at ', 'ain', 'ake', 'ant', 'ave', 'any', 'am,', 'ave', 'a c', 'ace', 'awa', 'ard', 'ain', 'an ', 'ave', 'at ', 'anc', 'anc', 'and', 'ad,', 'and', 'ar ', 'as ', 'at ', 'ain', 'anc', 'as ', 'aci', 'and', 'ake', 'ace', 'ace', 'ast', 'and', 'ade', 'ang', 'as ', 'ace', 'ace', 'alt', 'a m', 'ade', 'ash', 'aw ', 'ain', 'arg', 'ath', 'a s', 'arp', 'and', 'ash', 'ain', 'ast', 'and', 'ack', 'ad,', 'and', 'ans', 'ack', 'and', 'ags', 'at ', 'ad.', 'ard', 'ard', 'ay,', 'ain', 'ash', 'and', 'ajo', 'an ', 'ay ', 'ay,', 'an ', 'an ', 'ach', 'ain', 'anc', 'a p', 'ake', 'ath', 'and', 'as ', 'at ', 'and', 'any', 'as ', 'ath', 'an ', 'ave', 'ad,', 'and', 'ave', 'a c', 'ace', 'ain', 'a t', 'and', 'atc', 'ar.', 'are', 'all', 'an ', 'at ', 'as ', 'and', 'at ', 'ain', 'an ', 'ark', 'and', 'apl', 'and', 'at ', 'a r', 'aci', 'ace', 'and', 'afe', 'and', 'at ', 'as ', 'amo', 'as ', 'atc', 'ain', 'and', 'aw ', 'as ', 'amo', 'ace', 'acr', 'ari', 'a m', 'ain', 'ang', 'a f', 'al ', 'als', 'a d', 'and', 'ant', 'aci', 'an ', 'at ', 'an ', 'and', 'am.', 'a b', 'ack', 'alt', 'and', 'ate', 'ads', 'alo', 'and', 'ass', 'ack', 
'ard', 'ain', 'arc', 'ais', 'a t', 'as ', 'as ', 'aun', 'and', 'age', 'as ', 'ain', 'a c', 'ad ', 'ais', 'and', 'att', 'air', 'ar ', 'as ', 'al,', 'and', 'ars', 'air', 'aze', 'and', 'aro', 'and', 'and', 'ain', 'are', 'an ', 'a h', 'ay,']

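That is a lot of matches, but many are not words. What is 'ay,' supposed to be? So we need to restrict the last two characters. Change the regular expression to:

re.findall("a[a-z][a-z]", text)

Now the last two characters must be lowercase letters a-z, so punctuation is ruled out. Many matches are still not whole words, 'ain' for example, so put a space on each side of the pattern: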

re.findall(" a.. ", text)

PS E:\ProgramThomas\Coding-Notes\Python-Notes\Regex> python regex.py
[' all ', ' and ', ' and ', ' and ', ' and ', ' are ', ' and ', ' and ', ' and ', ' and ', ' and ', ' and ', ' are ', ' and ', ' and ', ' are ', ' are ', ' and ', ' and ', ' and ', ' and ', ' and ', ' and ', ' and ', ' and ', ' and ', ' ash ', ' and ', ' and ', ' and ', ' and ', ' and ', ' and ', ' and ', ' and ', ' and ', ' and ', ' and ', ' and ', ' air ', ' and ', ' and ', ' and ', ' and ', ' are ']

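Far fewer now. But here is the next problem: we do not want the spaces in the results, right? So use parentheses to mark the part we want:

re.findall(" (a..) ", text)

Only the a.. part is returned; the spaces are dropped.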

['all', 'and', 'and', 'and', 'and', 'are', 'and', 'and', 'and', 'and', 'and', 'and', 'are', 'and', 'and', 'are', 'are', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'ash', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'air', 'and', 'and', 'and', 'and', 'are']

Not bad!

[Figure: example of the regex search]
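The four patterns tried so far can be compared side by side on a short sentence. This is only a sketch: the sentence is a made-up sample rather than a line from the poem.

```python
import re

# Hypothetical sample sentence standing in for the poem.
text = "The day was hot, and the old ant ran all day."

any_two = re.findall("a..", text)          # any two characters after the a
letters = re.findall("a[a-z][a-z]", text)  # two lowercase letters after the a
spaced = re.findall(" a.. ", text)         # the spaces become part of each match
grouped = re.findall(" (a..) ", text)      # only the parenthesised group is returned

print(any_two)  # ['ay ', 'as ', 'and', 'ant', 'an ', 'all', 'ay.']
print(letters)  # ['and', 'ant', 'all']
print(spaced)   # [' and ', ' ant ', ' all ']
print(grouped)  # ['and', 'ant', 'all']
```

When a pattern contains exactly one group, findall returns the text of that group rather than the whole match, which is why the spaces disappear in the last line.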

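The output contains a lot of repeated and's. How do we remove duplicates? When you want to drop repeats, use a set:

result = set(result)

PS E:\ProgramThomas\Coding-Notes\Python-Notes\Regex> python regex.py
{'All', 'ave', 'arp', 'ain', 'And', 'air', 'are', 'age', 'ame', 'ake', 'and', 'afe', 'ars', 'ace', 'any', 'ard', 'ast', 'ath', 'ash', 'all', 'ade', 'ads', 'ags', 'ant'}

Note that this printed set contains entries such as 'All', 'And' and 'afe', which " (a..) " cannot produce; it is the same set as the output of the [Aa] pattern further down. Deduplicating the list above leaves only all, and, are, ash and air.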

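Some words in the poem are not preceded by a space but are still words, such as And at the start of a line.
For that we use *, applied here to the leading space so that it may or may not be present (zero or more spaces). The improved regular expression:

result = re.findall(" *([Aa][a-z][a-z]) ", text)

[Aa] means the character can be either uppercase A or lowercase a. Deduplicate with a set again: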

{'All', 'ave', 'arp', 'ain', 'And', 'air', 'are', 'age', 'ame', 'ake', 'and', 'afe', 'ars', 'ace', 'any', 'ard', 'ast', 'ath', 'ash', 'all', 'ade', 'ads', 'ags', 'ant'}

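Now there is a problem: our earlier search never returned 'ace' or 'afe'. Let's see where afe appears in the file.

It turns out to be the tail of safe. Look at our pattern again: the * means the leading space is optional, so a match can begin in the middle of a word. So what can we do?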
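The leak is easy to reproduce on a small example. In this sketch the sentence is a hypothetical sample chosen to contain the word safe.

```python
import re

text = "He was safe and sound. And so all was well."  # hypothetical sample

# " *" allows zero spaces, so the match may start inside a longer word.
loose = re.findall(" *([Aa][a-z][a-z]) ", text)
print(loose)  # ['afe', 'and', 'And', 'all']

# Find the whole word that the stray 'afe' came from.
source_words = re.findall(r"\w*afe\w*", text)
print(source_words)  # ['safe']
```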

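We can improve the regular expression with the alternation symbol |:

result = re.findall(" (a[a-z][a-z]) |(A[a-z][a-z]) ", text)

The intent is to match either a lowercase a followed by two letters with a space on each side, or an uppercase A followed by two letters with no leading space but with a trailing space.

With two parenthesised groups, however, findall returns tuples. Try it yourself:

PS E:\ProgramThomas\Coding-Notes\Python-Notes\Regex> python regex.py
{('air', ''), ('and', ''), ('are', ''), ('ash', ''), ('', 'And'), ('all', ''), ('', 'All')}

This is easy to deal with: leave the parentheses off the expression after the |, as in " (a[a-z][a-z]) |A[a-z][a-z] "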

{'', 'all', 'air', 'and', 'ash', 'are'}

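One empty string remains, so finish with result.remove('') to get rid of it. Note that the empty string is what is left of And and All: the second alternative is no longer captured, so those words drop out of the result.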
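As a possible alternative (not the approach used above), the word-boundary escape \b keeps the capitalised words and produces no empty strings. The sketch below compares the three variants on a hypothetical sample sentence.

```python
import re

text = "He was safe and sound. And so all was well."  # hypothetical sample

# Two groups: findall returns one tuple per match, one slot per group.
two_groups = re.findall(" (a[a-z][a-z]) |(A[a-z][a-z]) ", text)
# One group: a match of the ungrouped alternative yields an empty string.
one_group = re.findall(" (a[a-z][a-z]) |A[a-z][a-z] ", text)
# No group, word boundaries instead of spaces: whole matches are returned.
bounded = re.findall(r"\b[Aa][a-z][a-z]\b", text)

print(two_groups)            # [('and', ''), ('', 'And'), ('all', '')]
print(one_group)             # ['and', '', 'all']
print(sorted(set(bounded)))  # ['And', 'all', 'and']
```

Because \b consumes no characters, two neighbouring words that share a single space can both match, and safe no longer leaks 'afe'.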

Regex special characters

https://www.tutorialspoint.com/python/python_reg_expressions
Some special characters are described in detail below.

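\d: digit test; it matches any single digit.
The poem contains no digits, and I can't write poetry, only HelloWorld, so the file is:
HelloWorld 123
456
Code:
result = re.findall(r"\d", text)
Output:
['1', '2', '3', '4', '5', '6']
The digit test can also take a length: \d{2} finds exactly two digits, \d{2,3} finds two to three digits and prefers the longer match, and \d+ finds one or more digits.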
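A short sketch of those four variants, run on the two-line HelloWorld file content shown above (held in a string here instead of being read from disk):

```python
import re

text = "HelloWorld 123\n456"  # the two-line file content from above

single = re.findall(r"\d", text)            # one digit at a time
pairs = re.findall(r"\d{2}", text)          # exactly two digits
two_to_three = re.findall(r"\d{2,3}", text) # two or three, longest first
runs = re.findall(r"\d+", text)             # one or more digits

print(single)        # ['1', '2', '3', '4', '5', '6']
print(pairs)         # ['12', '45']
print(two_to_three)  # ['123', '456']
print(runs)          # ['123', '456']
```

With \d{2} the leftover 3 and 6 are dropped, because a single remaining digit cannot satisfy the two-digit requirement.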

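\w: word-character test.
\w is roughly [A-Za-z0-9_]: letters, digits and the underscore (for Python 3 str patterns it also covers letters and digits from other scripts). Change the regular expression to:
result = re.findall(r"\w{2,3}", text)
That produces a great many matches.
Output (long, excerpted; note that the excerpt contains spaces, punctuation and newlines, which \w never matches, so it appears to come from a pattern that accepts any three characters rather than from \w{2,3}):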
…‘his’, ’ he’, ‘ad.’, ‘\n\nB’, 'ut ', ‘sti’, 'll ', ‘so ‘, ‘sli’, ‘ght’, ’ an’, ‘d w’, ‘eed’, ‘y, ‘, ‘one’, ’ wo’, ‘uld’, ’ do’, ‘ubt’, ’ hi’, ‘s p’, ‘owe’, ‘r t’, ‘o s’, ‘tay’, ‘,\nA’, 'nd ‘, ‘the’, ’ ol’, ‘d m’, 'an ', ‘sai’, 'd, ', ‘"Th’, 'at ', ‘hor’, 'se ', ‘wil’, ‘l n’, ‘eve’, ‘r d’, ‘o\nF’, 'or ‘, ‘a l’, ‘ong’, ’ an’, ‘d t’, ‘iri’, ‘ng ‘, ‘gal’, ‘lop’, ’ --’, ’ la’, ‘d, ‘, ‘you’, “‘d ", ‘bet’, ‘ter’, ’ st’, 'op ', ‘awa’, ‘y,\n’, ‘Tho’, 'se ', ‘hil’, 'ls ‘, ‘are’, ’ fa’, ‘r t’, 'oo ', ‘rou’, 'gh ‘, ‘for’, ’ su’, 'ch ', 'as ', ‘you’, '.”\n’, ‘So ‘, ‘he ‘, ‘wai’, ‘ted’, ’ sa’, ‘d a’, ‘nd ‘, ‘wis’, ‘tfu’, ‘l -’, ‘- o’, ‘nly’, ’ Cl’, ‘anc’, ‘y s’, ‘too’, ‘d h’, ‘is ‘, ‘fri’, ‘end’, ’ --’, ‘\n"I’, ’ th’, ‘ink’, ’ we’, ’ ou’, ‘ght’, ’ to’, ’ le’, ‘t h’, ‘im ‘, ‘com’, ‘e,"’, ’ he’, ’ sa’, ‘id;’, ‘\n"I’, ’ wa’, ‘rra’, 'nt ‘, "he’", ‘ll ‘, ‘be ‘, ‘wit’,
‘h u’, ‘s w’, ‘hen’, ’ he’, "‘s ", ‘wan’, ‘ted’, ’ at’, ’ th’, ‘e e’, ‘nd,’, ‘\nFo’, ‘r b’, ‘oth’, ’ hi’, ‘s
h’, ‘ors’, ‘e a’, ‘nd ‘, ‘he ‘, ‘are’, ’ mo’, ‘unt’, ‘ain’, ’ br’, ‘ed.’, ‘"\n\n’, ‘"He’, ’ ha’, ‘ils’, ’ fr’, ‘om ‘, ‘Sno’, ‘wy ‘, ‘Riv’, ‘er,’, ’ up’, ’ by’, ’ Ko’, ‘sci’, ‘usk’, “o’s”, ’ si’, ‘de,’, ‘\nWh’, ‘ere’, ’ th’, ‘e h’, ‘ill’, ‘s a’, 're ', ‘twi’, 'ce ', 'as ', '…
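For reference, this is roughly what \w{2,3} returns on a short phrase, together with a comparison against [A-Za-z]. Both sample strings are hypothetical.

```python
import re

text = "So he waited, sad and wistful"  # hypothetical sample phrase

# Runs of two or three word characters; spaces and commas are never included.
chunks = re.findall(r"\w{2,3}", text)
print(chunks)  # ['So', 'he', 'wai', 'ted', 'sad', 'and', 'wis', 'tfu']

# \w also accepts digits and the underscore, unlike [A-Za-z].
code = "var_1 = x2 + 3"
word_chars = re.findall(r"\w+", code)
letters_only = re.findall(r"[A-Za-z]+", code)
print(word_chars)    # ['var_1', 'x2', '3']
print(letters_only)  # ['var', 'x']
```

The final l of wistful is missing from the first result because a single leftover character is shorter than the minimum of two.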

\S matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].

Code:
result = re.findall(r"\S", text)
Output (long, excerpted):
…‘h’, ‘e’, ‘y’, ‘g’, ‘a’, ‘i’, ‘n’, ‘t’, ‘h’, ‘e’, ‘s’, ‘h’, ‘e’, ‘l’, ‘t’, ‘e’, ‘r’, ‘o’, ‘f’, ‘t’, ‘h’, ‘o’, ‘s’, ‘e’, ‘h’, ‘i’, ‘l’, ‘l’, ‘s’, ‘.’, ‘"’, ‘S’, ‘o’, ‘C’, ‘l’, ‘a’, ‘n’, ‘c’, ‘y’, ‘r’, ‘o’,
‘d’, ‘e’, ‘t’, ‘o’, ‘w’, ‘h’, ‘e’, ‘e’, ‘l’, ‘t’, ‘h’, ‘e’, ‘m’, ‘-’, ‘-’, ‘h’, ‘e’, ‘w’, ‘a’, ‘s’, ‘r’, ‘a’, ‘c’, ‘i’, ‘n’, ‘g’, ‘o’, ‘n’, ‘t’, ‘h’, ‘e’, ‘w’, ‘i’, ‘n’, ‘g’, ‘W’, ‘h’, ‘e’, ‘r’, ‘e’, ‘t’, ‘h’, ‘e’, ‘b’, ‘e’, ‘s’, ‘t’, ‘a’, ‘n’, ‘d’, ‘b’, ‘o’, ‘l’, ‘d’, ‘e’, ‘s’, ‘t’, ‘r’, ‘i’, ‘d’, ‘e’, ‘r’, ‘s’, ‘t’, ‘a’, ‘k’, ‘e’, ‘t’, ‘h’, ‘e’, ‘i’, ‘r’, ‘p’, ‘l’, ‘a’, ‘c’, ‘e’, ‘,’, ‘A’, ‘n’, ‘d’, ‘h’, ‘e’, ‘r’, ‘a’, ‘c’, ‘e’, ‘d’, ‘h’, ‘i’, ‘s’, ‘s’, ‘t’, ‘o’, ‘c’, ‘k’, ‘-’, ‘h’, ‘o’, ‘r’, ‘s’, ‘e’, ‘p’, ‘a’, ‘s’, ‘t’, ‘t’, ‘h’, ‘e’,
‘m’, ‘,’, ‘a’, ‘n’, ‘d’, ‘h’, ‘e’, ‘m’, ‘a’, ‘d’, ‘e’, ‘t’, ‘h’, ‘e’, ‘r’, ‘a’, ‘n’, ‘g’, ‘e’, ‘s’, ‘r’, ‘i’, ‘n’, ‘g’, ‘W’, ‘i’, ‘t’, ‘h’, ‘t’, ‘h’, ‘e’, ‘s’, ‘t’, ‘o’, ‘c’, ‘k’, ‘w’, ‘h’, '…
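A small sketch of \S on a hypothetical string, including a check that it agrees with the explicit character class for plain ASCII text:

```python
import re

text = "Hi there,\n\tyou!"  # hypothetical sample with a newline and a tab

chars = re.findall(r"\S", text)    # one non-whitespace character per match
words = re.findall(r"\S+", text)   # runs of non-whitespace characters
explicit = re.findall(r"[^ \f\n\r\t\v]", text)

print("".join(chars))     # Hithere,you!
print(words)              # ['Hi', 'there,', 'you!']
print(chars == explicit)  # True for this ASCII-only sample
```

Adding + is usually more useful than the bare \S, since it returns whitespace-separated tokens instead of single characters.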

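The following is from Runoob (菜鳥教程).

## Non-printing characters

Non-printing characters can also be part of a regular expression. The table below lists the escape sequences that represent them. The table describes regular expressions in general; Python's re module does not support the \cx form.

Character	Description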

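\cx	Matches the control character indicated by x. For example, \cM matches Control-M, that is, a carriage return. x must be one of A-Z or a-z; otherwise c is treated as a literal 'c'.
\f	Matches a form feed. Equivalent to \x0c and \cL.
\n	Matches a newline. Equivalent to \x0a and \cJ.
\r	Matches a carriage return. Equivalent to \x0d and \cM.
\s	Matches any whitespace character, including space, tab, form feed and so on. Equivalent to [ \f\n\r\t\v]. Note that Unicode regular expressions also match the full-width space.
\S	Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t	Matches a tab. Equivalent to \x09 and \cI.
\v	Matches a vertical tab. Equivalent to \x0b and \cK.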
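The whitespace rows of the table can be tried out in Python as follows. This is a sketch on a hypothetical string that contains a full-width space (U+3000); re.ASCII is the standard flag that restricts \s to the ASCII whitespace characters.

```python
import re

# Hypothetical sample: space, tab, newline and a full-width space (U+3000).
text = "a b\tc\nd\u3000e"

unicode_ws = re.findall(r"\s", text)                # default Unicode matching
ascii_ws = re.findall(r"\s", text, flags=re.ASCII)  # ASCII whitespace only
tabs = re.findall(r"\x09", text)                    # \x09 is the tab character
parts = re.split(r"\s+", text)                      # split on any whitespace run

print(unicode_ws == [" ", "\t", "\n", "\u3000"])  # True
print(ascii_ws == [" ", "\t", "\n"])              # True
print(tabs == ["\t"])                             # True
print(parts)                                      # ['a', 'b', 'c', 'd', 'e']
```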

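## Special characters

Special characters are characters with a special meaning, such as the * in runoo*b, which means the preceding character may occur zero or more times. To find a literal * in a string, escape it by putting a \ in front of it: runo\*ob matches the string runo*ob.

Many metacharacters need special treatment when you try to match them. To match one of these special characters literally, you must first escape it, that is, put a backslash \ in front of it. The table below lists the special characters in regular expressions:

Character	Description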

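$	Matches the end of the input string. In multiline mode (the Multiline property of a RegExp object; re.MULTILINE in Python), $ also matches just before a line break. To match a literal $, use \$.
( )	Marks the start and end of a subexpression. The subexpression can be captured for later use. To match these characters, use \( and \).
*	Matches the preceding subexpression zero or more times. To match a literal *, use \*.
+	Matches the preceding subexpression one or more times. To match a literal +, use \+.
.	Matches any single character except the newline \n. To match a literal ., use \..
[	Marks the start of a bracket expression. To match a literal [, use \[.
?	Matches the preceding subexpression zero or one time, or marks a non-greedy quantifier. To match a literal ?, use \?.
\	Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, 'n' matches the character 'n', '\n' matches a newline, the sequence '\\' matches "\", and '\(' matches "(".
^	Matches the start of the input string, except inside a bracket expression, where it negates the character set. To match a literal ^, use \^.
{	Marks the start of a quantifier expression. To match a literal {, use \{.
|	Indicates a choice between two alternatives. To match a literal |, use \|.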
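Here is a brief sketch of escaping in Python. The sample string is hypothetical; re.escape is the standard-library helper that inserts the backslashes for you.

```python
import re

text = "price: $5 (approx.) for runo*ob, not runoob"  # hypothetical sample

literal_star = re.findall(r"runo\*ob", text)  # escaped *: a literal asterisk
repeat_star = re.findall(r"runo*ob", text)    # unescaped *: zero or more o's
dollar = re.findall(r"\$\d", text)            # escaped $: a literal dollar sign
escaped = re.escape("(approx.)")              # backslashes added automatically

print(literal_star)                # ['runo*ob']
print(repeat_star)                 # ['runoob']
print(dollar)                      # ['$5']
print(escaped)                     # \(approx\.\)
print(re.findall(escaped, text))   # ['(approx.)']
```

Writing patterns as raw strings (the r prefix) keeps each backslash intact, so \* and \$ reach the regex engine unchanged.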

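## Quantifiers

Quantifiers specify how many times a given component of a regular expression must occur for a match. There are six: *, +, ?, {n}, {n,} and {n,m}.

The quantifiers are:

Character	Description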

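*	Matches the preceding subexpression zero or more times. For example, zo* matches "z" as well as "zoo". * is equivalent to {0,}.
+	Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo" but not "z". + is equivalent to {1,}.
?	Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do", the "does" in "does", and the "do" in "doxy". ? is equivalent to {0,1}.
{n}	n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob" but matches the two o's in "food".
{n,}	n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob" but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+', and 'o{0,}' is equivalent to 'o*'.
{n,m}	m and n are non-negative integers with n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the numbers.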
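The examples in the table can be checked directly in Python. This sketch uses a non-capturing group (?:es) in place of (es), so that findall returns the whole match rather than just the group.

```python
import re

star_z = bool(re.fullmatch("zo*", "z"))      # * allows zero o's
star_zoo = bool(re.fullmatch("zo*", "zoo"))
plus_z = bool(re.fullmatch("zo+", "z"))      # + needs at least one o

optional = re.findall("do(?:es)?", "do does doxy")
exact_bob = re.findall("o{2}", "Bob")
exact_food = re.findall("o{2}", "food")
at_least = re.findall("o{2,}", "foooood")
bounded = re.findall("o{1,3}", "fooooood")

print(star_z, star_zoo, plus_z)  # True True False
print(optional)                  # ['do', 'does', 'do']
print(exact_bob, exact_food)     # [] ['oo']
print(at_least)                  # ['ooooo']
print(bounded)                   # ['ooo', 'ooo']
```

In the last line findall keeps scanning after the first three o's, so the six o's in "fooooood" come back as two matches of three.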

This article may be reposted, provided a link to the original is included.
