Web Scraping 1: Regular Expressions
Tutorial source: 莫烦Python

Before learning web scraping, it helps to get familiar with regular expressions.

Importing the module
import re  # regular expression module

Simple matching
# matching string
pattern1 = "cat"
pattern2 = "bird"
string = "dog runs to cat"
print(pattern1 in string)
print(pattern2 in string)

Output:
True
False

Finding matches with regular expressions
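Because `re.search` returns `None` when nothing matches, calling `.group()` directly on its result can raise `AttributeError`. The following is a minimal sketch of a guard, separate from the tutorial's own example; the helper name `first_match` is hypothetical and not part of the original tutorial.

```python
import re


def first_match(pattern, text):
    """Return the matched substring, or None if the pattern is not found."""
    match = re.search(pattern, text)
    # re.search returns None on failure, so check before calling .group()
    return match.group() if match else None


found = first_match(r"cat", "dog runs to cat")
missing = first_match(r"bird", "dog runs to cat")
print(found)    # cat
print(missing)  # None
```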
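# regular expression
pattern1 = "cat"
pattern2 = "bird"
string = "dog runs to cat"
print(re.search(pattern1, string))          # show the match object
print(re.search(pattern1, string).group())  # .group() returns the matched substring
print(re.search(pattern1, string).span())   # .span() returns the (start, end) indices of the match in the original string
print(re.search(pattern2, string))

Output (the exact repr of the match object varies by Python version):
<_sre.SRE_Match object at 0x7fde38270b28>
cat
(12, 15)
None

Matching several possibilities with []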
# multiple patterns ("run" or "ran")
print(re.search(r'r[au]n', "dog runs to cat").group())

Output:
run

Matching even more possibilities
# continue
print(re.search(r'r[A-Z]n', 'dog runs to cat'))
print(re.search(r'r[a-z]n', 'dog runs to cat'))
print(re.search(r'r[0-9]n', 'dog r2ns to cat'))
print(re.search(r'r[0-9a-z]n', 'dog runs to cat'))

Output:
None
<_sre.SRE_Match object at 0x7fde382ab1d0>
<_sre.SRE_Match object at 0x7fde382ab1d0>
<_sre.SRE_Match object at 0x7fde382ab1d0>

Special character classes
Digits
# \d : any decimal digit
print(re.search(r'r\dn', 'run r4n').group())
# \D : any character that is not a decimal digit
print(re.search(r'r\Dn', 'run r4n').group())

Output:
r4n
run

Whitespace
# \s : any whitespace character [\t\n\r\f\v]
print(re.search(r'r\sn', 'r\nn r4n').group())  # the match contains a newline, so it prints on two lines
# \S : opposite of \s, any non-whitespace character
print(re.search(r'r\Sn', 'r\nn r4n').group())

Output:
r
n
r4n

All letters, digits, and "_"
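# \w : [a-zA-Z0-9_], any upper- or lower-case letter, digit, or underscore
print(re.search(r'r\wn', 'r\nn r4n').group())
# \W : opposite of \w
print(re.search(r'r\Wn', 'r\nn r4n').group())  # the match contains a newline, so it prints on two lines

Output:
r4n
r
n

Word boundaries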
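As a minimal sketch of how `\b` and `\B` behave, the spans below show where each pattern matches. Both match an empty string rather than a character: `\b` only at the edge of a word, `\B` only where there is no word edge. The variable names are made up for this sketch.

```python
import re

text = "dog runs to cat"

# \b matches the empty string at the edge of a word
whole_word = re.search(r"\bruns\b", text)
# \B matches the empty string only where there is no word edge,
# so r"\Bun\B" finds "un" strictly inside "runs"
inside_word = re.search(r"\Bun\B", text)
# "run" is followed by "s", so there is no boundary right after it
no_match = re.search(r"\brun\b", text)

print(whole_word.span())   # (4, 8)
print(inside_word.span())  # (5, 7)
print(no_match)            # None
```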
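# \b : empty string, but only at the start or end of a word
print(re.search(r'\bruns\b', 'dog runs to cat').group())
# \B : empty string, but not at the start or end of a word
# The input needs runs of several spaces here: \B can match between two spaces,
# but not between a letter and a space.
print(re.search(r'\B runs \B', 'dog   runs  to cat').group())  # matches ' runs ' including the surrounding spaces

Output:
runs
 runs 

Special characters and any character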
# \\ : match a literal backslash
print(re.search(r'runs\\', 'runs\\ to me').group())  # the backslash in the input is escaped as \\
# . : match any character (except \n)
print(re.search(r'r.n', 'r[ns to me]').group())

Output:
runs\
r[n

Start and end of a line
# ^ : match at the beginning of a line
print(re.search(r'^dog', 'dog runs to cat').group())
# $ : match at the end of a line
print(re.search(r'cat$', 'dog runs to cat').group())

Output:
dog
cat

Optional (zero or one occurrence)
# ? : may or may not occur; the character or group before ? is optional
print(re.search(r'Mon(day)?', 'Monday').group())
print(re.search(r'Mon(day)?', 'Mon').group())
print(re.search(r'Mon(day)?', 'Mond').group())

Output:
Monday
Mon
Mon

Multi-line matching
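As a minimal sketch of what the `re.M` flag changes, the snippet below collects the first word of each line with and without the flag. The sample text is invented for this sketch.

```python
import re

text = "dog runs to cat.\nI run to dog.\nI sleep."

# Without re.M, ^ anchors only at the very start of the string
starts_default = re.findall(r"^\w+", text)
# With re.M, ^ also anchors right after each newline
starts_multiline = re.findall(r"^\w+", text, flags=re.M)

print(starts_default)    # ['dog']
print(starts_multiline)  # ['dog', 'I', 'I']
```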
# multi-line
string = """
dog runs to cat.
I run to dog.
"""
print(re.search(r'^I', string))
print(re.search(r'^I', string, flags=re.M).group())  # flags=re.M makes ^ and $ match at every line
print(re.search(r'^I', string, flags=re.MULTILINE).group())

Output:
None
I
I

Zero or more times
# * : occur 0 or more times
print(re.search(r'ab*', 'a').group())
print(re.search(r'ab*', 'abbb').group())

Output:
a
abbb

One or more times
# + : occur 1 or more times
print(re.search(r'ab+', 'a'))
print(re.search(r'ab+', 'abbb').group())

Output:
None
abbb

Between n and m times
# {n,m} : occur n to m times
print(re.search(r'ab{2,10}', 'a'))
print(re.search(r'ab{2,10}', 'abbbb').group())

Output:
None
abbbb

Groups
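As a minimal sketch of collecting every captured group in one call, `Match.groups()` returns the groups as a tuple in order, and `Match.groupdict()` returns the named ones, written `(?P<name>...)`, as a dict. The sample record and the variable names are invented for this sketch.

```python
import re

record = "ID: 1001, Name: Alice"
match = re.search(r"ID: (?P<id>\d+), Name: (?P<name>\w+)", record)

# groups() lists every captured group in order; groupdict() keys them by name
positional = match.groups() if match else ()
named = match.groupdict() if match else {}

print(positional)  # ('1001', 'Alice')
print(named)       # {'id': '1001', 'name': 'Alice'}
```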
# group
match = re.search(r'(\d+), Data: (.+)', 'ID: 20180317, Data: Mar/17/2018')
print(match.group())   # the whole match
print(match.group(1))  # first group
print(match.group(2))  # second group

Output:
20180317, Data: Mar/17/2018
20180317
Mar/17/2018

# named groups: (?P<name>...)
match = re.search(r'(?P<id>\d+), Data: (?P<date>.+)', 'ID: 20180317, Data: Mar/17/2018')
print(match.group('id'))
print(match.group('date'))

Output:
20180317
Mar/17/2018

Finding all matches
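As a minimal sketch of the ways to collect every match, the snippet below compares `re.findall`, which returns strings, with `re.finditer`, which yields match objects and so also exposes positions. It also shows that a capture group changes what `findall` returns. The variable names are made up for this sketch.

```python
import re

text = "run ran ren"

# findall returns the matched strings
words = re.findall(r"r[ua]n", text)
# finditer yields Match objects, so positions are available too
spans = [m.span() for m in re.finditer(r"r[ua]n", text)]
# with one capture group, findall returns only the group contents
vowels = re.findall(r"r([ua])n", text)

print(words)   # ['run', 'ran']
print(spans)   # [(0, 3), (4, 7)]
print(vowels)  # ['u', 'a']
```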
# findall
print(re.findall(r'r[ua]n', 'run ran ren'))

Output:
['run', 'ran']

# | : or (either the pattern on the left or the one on the right)
print(re.findall(r'run|ran', 'run ran ren'))

Output:
['run', 'ran']

Replacing
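As a minimal sketch of what else `re.sub` can do, the snippet below uses group references in the replacement, the `count` argument, and `re.subn`, which also reports the number of replacements. The sample sentence is invented for this sketch.

```python
import re

text = "dog runs to cat, cat runs to dog"

# \1 and \2 in the replacement refer to the captured groups
swapped = re.sub(r"(dog) runs to (cat)", r"\2 runs to \1", text)
# count=1 limits the substitution to the first match
first_only = re.sub(r"runs", "walks", text, count=1)
# re.subn also reports how many replacements were made
replaced, n = re.subn(r"runs", "walks", text)

print(swapped)      # cat runs to dog, cat runs to dog
print(first_only)   # dog walks to cat, cat runs to dog
print(replaced, n)  # dog walks to cat, cat walks to dog 2
```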
# re.sub() : replace
print(re.sub(r'r[au]ns', 'catches', 'dog runs to cat'))
print(re.sub(r'I', 'You', 'I like apple'))

Output:
dog catches to cat
You like apple

Splitting
# re.split()
print(re.split(r'[,;\.]', 'a;b,c.d;e.f'))

Output:
['a', 'b', 'c', 'd', 'e', 'f']

compile
# compile
compiled_re = re.compile(r'r[au]n')
print(compiled_re.search('dog ran to cat').group())

Output:
ran
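To connect these pieces back to web scraping, here is a minimal sketch that reuses a compiled pattern with named groups to pull links out of a small HTML fragment. The fragment, the URLs, and the names `link_re` and `links` are made up for illustration and do not come from the tutorial.

```python
import re

# Hypothetical HTML fragment, made up for this sketch
html = """
<ul>
  <li><a href="https://example.com/a">First</a></li>
  <li><a href="https://example.com/b">Second</a></li>
</ul>
"""

# Compile once and reuse; [^"]+ stops at the closing quote, [^<]+ at the next tag
link_re = re.compile(r'<a href="(?P<url>[^"]+)">(?P<text>[^<]+)</a>')

links = [(m.group("url"), m.group("text")) for m in link_re.finditer(html)]
print(links)
```

A pattern like this only handles markup written exactly in this shape. For real pages with varying attributes or nesting, an HTML parser is usually the safer choice.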