Python Basics: A Regex Web-Scraping Example, the configparser Module, and the subprocess Module

阿新 • Published: 2017-07-04


Regex web-scraping example (校花網, xiaohuar.com)

import requests
import re
import json

# Return the page at the given URL as a string.
def getPage_str(url):
    page_string = requests.get(url)
    return page_string.text

hua_dic = {}

# Scrape each entry's name, school, and number of likes.
def run_re(url):
    hua_str = getPage_str(url)
    hua_list = re.finditer(r'<span class="price">(?P<name>.*?)</span>.*?class="img_album_btn">(?P<school>.*?)</a>.*?<em class.*?>(?P<like>\d+?)</em>', hua_str, re.S)
    for n in hua_list:
        # Store the name, school, and like count in the dictionary.
        hua_dic[n.group('name')] = [n.group('school'), n.group('like')]

# Generate the URL of each list page.
def url():
    for i in range(0, 43):
        urls = "http://www.xiaohuar.com/list-1-%s.html" % i
        yield urls

# Run the scraper over every page.
for i in url():
    run_re(i)

print(hua_dic)

# with open('aaa', 'w', encoding='utf-8') as f:
#     f.write(str(hua_dic))
data = json.dumps(hua_dic)  # serialize the scraped dictionary to JSON
print(data)
# Mode 'w' rather than 'a': appending a second JSON document would make the file unparseable.
with open('hua.json', 'w') as f:
    f.write(data)

# Deserialization
# with open('hua.json', 'r') as f1:
#     new_data = json.load(f1)
# print(new_data)
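
Because the scraper depends on a live site, it can help to try the named-group pattern offline first. The sketch below runs the same regular expression over a small HTML fragment that is invented for this illustration (the names, schools, and markup details are assumptions, not data from the real page):

```python
import re

# Hypothetical HTML fragment, invented for this illustration;
# the real page markup may differ.
sample_html = '''
<div class="item">
  <span class="price">Alice</span>
  <a href="/s/1" class="img_album_btn">Example University</a>
  <em class="bold">128</em>
</div>
<div class="item">
  <span class="price">Bob</span>
  <a href="/s/2" class="img_album_btn">Sample College</a>
  <em class="bold">7</em>
</div>
'''

pattern = re.compile(
    r'<span class="price">(?P<name>.*?)</span>'
    r'.*?class="img_album_btn">(?P<school>.*?)</a>'
    r'.*?<em class.*?>(?P<like>\d+?)</em>',
    re.S,  # let "." match newlines so one match can span several lines
)

result = {}
for m in pattern.finditer(sample_html):
    # Each named group is read back by name, just as in run_re().
    result[m.group('name')] = [m.group('school'), m.group('like')]

print(result)
```

Each match yields one dictionary entry; the like count stays a string because regex groups always return text.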

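The configparser module

This module handles configuration files in the style of Linux conf files, a format similar to Windows INI files. A file can contain one or more sections, and each section can hold multiple parameters (key = value).

For example: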

[DEFAULT]
ServerAliveInterval = 45
Compression = yes
CompressionLevel = 9
ForwardX11 = yes
  
[bitbucket.org]
User = hg
  
[topsecret.server.com]
Port = 50022
ForwardX11 = no
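
To see how a file like this is interpreted, the sample above can be parsed straight from a string with read_string(), which avoids touching the disk. This is only a small sketch; the variable names are chosen for illustration:

```python
import configparser

# The sample configuration from above, held in a string instead of a file.
ini_text = """
[DEFAULT]
ServerAliveInterval = 45
Compression = yes
CompressionLevel = 9
ForwardX11 = yes

[bitbucket.org]
User = hg

[topsecret.server.com]
Port = 50022
ForwardX11 = no
"""

config = configparser.ConfigParser()
config.read_string(ini_text)

sections = config.sections()  # DEFAULT is not reported as a section
inherited = config['bitbucket.org']['Compression']         # comes from DEFAULT
overridden = config['topsecret.server.com']['ForwardX11']  # the section's own value wins

print(sections, inherited, overridden)
```

Values from DEFAULT are visible in every section unless the section sets the same key itself.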

Example of generating the file:

import configparser

config = configparser.ConfigParser()  # create a parser object

# Key-value pairs for the DEFAULT section. DEFAULT is special: every other section inherits its contents.
config["DEFAULT"] = {'ServerAliveInterval': '45',
                     'Compression': 'yes',
                     'CompressionLevel': '9',
                     'ForwardX11': 'yes'
                     }

config['bitbucket.org'] = {'User': 'hg'}  # an ordinary section

# An ordinary section; Port = 50022 matches the sample file shown above.
config['topsecret.server.com'] = {'Port': '50022', 'ForwardX11': 'no'}

with open('example.ini', 'w') as configfile:  # write to the file
    config.write(configfile)

Reading the file contents:

import configparser

config = configparser.ConfigParser()
# ---------- Reading the file contents with dictionary-style access
print(config.sections())        # []
config.read('example.ini')
print(config.sections())        # ['bitbucket.org', 'topsecret.server.com']
print('bytebong.com' in config) # False
print('bitbucket.org' in config) # True

print(config['bitbucket.org']["user"])  # hg
print(config['DEFAULT']['Compression']) # yes
print(config['topsecret.server.com']['ForwardX11'])  # no
print(config['bitbucket.org'])          # <Section: bitbucket.org>
for key in config['bitbucket.org']:     # note: keys inherited from DEFAULT are included
    print(key)
print(config.options('bitbucket.org'))  # same as the for loop: all keys under 'bitbucket.org'
print(config.items('bitbucket.org'))    # all key-value pairs under 'bitbucket.org'
print(config.get('bitbucket.org','compression')) # yes; get(section, option) looks up one value in a section
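
Every value read this way is a string. As a brief sketch, the typed getters and the fallback argument can be used when a number, a boolean, or a default is wanted; the Timeout key here is hypothetical and deliberately absent from the configuration:

```python
import configparser

config = configparser.ConfigParser()
config.read_string("""
[DEFAULT]
ServerAliveInterval = 45
Compression = yes

[bitbucket.org]
User = hg
""")

section = config['bitbucket.org']
interval = section.getint('ServerAliveInterval')   # inherited from DEFAULT, converted to int
compression = section.getboolean('Compression')    # 'yes' is converted to True
timeout = section.get('Timeout', fallback='30')    # hypothetical key, absent, so the fallback is used
keys = config.options('bitbucket.org')             # keys are lower-cased; DEFAULT keys are included

print(interval, compression, timeout, sorted(keys))
```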

The subprocess module

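When we need to call system commands, the os module is usually the first thing that comes to mind, via os.system() and os.popen(). These two functions are too simple for more complex tasks, however, such as feeding input to a running command, reading its output, checking its exit status, or managing several commands in parallel. In those cases, Popen from the subprocess module does the job effectively.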
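
The operations just listed (supplying input, reading output, checking the exit status) can be sketched in a few lines. To keep the example portable, the child here is a tiny Python one-liner started through sys.executable; that child program is an assumption made for this illustration, not a command from the original text:

```python
import subprocess
import sys

# Child program (hypothetical, for illustration): upper-case whatever arrives on stdin.
child_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"

proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdin=subprocess.PIPE,    # lets the parent send input
    stdout=subprocess.PIPE,   # lets the parent read output
    stderr=subprocess.PIPE,
    text=True,                # exchange str instead of bytes (Python 3.7+)
)

# communicate() sends the input, reads all output, and waits for the child to exit.
out, err = proc.communicate(input="hello from the parent\n")

print(out, end="")
print("exit status:", proc.returncode)
```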

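The subprocess module lets a process create a new child process, connect to the child's stdin/stdout/stderr through pipes, and obtain the child's return value. The module is built around one class: Popen.

Simple commands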

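import subprocess
import sys

# Start a new process; it runs asynchronously with respect to the parent process.
if sys.platform == 'win32':
    s = subprocess.Popen('dir', shell=True)  # dir is a cmd built-in, so it needs shell=True
else:
    s = subprocess.Popen('ls')
s.wait()                  # s is a Popen instance; wait() blocks until the child process has finished
print('ending...')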

Commands with options (passed the same way on Windows and Linux)

import subprocess
subprocess.Popen('ls -l', shell=True)   # a single command string, interpreted by the shell
# subprocess.Popen(['ls', '-l'])        # alternative: an argument list, no shell needed
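
The two forms above differ in who splits the command into arguments: the shell, or the caller. As a rough sketch, shlex.split() from the standard library shows how a command string breaks into a list, and a list passed without a shell delivers each item to the child unchanged. The directory name and the Python child process below are made up for the illustration:

```python
import shlex
import subprocess
import sys

# Split a command string the way a POSIX shell would (the path is hypothetical).
args = shlex.split("ls -l '/tmp/my dir'")
print(args)

# With an argument list and no shell, an item containing a space stays one argument.
p = subprocess.Popen(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", "two words"],
    stdout=subprocess.PIPE,
    text=True,
)
out, _ = p.communicate()
print(out, end="")
```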

Controlling the child process

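s.poll()            # check the child's status: None while it is still running, otherwise its return code
s.kill()            # kill the child process
s.send_signal(sig)  # send the signal sig to the child process
s.terminate()       # terminate the child process
s.pid               # the child's process ID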
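
A short sketch of these calls in sequence, assuming a long-running child (here a Python one-liner that sleeps, chosen only so the example runs on any platform):

```python
import subprocess
import sys

# Hypothetical long-running child: sleeps for 30 seconds unless stopped first.
s = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
print("pid:", s.pid)

status_before = s.poll()   # None, because the child is still running
s.terminate()              # ask the child to stop
s.wait(timeout=10)         # collect the child's exit status
status_after = s.poll()    # now a return code; the exact value is platform-dependent

print(status_before, status_after)
```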

Controlling the child's output streams

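When creating a child process with Popen(), you can redirect its standard input, standard output, and standard error, and you can use subprocess.PIPE to connect the input and output of several child processes into a pipeline: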

import subprocess
# s1 = subprocess.Popen(["ls","-l"], stdout=subprocess.PIPE)
# print(s1.stdout.read())

# Linux: the equivalent of `cat /etc/passwd | grep 0:0`
s1 = subprocess.Popen(["cat","/etc/passwd"], stdout=subprocess.PIPE)
s2 = subprocess.Popen(["grep","0:0"], stdin=s1.stdout, stdout=subprocess.PIPE)
out = s2.communicate()   # returns a (stdout, stderr) tuple of bytes; stderr is None here because it was not piped
print(out)

# Windows: the output of dir is decoded as GBK, which assumes a Chinese-locale system
s = subprocess.Popen("dir", shell=True, stdout=subprocess.PIPE)
print(s.stdout.read().decode("gbk"))

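subprocess.PIPE effectively provides a buffer for the text stream. s1's stdout writes text into the buffer, and s2's stdin then reads the text from that pipe. s2's output is likewise held in a pipe until communicate() reads it out.
Note: communicate() is a method of the Popen object. It blocks the parent process until the child process finishes.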
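
The same producer-to-filter arrangement can be tried on any platform by replacing cat and grep with two small Python child processes. This is only a sketch under that assumption; the two one-line programs and their sample data are invented for the illustration:

```python
import subprocess
import sys

# Hypothetical stand-ins for `cat /etc/passwd` and `grep 0:0`.
producer_code = "print('root:x:0:0'); print('alice:x:1001:1001')"
filter_code = "import sys; [sys.stdout.write(l) for l in sys.stdin if '0:0' in l]"

s1 = subprocess.Popen([sys.executable, "-c", producer_code], stdout=subprocess.PIPE)
s2 = subprocess.Popen([sys.executable, "-c", filter_code],
                      stdin=s1.stdout, stdout=subprocess.PIPE)
s1.stdout.close()  # the parent no longer needs its copy of the pipe's read end

out, err = s2.communicate()  # blocks until s2 finishes; err is None because stderr was not piped
s1.wait()

print(out.decode())
```

Closing s1.stdout in the parent follows the pattern shown in the Python documentation for shell-style pipelines, so that s1 can receive SIGPIPE if s2 exits early.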
