Very Basic Examples of Text Processing with Python

阿新 • Published: 2019-02-09

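I have recently been running some experiments that require text processing: extracting key fields from text files, building a table from them, and analyzing the result. This post is a brief record of how I did it.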

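I. The requirement:

The input is the text output produced by a GPGPU-Sim run, and I need to extract the values of certain keys from it. For example, suppose the text contains:

A1 = B1
A2 = B2
A3 = B3
.....
A5 = B5

I need to extract the values B2 and B5 that correspond to A2 and A5 and write them to a text file in the format "B2 B5". How can this be done in Python? The fields to extract are: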

'gpu_sim_insn',
'gpu_ipc',
'L1I_total_cache_accesses',
'L1D_total_cache_accesses',
'gpgpu_n_tot_thrd_icount',
'gpgpu_n_tot_w_icount',
'gpgpu_n_mem_read_local',
'gpgpu_n_mem_write_local',
'gpgpu_n_mem_read_global',
'gpgpu_n_mem_write_global',
'gpgpu_n_mem_texture',
'gpgpu_n_mem_const',
'gpgpu_n_load_insn',
'gpgpu_n_store_insn',
'gpgpu_n_shmem_insn',
'gpgpu_n_tex_insn',
'gpgpu_n_const_mem_insn',
'gpgpu_n_param_mem_insn'

The code is as follows:

import os

# Directory that contains the files to process
path = 'D:\\GPUClusters\\Stargazer-master\\EXP_RESULT'
# Output file
fout = open("res.txt",'w')

x = [
     'gpu_sim_insn',
     'gpu_ipc',
     'L1I_total_cache_accesses',
     'L1D_total_cache_accesses',
     'gpgpu_n_tot_thrd_icount',
     'gpgpu_n_tot_w_icount',
     'gpgpu_n_mem_read_local',
     'gpgpu_n_mem_write_local',
     'gpgpu_n_mem_read_global',
     'gpgpu_n_mem_write_global',
     'gpgpu_n_mem_texture',
     'gpgpu_n_mem_const',
     'gpgpu_n_load_insn',
     'gpgpu_n_store_insn',
     'gpgpu_n_shmem_insn',
     'gpgpu_n_tex_insn',
     'gpgpu_n_const_mem_insn',
     'gpgpu_n_param_mem_insn'
     ]

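# Change into the directory that holds the result files
os.chdir(path)

# Iterate over every file in the directory
for filename in os.listdir():
    with open(filename,'r') as fs:
        # Process each line of the file
        for line in fs.readlines():
            a = line.split()
            if a != [] and a[0] in x:
                # The value is the last whitespace-separated token on the line
                fout.write(a[-1]+'\t')
                # The last wanted field ends the row for this file
                if a[0] == 'gpgpu_n_param_mem_insn':
                    fout.write('\n')
                    break

fout.write('\n')
fout.close()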
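The script above writes values in whatever order they appear in each file. As a hedged alternative, the sketch below first collects every `name = value` line into a dictionary and then emits the wanted fields in a fixed order. The helper names `parse_stats` and `format_row`, the `NA` placeholder, and the sample text are made up for illustration. The sketch assumes each statistic sits on its own line in `name = value` form, as in the requirement, and it keeps everything after the first `=` rather than only the last token.

```python
def parse_stats(text):
    """Collect 'name = value' lines into a dict (later lines overwrite earlier ones)."""
    stats = {}
    for line in text.splitlines():
        # partition splits on the first '=' only and never raises
        name, sep, value = line.partition('=')
        if sep and name.strip():
            stats[name.strip()] = value.strip()
    return stats


def format_row(stats, fields, missing='NA'):
    """Return the wanted values as one tab-separated row, in the order of `fields`."""
    return '\t'.join(stats.get(field, missing) for field in fields)


# Illustrative sample only, not real simulator output
sample = """A1 = B1
A2 = B2
A3 = B3
A5 = B5
"""

row = format_row(parse_stats(sample), ['A2', 'A5'])
print(row)
```

Because the row is built from the field list rather than from file order, a missing field shows up as a visible placeholder instead of silently shifting the columns.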

A few notes on questions raised by the code:

1. A directory contains multiple files, and each one has to be read and processed. How can this be done?

# Suppose d:\work holds the files you want to read; the code can be written like this:
import os
path = 'd:\\work' #or path = r'd:\work'
os.chdir(path)
for filename in os.listdir():
    with open(filename,'r') as file:
        for eachline in file.readlines():
            pass  # process eachline here
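One caveat with the loop above is that `os.listdir()` also returns subdirectory names, and `open()` fails on those. The following is a minimal sketch, not the author's method, that skips anything that is not a regular file and avoids changing the working directory. The generator name `iter_lines` and the temporary demo files are assumptions made for illustration.

```python
import os
import tempfile


def iter_lines(directory):
    """Yield (filename, line) for every regular file directly inside `directory`."""
    for filename in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, filename)
        # Skip subdirectories; open() would fail on them
        if not os.path.isfile(full_path):
            continue
        with open(full_path, 'r') as handle:
            for line in handle:
                yield filename, line.rstrip('\n')


# Small self-contained demo in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'a.txt'), 'w') as f:
        f.write('first\nsecond\n')
    os.mkdir(os.path.join(tmp, 'subdir'))
    found = list(iter_lines(tmp))

print(found)
```

Building the full path with `os.path.join` means the script's own output file can live elsewhere and is never picked up as input by accident.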

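2. What is the difference between .read(), .readline(), and .readlines() in Python?

Python makes it very easy to read the contents of a text file into a string variable you can work with. File objects provide three "read" methods: .read(), .readline(), and .readlines(). Each accepts an optional argument that limits how much data is read per call, but they are usually called without it. .read() reads the whole file at once and is typically used to put the file's contents into a single string variable. Although .read() gives the most direct string representation of the file, it is unnecessary for sequential, line-oriented processing, and it cannot be used at all if the file is larger than available memory.

.readline() and .readlines() are very similar. Both are used in constructs like the following:

Python .readlines() example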

fh = open('c:\\autoexec.bat')
for line in fh.readlines():
    print(line)
fh.close()
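To make the behaviour of the three read methods concrete, here is a small sketch that runs each of them on a throwaway three-line file. The temporary file and the variable names are assumptions for the demo. The last variant, iterating over the file object itself, is not discussed in the text but is a common way to read line by line.

```python
import os
import tempfile

# Create a small throwaway file to read from
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as f:
    f.write('one\ntwo\nthree\n')

with open(demo_path) as fh:
    whole = fh.read()          # the entire file as one string

with open(demo_path) as fh:
    first = fh.readline()      # only the first line, newline included
    second = fh.readline()     # the next call continues where the last one stopped

with open(demo_path) as fh:
    lines = fh.readlines()     # a list with one string per line

with open(demo_path) as fh:
    lazy = [line for line in fh]   # iterating the file object reads line by line

os.remove(demo_path)

print(repr(whole))
print(repr(first), repr(second))
print(lines)
```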

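The difference between .readline() and .readlines() is that the latter reads the whole file at once, as .read() does. .readlines() automatically splits the contents into a list of lines, which a Python for ... in ... loop can then process. .readline(), on the other hand, reads only one line per call, and calling it repeatedly is generally slower than a single .readlines() call. Use .readline() only when there is not enough memory to read the whole file at once.

3. The split method: http://www.w3cschool.cc/python/att-string-split.html

II. Another simple example. Given the following text file "record.txt":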

boy:what's your name?
girl:my name is lebaishi,what about you?
boy:my name is wahaha.
girl:i like your name.
==============================================
girl:how old are you?
boy:I'm 16 years old,and you?
girl:I'm 14.what is your favorite color?
boy:My favorite is orange.
girl:I like orange too!
==============================================
boy:where do you come from?
girl:I come from SH.
boy:My home is not far from you,I live in Jiangsu province.
girl:Let's be good friends.
boy:OK!

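Requirement: split the data in the file (record.txt) and save it according to the following rules:
-- save the boy's lines in separate boy_*.txt files (with the "boy:" prefix removed)
-- save the girl's lines in separate girl_*.txt files (with the "girl:" prefix removed)
-- the file contains three dialogues in total, which are saved as boy_1.txt, girl_1.txt, boy_2.txt, girl_2.txt, boy_3.txt, and girl_3.txt, six files in all (the dialogues are already separated by "=======" lines).

Code: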

boy_log = []
girl_log = []
version = 1

def save_to_file(boy_log,girl_log,version):
    filename_boy = 'boy_' + str(version) + ".txt"
    filename_girl = 'girl_' + str(version) + ".txt"
    # The with statement closes both files even if a write fails
    with open(filename_boy,"w") as fb, open(filename_girl,"w") as fg:
        fb.writelines(boy_log)
        fg.writelines(girl_log)

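def process(filename):
    # These module-level names are reassigned below, so declare them global up front
    global boy_log, girl_log, version
    with open(filename,"r") as file:
        for eachline in file.readlines():
            if eachline[:6] != "======":
                # Split on the first colon only, so colons inside a sentence are kept
                mylist = eachline.split(":", 1)
                if mylist[0] == "boy":
                    boy_log.append(mylist[-1])
                elif mylist[0] == "girl":
                    girl_log.append(mylist[-1])
            else:
                # A separator line ends the current dialogue: save it and start the next
                save_to_file(boy_log,girl_log,version)
                version += 1
                boy_log = []
                girl_log = []

    # Save the last dialogue, which has no trailing separator
    save_to_file(boy_log,girl_log,version)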

if __name__ == "__main__":
    fn = "record.txt"
    process(fn)
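The version above keeps its state in module-level globals. As a hedged alternative, the sketch below does the same split with plain return values, which makes the logic easy to check without touching the disk. The function names `split_dialogues` and `save_dialogues` and the shortened sample are assumptions for illustration, not part of the original code.

```python
def split_dialogues(lines):
    """Return a list of (boy_lines, girl_lines) pairs, one pair per dialogue."""
    dialogues = []
    boy, girl = [], []
    for line in lines:
        if line.startswith('======'):
            # Separator: close the current dialogue and start a new one
            dialogues.append((boy, girl))
            boy, girl = [], []
            continue
        speaker, sep, text = line.partition(':')
        if not sep:
            continue  # no colon, e.g. a blank line
        if speaker == 'boy':
            boy.append(text)
        elif speaker == 'girl':
            girl.append(text)
    dialogues.append((boy, girl))  # the last dialogue has no trailing separator
    return dialogues


def save_dialogues(dialogues):
    """Write boy_N.txt and girl_N.txt for each dialogue, numbering from 1."""
    for number, (boy, girl) in enumerate(dialogues, start=1):
        with open('boy_%d.txt' % number, 'w') as fb:
            fb.writelines(boy)
        with open('girl_%d.txt' % number, 'w') as fg:
            fg.writelines(girl)


# Shortened sample in the same shape as record.txt
sample = [
    "boy:what's your name?\n",
    "girl:my name is lebaishi,what about you?\n",
    "======\n",
    "girl:how old are you?\n",
    "boy:I'm 16 years old,and you?\n",
]
result = split_dialogues(sample)
print(result)
```

To process the real file, one could open record.txt in a `with` block and call `save_dialogues(split_dialogues(f))`, since a file object yields its lines when iterated.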

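Both examples are very basic but also very practical, so I am recording them here for future reference. One more simple requirement: I need to get the IPv4 address of eth0 on a Linux machine. The code is as follows: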

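#!/usr/bin/python

import os

# Dump the ifconfig output to a file, then parse that file
os.system("ifconfig > ip.info")

def get_ip():
    # flag is 1 while we are inside the eth0 block of the output.
    # It must be a local initialized here: assigning to it inside the
    # function would otherwise hide a module-level flag.
    flag = 0
    with open("ip.info",'r') as fs:
        for line in fs.readlines():
            a = line.split()
            if a != [] and a[0] == "eth0":
                flag = 1
            if a != [] and a[0] == "lo":
                flag = 0

            if flag == 0:
                continue
            # Assumes the older ifconfig layout: "inet addr:192.168.1.2 ..."
            for item in a:
                if a[0] == "inet" and item[0:5] == "addr:":
                    return item[5:]

ip = get_ip()
print(ip)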
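The script above depends on the exact `inet addr:` layout. The sketch below is a hedged variation that works on the text itself, so it can be tested without running ifconfig, and that also accepts the layout where the line reads `inet <address>` and the interface name ends with a colon. The function name `extract_ipv4` and both sample strings are hand-written assumptions that imitate the two layouts; real output on a given system may differ.

```python
import re


def extract_ipv4(ifconfig_text, interface='eth0'):
    """Return the IPv4 address listed under `interface`, or None if not found."""
    in_block = False
    for line in ifconfig_text.splitlines():
        if line and not line[0].isspace():
            # An unindented line starts a new interface block
            name = line.split()[0].rstrip(':')
            in_block = (name == interface)
        if in_block:
            # Matches "inet addr:1.2.3.4" and "inet 1.2.3.4", but not "inet6 ..."
            match = re.search(r'\binet (?:addr:)?(\d+\.\d+\.\d+\.\d+)', line)
            if match:
                return match.group(1)
    return None


# Hand-written samples that imitate two ifconfig layouts (not captured output)
old_style = (
    "eth0      Link encap:Ethernet  HWaddr 00:11:22:33:44:55\n"
    "          inet addr:192.168.1.10  Bcast:192.168.1.255  Mask:255.255.255.0\n"
    "\n"
    "lo        Link encap:Local Loopback\n"
    "          inet addr:127.0.0.1  Mask:255.0.0.0\n"
)
new_style = (
    "eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500\n"
    "        inet 10.0.0.5  netmask 255.255.255.0  broadcast 10.0.0.255\n"
)

print(extract_ipv4(old_style))
print(extract_ipv4(new_style))
```

In a real script the text could come from `subprocess.run(['ifconfig'], capture_output=True, text=True).stdout` (Python 3.7 or later) instead of a temporary ip.info file.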
