Address Synthesis Script in Practice

阿新 • Published: 2018-12-31

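Key topics

Reading and writing files
Encoding issues
File and directory operations
Time and date operations
Iterators and generators
Command-line parsing
Formatted output
Random numbers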

Goal: write a command-line tool that randomly generates a large volume of address JSON data for use in real-world address data testing.

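Reading and writing files

Python has built-in functions for reading and writing files, and their usage closely mirrors the C file API.

To read a file, you can use try...finally to make sure the file is closed:

f = None
try:
    f = open('/path/to/file', 'r')
    print(f.read())
finally:
    # f stays None if open() itself failed
    if f:
        f.close()

That is tedious. Use with instead; it calls close() automatically to close the stream:

with open('/path/to/file', 'r') as f:
    print(f.read())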

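Read calls

f.read() reads the entire file at once, which can exhaust memory on large files.
f.read(size) reads at most size bytes per call, the safer choice when the file size is unknown.
f.readline() reads one line per call.
f.readlines() reads everything at once and returns the lines as a list.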
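The read calls above can be compared side by side. The following is a minimal Python 3 sketch; the temporary file name demo.txt and its contents are made up for illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("line1\nline2\nline3\n")

# read(size): read at most `size` characters per call (text mode).
chunks = []
with open(path, "r", encoding="utf-8") as f:
    while True:
        chunk = f.read(8)
        if not chunk:  # an empty string signals end of file
            break
        chunks.append(chunk)

# readline(): one line per call, trailing newline included.
with open(path, "r", encoding="utf-8") as f:
    first_line = f.readline()

# readlines(): every line at once, as a list.
with open(path, "r", encoding="utf-8") as f:
    all_lines = f.readlines()

print(chunks)
print(repr(first_line))
print(all_lines)
```

Reading in fixed-size chunks keeps memory use bounded regardless of file size, at the cost of having to reassemble lines yourself.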

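file-like Object

Objects such as the one returned by open(), which have a read() method, are collectively called file-like Objects in Python. Besides files, they can be in-memory byte streams, network streams, custom streams, and so on. A file-like Object does not need to inherit from any particular class; providing a read() method is enough. StringIO is a file-like Object created in memory and is often used as a temporary buffer.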
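To make the duck-typing point concrete, here is a small Python 3 sketch. The helper count_lines and the class FixedStream are hypothetical names invented for this example; io.StringIO is where StringIO lives in Python 3.

```python
import io


def count_lines(stream):
    """Count lines from anything that offers a read() method."""
    return len(stream.read().splitlines())


# io.StringIO is an in-memory text stream.
buffer = io.StringIO("first\nsecond\nthird\n")
n_lines = count_lines(buffer)


class FixedStream:
    """A custom stream: no special base class, just read()."""

    def read(self):
        return "a\nb\n"


n_custom = count_lines(FixedStream())
print(n_lines, n_custom)
```

Because count_lines only calls read(), it works equally well with a real file, a StringIO buffer, or the custom class.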

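Binary files

f = open('/Users/michael/test.jpg', 'rb')

In Python 2, open() in text mode returns the file content as a byte string (str) without decoding it, and ASCII is the default codec for implicit conversions between str and unicode, so non-ASCII text has to be decoded explicitly.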

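Character encoding

To read a text file that is not ASCII-encoded (in Python 2), open it in binary mode and then decode the bytes.

Alternatively, codecs.open() decodes while reading. The code below opens a GBK-encoded file; without an explicit encoding you would have to open the file in 'rb' mode and decode the bytes yourself.
```
import codecs
with codecs.open('/Users/michael/gbk.txt', 'r', 'gbk') as f:
    f.read()  # u'\u6d4b\u8bd5'
```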

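Writing files

Writing works the same way as reading; just remember the mode flags:

'w' writes in overwrite mode (existing content is truncated)
'a' writes in append mode
'+' adds read/write access, as in w+
'rb' opens the file for reading in binary mode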
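The effect of the mode flags is easiest to see by running them in sequence. This is a minimal Python 3 sketch against a temporary file; the file name modes.txt is arbitrary.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")

with open(path, "w", encoding="utf-8") as f:  # 'w' creates the file
    f.write("first\n")
with open(path, "w", encoding="utf-8") as f:  # a second 'w' overwrites it
    f.write("second\n")
with open(path, "a", encoding="utf-8") as f:  # 'a' appends to the end
    f.write("third\n")

with open(path, "r", encoding="utf-8") as f:
    content = f.read()

with open(path, "rb") as f:  # binary mode returns bytes, not str
    raw = f.read()

print(content)
print(type(raw))
```

The first write is lost because the second 'w' truncates the file before writing.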

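Detecting the encoding

import os
import chardet  # third-party package
# Encoding detection: sample the first bytes of the file.
# file_path is the path of the file to inspect.
sample_size = min(32, os.path.getsize(file_path))
with open(file_path, 'rb') as f:
    raw = f.read(sample_size)
encoding_type = chardet.detect(raw)['encoding']

encoding_type is chardet's best guess at the file's encoding (for example 'utf-8'); chardet.detect() itself returns a dict that also carries a 'confidence' value.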

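Encoding issues in Python

The # -*- coding: utf-8 -*- line at the top of a source file declares that the Python source is encoded in UTF-8. The two Python 2 string types, str and unicode, are converted into each other with the encode / decode methods:

Encoding: encode
u.encode('utf-8')  # encode as UTF-8: unicode -> UTF-8 bytes
Decoding: decode
print(s.decode('utf-8'))  # decode from UTF-8: UTF-8 bytes -> unicode

Python treats unicode as the one internal representation of characters, whereas the commonly used character sets such as gb2312, gb18030/gbk, utf-8, and ascii are all binary (byte) encodings of characters. Converting characters from unicode to a byte encoding is therefore encode, and the reverse direction is decode.

When an operation mixes str and unicode, Python 2 always converts the str to unicode first (using the ASCII codec by default, which fails on non-ASCII bytes). The recommendation is to prefix Chinese string literals in code with u:
```
u = u'關關雎鳩'
```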
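The same encode / decode relationship carries over to Python 3, where str holds Unicode text and bytes holds the encoded form. The sketch below is a Python 3 illustration rather than the Python 2 code discussed above; it reuses the two characters \u6d4b\u8bd5 from the codecs example.

```python
text = "\u6d4b\u8bd5"               # str: a sequence of Unicode code points
utf8_bytes = text.encode("utf-8")   # str -> bytes
gbk_bytes = text.encode("gbk")      # same text, different byte encoding

# Decoding with the matching codec restores the original text.
assert utf8_bytes.decode("utf-8") == text
assert gbk_bytes.decode("gbk") == text

# Decoding with the wrong codec raises an error here.
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(len(utf8_bytes), len(gbk_bytes), wrong_codec_failed)
```

The byte lengths differ because UTF-8 uses three bytes for each of these characters while GBK uses two.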

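File and directory operations

One problem you will run into here: when a path contains Chinese characters and its encoding is not handled properly, the right file cannot be located. In Python 2 you can use unicode(path, "utf-8") to decode a UTF-8 byte-string path into unicode. The two modules most often used for files and folders are os and shutil.

os.getcwd() returns the current working directory, i.e. the directory the Python script is running in
os.listdir() returns the names of all files and directories in the given directory
os.path.split() splits a path into its directory name and file name

e.g. os.path.split('/home/swaroop/byte/code/poem.txt')
returns ('/home/swaroop/byte/code', 'poem.txt')
os.path.splitext() splits off the file extension
os.path.dirname() returns the directory part of a path
os.path.basename() returns the file name
os.getenv() and os.putenv() read and set environment variables
os.linesep gives the line terminator of the current platform: '\r\n' on Windows, '\n' on Linux and macOS ('\r' only on classic Mac OS)
os.name indicates the platform in use: 'nt' on Windows, 'posix' on Linux/Unix
os.rename(old, new) renames a file or directory
os.makedirs(r"c:\python\test") creates nested directories; os.mkdir("test") creates a single directory
os.stat(file) returns file attributes
os.chmod(file, mode) changes file permissions
os._exit(n) terminates the current process immediately; sys.exit() is the usual way to exit
os.path.getsize(filename) returns the file size in bytes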
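A few of the path helpers listed above, applied to the same sample path. This is a minimal sketch; the path is only parsed as a string and does not need to exist.

```python
import os

path = "/home/swaroop/byte/code/poem.txt"

# split() separates the directory from the file name.
directory, filename = os.path.split(path)

# splitext() separates the extension from the rest of the name.
stem, extension = os.path.splitext(filename)

print(directory)
print(filename, stem, extension)
print(os.path.dirname(path) == directory)
print(os.path.basename(path) == filename)

# These calls do touch the real environment.
cwd = os.getcwd()
entries = os.listdir(cwd)
print(cwd, len(entries))
```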

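Usage examples

Create an empty file: os.mknod("test.txt") (Unix only)
Truncate a file to a given size: fp.truncate([size]). For line-based work, the quick route is to move the file to a Linux server and use the wc command.
Move the file position to offset: fp.seek(offset)
Copy files
- shutil.copyfile("oldfile", "newfile") # oldfile and newfile must both be files
- shutil.copy("oldfile", "newfile") # oldfile must be a file; newfile may be a file or a target directory
Copy a folder: shutil.copytree("olddir", "newdir") # olddir and newdir must both be directories, and newdir must not exist yet
Move a file or directory: shutil.move("oldpos", "newpos")
Delete
- os.remove("file") # deletes a file
- os.rmdir("dir") # only deletes an empty directory
- shutil.rmtree("dir") # deletes a directory whether it is empty or not
Change directory
- os.chdir("path") # changes the working directory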
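The copy, move, and delete calls above can be tried safely inside a scratch directory. The sketch below is one possible sequence; all file and directory names are placeholders.

```python
import os
import shutil
import tempfile

root = tempfile.mkdtemp()  # scratch directory for the experiment
src = os.path.join(root, "oldfile.txt")
with open(src, "w", encoding="utf-8") as f:
    f.write("hello")

# copyfile: file -> file
shutil.copyfile(src, os.path.join(root, "newfile.txt"))

# copy: the destination may be an existing directory
backup_dir = os.path.join(root, "backup")
os.mkdir(backup_dir)
shutil.copy(src, backup_dir)

# copytree: the target directory must not exist yet
copy_dir = os.path.join(root, "backup_copy")
shutil.copytree(backup_dir, copy_dir)

# move a file, then delete a file and a non-empty directory
shutil.move(os.path.join(root, "newfile.txt"), os.path.join(root, "moved.txt"))
os.remove(src)
shutil.rmtree(copy_dir)

remaining = sorted(os.listdir(root))
print(remaining)

shutil.rmtree(root)  # clean up the scratch directory
```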

Recursively reading all files under a folder

import os

# include_file_suffix_name and exclude_file_name are module-level defaults
# that must be defined before this function.
def get_all_path(path, file_list, include_suffix_name=include_file_suffix_name, exclude_file_name=exclude_file_name):
    # path = path.replace("\\", "/")
    if os.path.exists(path):
        for sub_path in os.listdir(path):
            m_path = os.path.join(path, sub_path)
            if os.path.isdir(m_path):
                # Pass the filters down so subdirectories use the same rules.
                get_all_path(m_path, file_list, include_suffix_name, exclude_file_name)
            else:
                # os.path.splitext(sub_path) splits the name from its extension
                file_name, suffix_name = os.path.splitext(sub_path)
                if suffix_name in include_suffix_name and file_name not in exclude_file_name:
                    file_list.append(m_path)
    else:
        print('error! the input path does not exist')
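As an alternative to hand-written recursion, the standard library's os.walk can do the directory traversal. The sketch below is not the author's function; collect_files and its parameters are names chosen for this example, and the small directory tree is built only to have something to scan.

```python
import os
import tempfile


def collect_files(root, include_suffixes, exclude_names=()):
    """Return sorted paths under root whose extension is in include_suffixes
    and whose name without extension is not in exclude_names."""
    matches = []
    for dir_path, _dir_names, file_names in os.walk(root):
        for name in file_names:
            stem, suffix = os.path.splitext(name)
            if suffix in include_suffixes and stem not in exclude_names:
                matches.append(os.path.join(dir_path, name))
    return sorted(matches)


# Build a small tree to try it on.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub"))
for rel in ("a.json", "skip.json", "notes.txt", os.path.join("sub", "b.json")):
    open(os.path.join(root, rel), "w").close()

found = collect_files(root, {".json"}, exclude_names={"skip"})
relative = [os.path.relpath(p, root) for p in found]
print(relative)
```

Returning a new list instead of appending to a caller-supplied one avoids shared mutable state, which is a design choice of this sketch rather than a requirement.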

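Time and date operations

Python has a library called the datetime module. datetime is the module, and the module also contains a class named datetime.

Getting the current time

from datetime import datetime
now = datetime.now()  # the current datetime

Creating a specific date and time

t = datetime(2015, 4, 19, 12, 20)  # build a datetime from the given date and time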

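Converting datetime to a timestamp (a float)

Computer time is measured relative to the epoch, 1970-01-01 00:00:00 UTC+00:00; a timestamp is the number of seconds elapsed since that moment.

>>> from datetime import datetime
>>> dt = datetime(2015, 4, 19, 12, 20)  # build a datetime from the given date and time
>>> dt.timestamp()  # convert the datetime to a timestamp
1429417200.0

A datetime without a time zone is interpreted as local time, so the value shown assumes a machine set to UTC+8. To convert a timestamp back to a datetime, use datetime.fromtimestamp(t).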

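Converting str to datetime

A standard datetime object can be parsed from a string:

>>> cday = datetime.strptime('2015-6-1 18:19:59', '%Y-%m-%d %H:%M:%S')
>>> print(cday)
2015-06-01 18:19:59

Converting datetime to str

The key part is the format string: %Y-%m-%d %H:%M:%S stands for year, month, day, hour, minute, and second. Other directives work the same way, as the following shows (%a is the abbreviated weekday name, %b the abbreviated month name):

now.strftime('%a, %b %d %H:%M')  # Mon, May 05 16:28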
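A round trip through strptime, strftime, and timestamp shows how the conversions fit together. In this sketch the time zone is attached explicitly as UTC so that the timestamp does not depend on the machine's local zone; that choice is an assumption of the example, not something the text above prescribes.

```python
from datetime import datetime, timezone

fmt = "%Y-%m-%d %H:%M:%S"

# str -> datetime -> str
parsed = datetime.strptime("2015-6-1 18:19:59", fmt)
formatted = parsed.strftime(fmt)

# datetime -> timestamp -> datetime, pinned to UTC
aware = parsed.replace(tzinfo=timezone.utc)
ts = aware.timestamp()
restored = datetime.fromtimestamp(ts, tz=timezone.utc)

print(formatted)
print(ts)
print(restored == aware)
```

Note that strftime zero-pads the month and day, so the formatted string is normalized compared with the input.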

Adding to and subtracting from a datetime

Import the timedelta class to shift a datetime forward or backward with + and -:

>>> from datetime import datetime, timedelta
>>> now = datetime.now()
>>> now
datetime.datetime(2015, 5, 18, 16, 57, 3, 540997)
>>> now + timedelta(hours=10)
datetime.datetime(2015, 5, 19, 2, 57, 3, 540997)
>>> now - timedelta(days=1)
datetime.datetime(2015, 5, 17, 16, 57, 3, 540997)
>>> now + timedelta(days=2, hours=12)
datetime.datetime(2015, 5, 21, 4, 57, 3, 540997)
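timedelta also works in the other direction: subtracting two datetime objects yields a timedelta. A small sketch with fixed dates, chosen here so the results are reproducible:

```python
from datetime import datetime, timedelta

start = datetime(2015, 5, 18)

# Build a run of consecutive days by adding multiples of one day.
days = [start + timedelta(days=i) for i in range(3)]

# Subtracting two datetimes gives a timedelta.
diff = datetime(2015, 5, 21, 4, 0) - start

print([d.strftime("%Y-%m-%d") for d in days])
print(diff.days, diff.total_seconds())
```

This pattern is handy for generating test records spread over a date range.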

Converting local time to UTC

A datetime has a time zone attribute, tzinfo, but it defaults to None, so there is no way to tell which time zone the datetime belongs to unless you set one explicitly:

>>> from datetime import datetime, timedelta, timezone
>>> tz_utc_8 = timezone(timedelta(hours=8))  # create the time zone UTC+8:00
>>> now = datetime.now()
>>> now
datetime.datetime(2015, 5, 18, 17, 2, 10, 871012)
>>> dt = now.replace(tzinfo=tz_utc_8)  # force the time zone to UTC+8:00
>>> dt
datetime.datetime(2015, 5, 18, 17, 2, 10, 871012, tzinfo=datetime.timezone(datetime.timedelta(0, 28800)))

Converting between time zones

First get the current UTC time with utcnow(), then convert it to any other time zone:

# Get the UTC time and force its time zone to UTC+0:00:
>>> utc_dt = datetime.utcnow().replace(tzinfo=timezone.utc)
>>> print(utc_dt)
2015-05-18 09:05:12.377316+00:00
# astimezone() converts the time zone to Beijing time:
>>> bj_dt = utc_dt.astimezone(timezone(timedelta(hours=8)))
>>> print(bj_dt)
2015-05-18 17:05:12.377316+08:00
# astimezone() converts the time zone to Tokyo time:
>>> tokyo_dt = utc_dt.astimezone(timezone(timedelta(hours=9)))
>>> print(tokyo_dt)
2015-05-18 18:05:12.377316+09:00
# astimezone() converts bj_dt to Tokyo time:
>>> tokyo_dt2 = bj_dt.astimezone(timezone(timedelta(hours=9)))
>>> print(tokyo_dt2)
2015-05-18 18:05:12.377316+09:00

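Iterators and generators

Generators

A generator stores an algorithm: each call to next(g) computes the next element of g, and once the last element has been produced and nothing is left, StopIteration is raised.

There are two ways to create one. The first is simple: change the [] of a list comprehension to (), e.g. >>> g = (x*x for x in range(1,10))
The second uses the yield keyword: a function whose definition contains yield is no longer an ordinary function but a generator function.

Calling next(g) produces just the next value each time it is needed. You can also iterate with for. Take the famous Fibonacci sequence, where every number except the first and second is the sum of the previous two: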

def fib(max):
    n, a, b = 0, 0, 1
    while n < max:
        yield b
        a, b = b, a + b
        n = n + 1
    return 'done'
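The following sketch repeats the fib definition so that it runs on its own, then shows both ways of consuming it. The return value of a generator is not produced by iteration; it travels on the StopIteration exception.

```python
def fib(max):
    n, a, b = 0, 0, 1
    while n < max:
        yield b
        a, b = b, a + b
        n = n + 1
    return 'done'


# Iterating (here via list) collects the yielded values only.
first_six = list(fib(6))

# Stepping manually with next() exposes the return value.
g = fib(2)
values = [next(g), next(g)]
try:
    next(g)
    return_value = None
except StopIteration as stop:
    return_value = stop.value

print(first_six)
print(values, return_value)
```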

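Iterables and iterators

An iterator is always an iterable, but an iterable is not necessarily an iterator: a list, for example, is iterable, yet you only get an iterator from it by calling iter().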
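The distinction can be checked with the abstract base classes in collections.abc. A minimal Python 3 sketch:

```python
from collections.abc import Iterable, Iterator

numbers = [1, 2, 3]

# A list can be looped over, but it is not itself an iterator.
list_is_iterable = isinstance(numbers, Iterable)
list_is_iterator = isinstance(numbers, Iterator)

# iter() produces an iterator, which is consumed as it is read.
it = iter(numbers)
iter_is_iterator = isinstance(it, Iterator)
first = next(it)
rest = list(it)
exhausted = list(it)  # nothing left the second time

print(list_is_iterable, list_is_iterator, iter_is_iterator)
print(first, rest, exhausted)
```

The list can be iterated again at any time, whereas the iterator is exhausted after one pass.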

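The map and reduce functions

These two functions are very handy for processing sequences; given two lists A and B of equal length, map can also take both, as in map(f, A, B), and apply a two-argument function pairwise.

In its basic form, map() takes two arguments, a function and an Iterable; it applies the function to each element of the sequence in turn and returns the results as a new Iterator.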
```
>>> def f(x):
...     return x * x
...
```
```
>>> r = map(f, [1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> list(r)
[1, 4, 9, 16, 25, 36, 49, 64, 81]
```
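f can of course also be written as a lambda expression: f = lambda x: x * x
reduce applies a function to a sequence [x1, x2, x3, ...]; the function must take two arguments, and reduce carries the result forward, combining it with the next element of the sequence each time (in Python 3, import it with from functools import reduce):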
```
reduce(f, [x1, x2, x3, x4]) = f(f(f(x1, x2), x3), x4)
```
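A short Python 3 sketch of both functions. The digit-folding lambda is just one illustrative choice of two-argument function.

```python
from functools import reduce

digits = [1, 3, 5, 7, 9]

# reduce folds the list from the left: f(f(f(f(1, 3), 5), 7), 9)
number = reduce(lambda acc, d: acc * 10 + d, digits)

# An optional third argument supplies the starting value.
total = reduce(lambda x, y: x + y, digits, 0)

# map with one iterable, and with two iterables consumed pairwise.
squares = list(map(lambda x: x * x, digits))
pair_sums = list(map(lambda a, b: a + b, [1, 2, 3], [10, 20, 30]))

print(number, total)
print(squares, pair_sums)
```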

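Command-line parsing

Judging from what is available online, two approaches stand out: the argparse parser and docopt. argparse is part of the Python standard library, whereas docopt, though powerful, has to be installed separately.

Using argparse

There are three main steps: 1. create an argparse parser object, 2. add arguments, 3. parse the arguments.

Creating a parser

Usage: parser = argparse.ArgumentParser(description='Process some integers.'). The ArgumentParser object holds all the information needed to parse the command line into Python data types. The official help text follows: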

Definition : ArgumentParser(...)

Type : Function of argparse module

Object for parsing command line strings into Python objects.

Keyword Arguments:
prog – The name of the program (default: sys.argv[0]); available in help messages through the %(prog)s format specifier
usage – A usage message (default: auto-generated from arguments)
description – A description of what the program does 
epilog – Text following the argument descriptions
parents – Parsers whose arguments should be copied into this one
formatter_class – HelpFormatter class for printing help messages
prefix_chars – Characters that prefix optional arguments (default: '-')
fromfile_prefix_chars – Characters that prefix files containing additional arguments (default: None)
argument_default – The default value for all arguments
conflict_handler – String indicating how to handle conflicts
add_help – Add a -h/--help option (default: True)
allow_abbrev – Allow long options to be abbreviated unambiguously

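Adding arguments

ArgumentParser.add_argument(name or flags...[, action][, nargs][, const][, default][, type][, choices][, required][, help][, metavar][, dest])
Defines how a single command-line argument should be parsed. Each parameter has its own detailed description; briefly:

name or flags - a name or a list of option strings, e.g. foo or -f, --foo.
action - the basic type of action to take when this argument is encountered on the command line.
nargs - the number of command-line arguments that should be consumed.
const - a constant value required by some action and nargs selections.
default - the value produced if the argument is absent from the command line.
type - the type to which the command-line argument should be converted.
choices - a container of the allowable values for the argument.
required - whether the command-line option may be omitted (optional arguments only).
help - a brief description of the argument.
metavar - the name of the argument in help messages.
dest - the name of the attribute to add to the object returned by parse_args().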

Parsing arguments

By default, parse_args() takes the argument strings from sys.argv and creates a new, empty Namespace object to hold the attributes.
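The three steps can be sketched end to end as follows. The program name addrgen and every option shown (output, --count, --format, --verbose) are hypothetical, chosen to resemble what an address-generation tool might accept. An explicit list is passed to parse_args() so the example runs without a real command line.

```python
import argparse


def build_parser():
    # Step 1: create the parser.
    parser = argparse.ArgumentParser(
        prog="addrgen",
        description="Generate random address JSON records.")
    # Step 2: add arguments.
    parser.add_argument("output", help="path of the file to write")
    parser.add_argument("-n", "--count", type=int, default=10,
                        help="number of records to generate")
    parser.add_argument("--format", dest="fmt", choices=["json", "jsonl"],
                        default="json", help="output format")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print progress information")
    return parser


parser = build_parser()

# Step 3: parse. Passing a list replaces the default of sys.argv[1:].
args = parser.parse_args(["out.json", "-n", "500", "--verbose"])
defaults = parser.parse_args(["out.json"])

print(args)
print(defaults)
```

Here type=int converts the string "500" to an integer, dest="fmt" stores --format under args.fmt, and action="store_true" turns --verbose into a boolean flag.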

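Formatted output

The str.rjust() method right-justifies a string, padding with spaces on the left by default; str.ljust() and str.center() work similarly.
str.format() is the most commonly used. It uses {} placeholders; an optional ':' and format specifier can follow the field name for finer control over formatting.

import math
print('We are the {} who say "{}!"'.format('knights', 'Ni'))
# positional indices
print('{0} and {1}'.format('Kobe', 'James'))
# keyword arguments
print('The {thing} is {adj}.'.format(thing='flower', adj='beautiful'))
# fixed number of decimal places
print('The value of PI is {0:.3f}.'.format(math.pi))

The % operator can also format strings. It treats the left operand as a sprintf()-style format string and substitutes the right operand into it:

print('The value of PI is %10.3f.' % math.pi)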
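The justification methods are mainly useful for lining up columns. A small sketch with made-up rows:

```python
# Hypothetical sample rows: (city, count)
rows = [("Beijing", 12), ("Shanghai", 345), ("Xi'an", 6)]

lines = []
for city, count in rows:
    # Left-justify the name in 10 columns, right-justify the number in 5.
    lines.append(city.ljust(10) + str(count).rjust(5))

# center() accepts an optional fill character.
centered = "title".center(11, "*")

# The same alignment expressed with str.format().
same = "{:<10}{:>5}".format("Beijing", 12)

for line in lines:
    print(line)
print(centered)
```

The format-spec form ({:<10}, {:>5}) is usually more compact when several columns are involved.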

Random numbers

import random
print("A random integer between 0 and 99:", random.randint(0, 99))
print("A random even number between 0 and 100:", random.randrange(0, 101, 2))
print("A random float in [0, 1):", random.random())
a, b = 0, 5
print("A random float between a and b:", random.uniform(a, b))
print("One random element from a range:", random.choice(range(1, 3)))
print("Random elements with replacement (a list, k=1 by default):", random.choices(['one', 'two', 'three', 'four']))
print("3 distinct characters sampled from a string:", random.sample('abcdefghijk', 3))
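Tying the pieces back to the stated goal, the sketch below combines a generator, the random module, and json to produce address records. It is only an illustration under stated assumptions: the city and street lists, the record fields, and the function name random_addresses are all invented here and are not the author's actual script or data.

```python
import json
import random

# Hypothetical sample data, invented for this sketch.
CITIES = ["Beijing", "Shanghai", "Guangzhou"]
STREETS = ["Zhongshan Road", "Renmin Road", "Jiefang Road"]


def random_addresses(count, seed=None):
    """Yield `count` random address dicts; a seed makes runs repeatable."""
    rng = random.Random(seed)  # private generator, independent of global state
    for index in range(count):
        yield {
            "id": index,
            "city": rng.choice(CITIES),
            "street": rng.choice(STREETS),
            "number": rng.randint(1, 999),
        }


records = list(random_addresses(3, seed=42))

# ensure_ascii=False would keep non-ASCII text readable in the output.
payload = json.dumps(records, ensure_ascii=False, indent=2)
print(payload)
```

Using a generator means large batches can be written out record by record instead of being held in memory, and a fixed seed makes a test data set reproducible.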

Reference:
Liao Xuefeng (廖雪峰) Python tutorial