Python Operations on HDFS (Part 1)

阿新 • Published: 2018-12-17

HDFS

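What HDFS is:

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has a lot in common with existing distributed file systems, but the differences are just as clear: HDFS is highly fault-tolerant and is meant to be deployed on inexpensive machines. It provides high-throughput data access and is well suited to very large data sets. HDFS relaxes some POSIX (https://baike.baidu.com/item/POSIX/3792413?fr=aladdin) constraints in order to allow streaming reads of file system data. HDFS was originally developed as infrastructure for the Apache Nutch search engine project. It is now part of the Apache Hadoop Core project, whose address is http://hadoop.apache.org/core/.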

Installation:

This post covers using HDFS from Python, so the corresponding package has to be installed first:

pip install hdfs

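Basic usage:

Client --- connect to the cluster:

from hdfs import Client
# "hdfsip" is a placeholder for the NameNode host; 50070 is the default NameNode HTTP (WebHDFS) port in Hadoop 2.x
client = Client("http://hdfsip:50070")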

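Parameters:

    class hdfs.client.Client(url, root=None, proxy=None, timeout=None, session=None)

    url: address of the NameNode, in the form protocol://ip:port

    root: the HDFS root directory; relative paths passed to the client are resolved against it

    proxy: the user to proxy as, i.e. the identity the requests are made under

    timeout: how long to wait on a connection before giving up

    session: a requests.Session instance used to issue the requests (I have not needed this parameter so far and will add more detail once I have)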
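As a minimal sketch of how these parameters fit together, the helpers below build the URL and pass root and timeout through to Client. The names namenode_url and make_client are hypothetical, hdfsip is the same placeholder host used above, and the hdfs import is deferred so that the URL helper can be tried without the package installed.

```python
def namenode_url(host, port=50070, scheme="http"):
    """Build the base URL that Client expects (protocol://host:port)."""
    return "{}://{}:{}".format(scheme, host, port)


def make_client(host, root=None, timeout=None):
    """Create an hdfs Client; root and timeout are passed straight through."""
    from hdfs import Client  # deferred import: requires `pip install hdfs`
    return Client(namenode_url(host), root=root, timeout=timeout)
```

For instance, make_client("hdfsip", root="/user/demo", timeout=10) should resolve a relative path such as "data" against the made-up root /user/demo.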

dir --- list all methods supported by Client:

>>> dir(client)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__',
 '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__registry__',
 '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_append', '_create', '_delete',
 '_get_content_summary', '_get_file_checksum', '_get_file_status', '_get_home_directory', '_list_status', '_mkdirs', '_open',
 '_proxy', '_rename', '_request', '_session', '_set_owner', '_set_permission', '_set_replication', '_set_times', '_timeout',
 'checksum', 'content', 'delete', 'download', 'from_options', 'list', 'makedirs', 'parts', 'read', 'rename', 'resolve', 'root',
 'set_owner', 'set_permission', 'set_replication', 'set_times', 'status', 'upload', 'url', 'walk', 'write']

status --- get detailed information about a path:

>>> client.status("/")
{'accessTime': 0, 'pathSuffix': '', 'group': 'supergroup', 'type': 'DIRECTORY', 'owner': 'root', 'childrenNum': 4, 'blockSize': 0,
 'fileId': 16385, 'length': 0, 'replication': 0, 'storagePolicy': 0, 'modificationTime': 1473023149031, 'permission': '777'}

Parameters:

    status(hdfs_path, strict=True)

    hdfs_path: the HDFS path

    strict: when True, an exception is raised if the HDFS path does not exist;
            when False, None is returned if the HDFS path does not exist
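The strict=False behaviour makes an existence check straightforward. The following is a small sketch; path_exists and is_directory are hypothetical helper names, and client is assumed to be a Client created as shown earlier.

```python
def path_exists(client, hdfs_path):
    """Return True if hdfs_path exists, using status(..., strict=False)."""
    return client.status(hdfs_path, strict=False) is not None


def is_directory(client, hdfs_path):
    """Return True only when the path exists and its type is DIRECTORY."""
    info = client.status(hdfs_path, strict=False)
    return info is not None and info["type"] == "DIRECTORY"
```

Checking for None this way avoids wrapping every lookup in a try/except block.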

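list --- list the entries under a path:

>>> client.list("/")
['test01', 'test02', 'test03']

Parameters:

    list(hdfs_path, status=False)

    hdfs_path: the HDFS path

    status: when True, the status information of each entry is returned along with its name; defaults to False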
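With status=True each entry comes back as a (name, status) pair, which can be used to filter by type. A rough sketch, where list_subdirs is a hypothetical helper name and the status dictionaries are assumed to carry the same 'type' key shown in the status output earlier:

```python
def list_subdirs(client, hdfs_path):
    """Return the names of the directories directly under hdfs_path."""
    # With status=True, list() returns (name, status_dict) pairs.
    return [
        name
        for name, info in client.list(hdfs_path, status=True)
        if info["type"] == "DIRECTORY"
    ]
```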

makedirs --- create a directory:

>>> client.makedirs("test06")

Parameters:

    makedirs(hdfs_path, permission=None)

    hdfs_path: the directory to create

    permission: the permissions to set on the newly created directory
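Combining makedirs with status gives a "create only if missing" helper. This is a sketch under assumptions: ensure_dir is a hypothetical name, and the octal string "755" is only an illustrative default for permission.

```python
def ensure_dir(client, hdfs_path, permission="755"):
    """Create hdfs_path if it is missing; return True when it was created."""
    if client.status(hdfs_path, strict=False) is not None:
        return False  # already there, so leave it and its permissions alone
    client.makedirs(hdfs_path, permission=permission)
    return True
```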

rename --- rename a path:

>>> client.rename("/test","/new_name")
>>> client.list("/")
['file', 'gyt', 'hbase', 'new_name', 'tmp']

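Parameters:

    rename(hdfs_src_path, hdfs_dst_path)

    hdfs_src_path: the existing HDFS path

    hdfs_dst_path: the new HDFS path (both arguments are paths on HDFS, not local paths)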

delete --- delete a path:

>>> client.list("/")
['file', 'gyt', 'hbase', 'new_name', 'tmp']
>>> client.delete("/new_name")
True
>>> client.list("/")
['file', 'gyt', 'hbase', 'tmp']

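Parameters:

    delete(hdfs_path, recursive=False)

    recursive: delete a directory together with everything under it. With the default of False, trying to delete a non-empty directory raises an exception. The call returns True when something was deleted and False when nothing existed at the path.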
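Since delete reports its outcome through the return value, a thin wrapper can turn it into a readable result. A minimal sketch; remove_path is a hypothetical helper name.

```python
def remove_path(client, hdfs_path, recursive=False):
    """Delete hdfs_path and describe what happened."""
    deleted = client.delete(hdfs_path, recursive=recursive)
    # delete() returns False when nothing existed at the path.
    return "deleted" if deleted else "not found"
```

Calling remove_path(client, "/new_name", recursive=True) would also remove a directory that still has contents, so recursive is left off by default here.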

upload --- upload data:

>>> client.list("/")
[u'hbase', u'test']
>>> client.upload("/test","/opt/bigdata/hadoop/NOTICE.txt")
'/test/NOTICE.txt'
>>> client.list("/")
[u'hbase', u'test']
>>> client.list("/test")
[u'NOTICE.txt']

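Parameters:

    upload(hdfs_path, local_path, overwrite=False, n_threads=1, temp_dir=None,
           chunk_size=65536, progress=None, cleanup=True, **kwargs)

    overwrite: whether to overwrite a file that already exists

    n_threads: number of threads to start for the upload

    temp_dir: when overwrite=True and the remote file already exists, the data is first uploaded under this directory and swapped in once the upload has finished

    chunk_size: the interval, in bytes, at which the file is uploaded

    progress: a callback for tracking progress, called every chunk_size bytes. It is passed two arguments: the path of the file being uploaded and the number of bytes transferred. Once the file is done, -1 is passed as the second argument

    cleanup: if an error occurs while uploading, delete the files that were uploaded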
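A progress callback only needs to accept the two arguments described above. The class below is one possible sketch: ProgressTracker is a hypothetical name, and the lock is an assumption made in case callbacks arrive from several threads when n_threads is greater than 1.

```python
import threading


class ProgressTracker(object):
    """Collects the bytes transferred per file from progress callbacks."""

    def __init__(self):
        self._lock = threading.Lock()  # callbacks may arrive from several threads
        self.transferred = {}          # path -> last reported byte count
        self.finished = set()          # paths reported as complete

    def __call__(self, path, nbytes):
        with self._lock:
            if nbytes == -1:           # -1 signals that this file is done
                self.finished.add(path)
            else:
                self.transferred[path] = nbytes
```

It would be passed as client.upload("/test", "/opt/bigdata/hadoop/NOTICE.txt", progress=ProgressTracker()), keeping a reference to the tracker if the counts are needed afterwards.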

download --- download data:

>>> client.download("/test/NOTICE.txt","/home")
'/home/NOTICE.txt'
>>> import os
>>> os.system("ls /home")
lost+found  NOTICE.txt  thinkgamer
0

Parameters:

    download(hdfs_path, local_path, overwrite=False, n_threads=1, temp_dir=None, **kwargs)

    The parameters have the same meaning as for upload.
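Because overwrite defaults to False, it can be convenient to skip a download when the file is already present locally. A sketch under assumptions: download_once is a hypothetical helper, and local_dir is assumed to be an existing local directory, as /home is in the example above.

```python
import os
import posixpath


def download_once(client, hdfs_path, local_dir):
    """Download hdfs_path into local_dir unless a same-named file is already there."""
    target = os.path.join(local_dir, posixpath.basename(hdfs_path))
    if os.path.exists(target):
        return target  # skip the transfer instead of overwriting
    return client.download(hdfs_path, local_dir)
```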

read --- read a file:

>>> with client.read("/test/NOTICE.txt", encoding="utf-8") as reader:
...     print(reader.read())
...
This product includes software developed by The Apache Software
Foundation (https://www.apache.org/).
 
>>>

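Parameters:

    read(*args, **kwds), which accepts the following arguments:

    hdfs_path: the HDFS path

    offset: the byte position to start reading from

    length: the number of bytes to read

    buffer_size: size in bytes of the buffer used to transfer the data; the default is taken from the HDFS configuration

    encoding: the encoding used to decode the data

    chunk_size: if set to a positive number, the context manager returns a generator that yields chunk_size bytes at a time instead of a file-like object

    delimiter: if set, the context manager returns a generator that yields data each time the delimiter is encountered. This parameter requires encoding to be specified.

    progress: a callback for tracking progress, called every chunk_size bytes (not available if chunk_size is not specified). It is passed two arguments: the path of the file and the number of bytes transferred. It is called once with -1 as the second argument when the transfer is done.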
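The delimiter option allows a file to be consumed piece by piece rather than loaded in one read() call. A minimal sketch, where read_lines is a hypothetical helper name and UTF-8 is an assumed encoding for the file:

```python
def read_lines(client, hdfs_path, encoding="utf-8"):
    """Return the pieces of the file obtained by splitting on newlines."""
    # With a delimiter set, the reader is a generator of decoded pieces,
    # not a file-like object, so it is iterated instead of calling read().
    with client.read(hdfs_path, encoding=encoding, delimiter="\n") as reader:
        return [line for line in reader]
```

For large files it is usually better to process each piece inside the with block instead of collecting them all in a list.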

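While testing I ran into this error:

hdfs.util.HdfsError: Permission denied: user=dr.who, access=WRITE, inode="/test":root:supergroup:drwxr-xr-x

The fix I found was to add the following to the configuration file hdfs-site.xml, which turns off HDFS permission checking: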

<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

Then restart the cluster.
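An alternative that leaves permission checking on is to tell WebHDFS which user to act as. The error names dr.who because a plain Client sends no user name, so Hadoop falls back to its default static web user. The sketch below uses the library's InsecureClient, which adds a user.name parameter to each request. This is a different technique from the configuration change above, and the helper name connect_as and the choice of root (the owner of /test in the error message) are assumptions for illustration.

```python
def connect_as(user, url):
    """Return a client whose requests are made as `user` (no authentication)."""
    from hdfs import InsecureClient  # deferred import: requires `pip install hdfs`
    return InsecureClient(url, user=user)
```

With client = connect_as("root", "http://hdfsip:50070"), the upload to /test shown earlier should then be attempted as root rather than dr.who.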


This post is getting a bit long, so webhdfs will be covered in the next one.
