
Structured Data Representation in Python

Structured data

https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html

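Structured data -- a schema is defined over the data, e.g. a relational database.

Unstructured data -- free-form data with no constraints, e.g. newspaper articles.

Semi-structured data -- there is no global schema, but each record carries its own schema definition, e.g. a document database.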

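Python applications often need to define structured data in order to manage business data. This article summarizes several ways to hold structured data in Python.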

Structured data

Structured data sources define a schema on the data. With this extra bit of information about the underlying data, structured data sources provide efficient storage and performance. For example, columnar formats such as Parquet and ORC make it much easier to extract values from a subset of columns. Reading each record row by row first, then extracting the values from the specific columns of interest can read much more data than what is necessary when a query is only interested in a small fraction of the columns. A row-based storage format such as Avro efficiently serializes and stores data providing storage benefits. However, these advantages often come at the cost of flexibility. For example, because of rigidity in structure, evolving a schema can be challenging.
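To make the row-versus-column point concrete, here is a minimal pure-Python sketch. It is not how Parquet, ORC, or Avro work internally; the sample records and field names are made up for illustration. It only shows why reading one field is cheaper when values are grouped by column.

```python
# Hypothetical sample data; field names are illustrative only.
rows = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob", "age": 25},
    {"id": 3, "name": "carol", "age": 41},
]

# Row-oriented access: every full record is visited to get one field.
ages_from_rows = [record["age"] for record in rows]

# Column-oriented layout: one list per field, built once from the rows.
columns = {key: [record[key] for record in rows] for key in rows[0]}

# Reading a single column touches only that column's values.
ages_from_columns = columns["age"]

print(ages_from_rows)
print(ages_from_columns)
print(sum(ages_from_columns) / len(ages_from_columns))
```

Both layouts hold the same data. The difference is which access pattern they make cheap.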

Unstructured data

By contrast, unstructured data sources are generally free-form text or binary objects that contain no markup, or metadata (e.g., commas in CSV files), to define the organization of data. Newspaper articles, medical records, image blobs, application logs are often treated as unstructured data. These sorts of sources generally require context around the data to be parseable. That is, you need to know that the file is an image or is a newspaper article. Most sources of data are unstructured. The cost of having unstructured formats is that it becomes cumbersome to extract value out of these data sources as many transformations and feature extraction techniques are required to interpret these datasets.
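As a small illustration of the extra work unstructured sources need, the sketch below turns one application log line into a record with a regular expression. The log format and the pattern are assumptions invented for this example; a real log would need a pattern matched to its own layout.

```python
import re

# Hypothetical log line and pattern, invented for illustration.
log_line = "2021-03-04 12:00:01 ERROR payment failed for order 1234"

pattern = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)"
)

# The structure is imposed by the reader, not carried by the data itself.
match = pattern.match(log_line)
record = match.groupdict() if match else {}
print(record)
```

The parsing only works because the reader already knows what the line is supposed to look like, which is exactly the "context around the data" mentioned above.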

Semi-structured data

Semi-structured data sources are structured per record but don’t necessarily have a well-defined global schema spanning all records. As a result, each data record is augmented with its schema information. JSON and XML are popular examples. The benefits of semi-structured data formats are that they provide the most flexibility in expressing your data as each record is self-describing. These formats are very common across many applications as many lightweight parsers exist for dealing with these records, and they also have the benefit of being human readable. However, the main drawback for these formats is that they incur extra parsing overheads, and are not particularly built for ad-hoc querying.
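A short sketch of what "self-describing" means in practice, using the standard `json` module. The two records are hypothetical; the point is that they do not share the same set of fields.

```python
import json

# Hypothetical records; each JSON object names its own fields.
raw_records = [
    '{"id": 1, "name": "alice", "email": "alice@example.com"}',
    '{"id": 2, "name": "bob", "tags": ["admin", "ops"]}',
]

parsed = [json.loads(line) for line in raw_records]

for record in parsed:
    # The set of keys differs per record, so it must be inspected each time.
    print(sorted(record.keys()))

# Optional fields have to be handled explicitly by the reader.
emails = [record.get("email") for record in parsed]
print(emails)
```

Every record has to be parsed before any field can be read, which is the parsing overhead described above.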

Dict

https://docs.python.org/3/tutorial/datastructures.html#dictionaries

A dict has no schema definition; the developer has to enumerate the fields as needed wherever it is used.

>>> tel = {'jack': 4098, 'sape': 4139}
>>> tel['guido'] = 4127
>>> tel
{'jack': 4098, 'sape': 4139, 'guido': 4127}
>>> tel['jack']
4098
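The lack of a schema shows up quickly in practice. In the sketch below (the record and its field names are hypothetical), a misspelled key is accepted without complaint, and a missing key is only detected when it is read.

```python
# Hypothetical record; the field names are illustrative.
user = {"name": "jack", "phone": 4098}

# A misspelled key is silently accepted as a new field.
user["phnoe"] = 4139

print(sorted(user))

# Reading a missing key raises KeyError; dict.get returns a default instead.
try:
    user["email"]
except KeyError as exc:
    missing = exc.args[0]
print(missing)
print(user.get("email", "n/a"))
```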

namedtuple

https://medium.com/swlh/structures-in-python-ed199411b3e1

A named tuple gives each position in the tuple a name, so elements can be accessed by name as well as by index.

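from collections import namedtuple

# Basic definition: both fields are required
Point = namedtuple('Point', ['x', 'y'])

# Redefinition with defaults (Python 3.7+): both fields default to 0
Point = namedtuple('Point', ['x', 'y'], defaults=[0, 0])

# Positional and keyword arguments can be mixed
ntpt = Point(3, y=6)

# Access fields by name
ntpt.x + ntpt.y    # 9

# or by position, like a regular tuple
ntpt[0] + ntpt[1]  # 9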
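A few more properties of named tuples are worth seeing side by side: they are immutable, they can be copied with changes via `_replace`, and they convert to a dict via `_asdict`. The sketch below reuses the `Point` definition from above; the variable names are only illustrative.

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"], defaults=[0, 0])

p = Point(3, y=6)

# Fields are read-only: assignment raises AttributeError.
try:
    p.x = 10
    mutated = True
except AttributeError:
    mutated = False

# _replace returns a new tuple with selected fields changed.
moved = p._replace(x=10)

# _asdict converts the tuple to a mapping keyed by field name.
as_dict = p._asdict()

# Tuple unpacking still works, since a namedtuple is a tuple.
x, y = p

print(mutated, moved, dict(as_dict), Point())
```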

class

https://docs.python.org/3/tutorial/classes.html#class-objects

Use a class to manage compound data attributes.

>>> class Complex:
...     def __init__(self, realpart, imagpart):
...         self.r = realpart
...         self.i = imagpart
...
>>> x = Complex(3.0, -4.5)
>>> x.r, x.i
(3.0, -4.5)
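A plain class stores the attributes but gives nothing else for free. The sketch below reuses the `Complex` class from the tutorial to show the default behaviour for comparison and printing; the `label` attribute is a hypothetical addition.

```python
class Complex:
    def __init__(self, realpart, imagpart):
        self.r = realpart
        self.i = imagpart


a = Complex(3.0, -4.5)
b = Complex(3.0, -4.5)

# Without __eq__, equality falls back to identity, so a and b differ.
print(a == b)

# Without __repr__, the default repr shows only the type and address.
print(repr(a).startswith("<"))

# Attributes are not fixed: any name can be attached after construction.
a.label = "extra"
print(vars(a))
```

Writing `__repr__` and `__eq__` by hand for every data-holding class is repetitive, which is why generated versions of these methods are attractive.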

dataclass

https://www.geeksforgeeks.org/understanding-python-dataclasses/

dataclass builds on class and is aimed specifically at storing data: it generates initialization, printing, and comparison.

Data classes were added in Python 3.7 as a utility for storing data. The dataclasses module provides a decorator and functions for automatically adding generated special methods such as __init__(), __repr__() and __eq__() to user-defined classes.

# default field example
from dataclasses import dataclass, field


# A class for holding an employee's details
@dataclass
class employee:

    # Attributes Declaration
    # using Type Hints
    name: str
    emp_id: str
    age: int
    
    # default field set
    # city : str = "patna"
    city: str = field(default="patna")


emp = employee("Satyam", "ksatyam858", 21)
print(emp)
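The sketch below exercises the generated methods and shows one limitation: type hints on a dataclass are not checked at runtime. The class is renamed `Employee` here, and the `skills` field is a hypothetical addition used to show `default_factory`.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Employee:
    name: str
    emp_id: str
    age: int
    city: str = field(default="patna")
    # Mutable defaults need default_factory; `skills` is a hypothetical field.
    skills: List[str] = field(default_factory=list)


a = Employee("Satyam", "ksatyam858", 21)
b = Employee("Satyam", "ksatyam858", 21)

print(a)          # generated __repr__
print(a == b)     # generated __eq__ compares field by field
print(asdict(a))  # convert to a plain dict

# Type hints are not checked at runtime: this does not raise.
c = Employee("Satyam", "ksatyam858", "twenty-one")
print(type(c.age).__name__)
```

If the values must actually match the declared types, validation has to be added separately, for example in `__post_init__` or with a validation library.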

pydantic

https://pydantic-docs.helpmanual.io/

On top of defining a data schema, pydantic adds:

Data validation

Type error reporting at runtime

Data validation and settings management using python type annotations.

pydantic enforces type hints at runtime, and provides user friendly errors when data is invalid.

Define how data should be in pure, canonical python; validate it with pydantic.

from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel


class User(BaseModel):
    id: int
    name: str = 'John Doe'  # annotated: pydantic v2 rejects un-annotated fields
    signup_ts: Optional[datetime] = None
    friends: List[int] = []


external_data = {
    'id': '123',
    'signup_ts': '2019-06-01 12:22',
    'friends': [1, 2, '3'],
}
user = User(**external_data)
print(user.id)
#> 123
print(repr(user.signup_ts))
#> datetime.datetime(2019, 6, 1, 12, 22)
print(user.friends)
#> [1, 2, 3]
print(user.dict())  # pydantic v1 API; v2 names this user.model_dump()
"""
{
    'id': 123,
    'name': 'John Doe',
    'signup_ts': datetime.datetime(2019, 6, 1, 12, 22),
    'friends': [1, 2, 3],
}
"""
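Source: http://www.cnblogs.com/lightsong/ Copyright of this article is shared by the author and cnblogs (部落格園). Reposting is welcome, but unless the author agrees otherwise this notice must be retained and a link to the original article must be displayed prominently on the page.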