Elasticsearch: modeling your data (not flat) -- Parent-child relationship

阿新 • Published: 2019-02-01

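A parent-child relationship is similar in nature to nested objects: both associate one entity with another. The difference is that with nested objects all related entities live inside the same document, whereas with a parent-child relationship the parent and the children are completely separate documents. One entity can be associated with many related entities, which makes this a one-to-many relationship. Compared with nested objects, it has the following advantages:

First: the parent document can be updated on its own, without reindexing the children.

Second: child documents can be added, changed, or deleted individually without affecting the parent or the other children. This is especially useful when there are many children that change frequently.

Third: child documents can be returned on their own as search results.

Elasticsearch maintains a map of which parents are associated with which children. Thanks to this map, query-time joins are fast. It also imposes a limitation: a parent and all of its children must live on the same shard; the relationship cannot span shards.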

1：parent-child mapping

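To establish a parent-child relationship, you have to declare which type is the parent and which is the child. This must be done either when the index is created, or with the update-mapping API before the child type has been created.

Suppose we have data about a company that has branches in different cities, and every branch has its own employees. We want to search for branches, for individual employees, and for employees who work at a particular branch. The nested model does not fit this case. We could of course use application-side joins or data denormalization instead, but here we use parent-child to illustrate the approach.

The first thing we need to tell Elasticsearch is that the parent type of employee is branch. The mapping therefore looks like this: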

PUT /company
{"mappings":{"branch":{},"employee":{"_parent":{"type":"branch"}}}}

The mapping above declares that the parent type of employee is branch.

2：indexing parent and children

Indexing a parent is no different from indexing any other document; a parent does not need to know anything about its children:

POST /company/branch/_bulk
{"index":{"_id":"london"}}
{"name":"London Westminster","city":"London","country":"UK"}
{"index":{"_id":"liverpool"}}
{"name":"Liverpool Central","city":"Liverpool","country":"UK"}
{"index":{"_id":"paris"}}
{"name":"Champs Élysées","city":"Paris","country":"France"}

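When indexing a child, you must specify the ID of its parent in order to establish the parent-child link:

PUT /company/employee/1?parent=london
{"name":"Alice Smith","dob":"1970-10-24","hobby":"hiking"}

This says that the employee works at the London branch.

The parent ID serves two purposes: it creates the link between parent and child, and it ensures that the child is stored on the same shard as the parent.

Elasticsearch assigns a document to a shard using the following formula: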

shard = hash(routing) % number_of_primary_shards

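The routing value defaults to the document's _id.

When a parent ID is specified, it is used as the routing value instead of the default _id. In other words, parent and child share the same routing value, so they end up on the same shard.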
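The routing rule can be illustrated with a small Python sketch. It is only a model of the formula above: `zlib.crc32` is a deterministic stand-in for Elasticsearch's internal hash function, and the shard count of 5 and the helper names `routing_value` and `shard_for` are assumptions made for this example.

```python
import zlib
from typing import Optional

NUMBER_OF_PRIMARY_SHARDS = 5  # assumed value for this sketch


def routing_value(doc_id: str, parent: Optional[str] = None) -> str:
    """Return the routing value: the parent ID if given, else the document _id."""
    return parent if parent is not None else doc_id


def shard_for(routing: str, shards: int = NUMBER_OF_PRIMARY_SHARDS) -> int:
    """shard = hash(routing) % number_of_primary_shards.

    crc32 is a stand-in hash, not the function Elasticsearch really uses.
    """
    return zlib.crc32(routing.encode("utf-8")) % shards


# The branch "london" is routed by its own _id.
parent_shard = shard_for(routing_value("london"))
# Employee 1 is routed by its parent ID, not by its own _id.
child_shard = shard_for(routing_value("1", parent="london"))

print(parent_shard == child_shard)  # True
```

Because both documents hash the same routing string, they necessarily map to the same shard number, whatever hash function is used.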

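The parent ID must be specified on all single-document requests: when retrieving a child document with a GET request, and when indexing, deleting, or updating a child document. Unlike a search request, which is forwarded to all shards of an index, these requests are sent only to the shard that holds the document. If the parent ID is not specified, the request may be routed to the wrong shard. The parent ID must also be specified when using the bulk API: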

POST /company/employee/_bulk
{"index":{"_id":2,"parent":"london"}}
{"name":"Mark Thomas","dob":"1982-05-16","hobby":"diving"}
{"index":{"_id":3,"parent":"liverpool"}}
{"name":"Barry Smith","dob":"1979-04-01","hobby":"hiking"}
{"index":{"_id":4,"parent":"paris"}}
{"name":"Adrien Grand","dob":"1987-05-11","hobby":"horses"}
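A bulk body is newline-delimited JSON: one action line, then one source line, with a trailing newline at the end. The following Python sketch builds such a body for child documents using only the standard library. The function name `bulk_index_children` and the tuple layout are hypothetical, chosen for this example; sending the body to the cluster is left out.

```python
import json


def bulk_index_children(children):
    """Build a bulk request body for child documents.

    `children` is an iterable of (child_id, parent_id, source) tuples.
    Every action line carries the parent ID so the child is routed
    to the shard of its parent.
    """
    lines = []
    for child_id, parent_id, source in children:
        action = {"index": {"_id": child_id, "parent": parent_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk format requires a newline after the last line as well.
    return "\n".join(lines) + "\n"


body = bulk_index_children([
    (2, "london", {"name": "Mark Thomas", "dob": "1982-05-16", "hobby": "diving"}),
    (3, "liverpool", {"name": "Barry Smith", "dob": "1979-04-01", "hobby": "hiking"}),
])
print(body, end="")
```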

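Warning: if you want to change the parent of a child document (its parent ID), it is not enough to just update or reindex the child, because the new parent may live on a different shard. The correct approach is to delete the old child completely first, and then index the new child.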

3：finding parents by their children

The has_child query and filter find parent documents based on the contents of their children. For example, we can find all branches that have employees born in 1980 or later:

GET /company/branch/_search
{"query":{"has_child":{"type":"employee","query":{"range":{"dob":{"gte":"1980-01-01"}}}}}}

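Like a nested query, the has_child query can match several child documents, each with its own relevance score. The score_mode parameter controls how these separate scores are reduced to a single score for the parent document. The default is none (child scores are ignored and every parent gets a score of 1.0); the other options are avg, min, max, and sum.

The following query returns both london and liverpool, but london gets a higher score because Alice Smith is a better match for the query than Barry Smith:

GET /company/branch/_search
{"query":{"has_child":{"type":"employee","score_mode":"max","query":{"match":{"name":"Alice Smith"}}}}}

Tip: the default score_mode of none is faster than the other modes, because Elasticsearch does not have to calculate a score for each child document; every parent simply gets a score of 1.0.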
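How score_mode reduces several child scores to one parent score can be sketched in a few lines of Python. This is an illustration of the reduction only, not Elasticsearch code; the function name `parent_score` and the child scores are made up for the example.

```python
def parent_score(child_scores, score_mode="none"):
    """Reduce the scores of the matching children to one parent score."""
    if not child_scores:
        raise ValueError("a parent only matches if at least one child matches")
    if score_mode == "none":
        return 1.0  # child scores are ignored entirely
    if score_mode == "avg":
        return sum(child_scores) / len(child_scores)
    if score_mode == "min":
        return min(child_scores)
    if score_mode == "max":
        return max(child_scores)
    if score_mode == "sum":
        return sum(child_scores)
    raise ValueError(f"unknown score_mode: {score_mode}")


scores = [0.5, 2.0]  # made-up scores of two matching children
for mode in ("none", "avg", "min", "max", "sum"):
    print(mode, parent_score(scores, mode))
```

With none the function never looks at the individual scores, which mirrors why that mode is the cheapest.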

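The has_child query and filter also accept two parameters, min_children and max_children. A parent document is returned only if the number of its matching children falls within that range.

The following query returns only branches that have at least two employees: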

GET /company/branch/_search
{"query":{"has_child":{"type":"employee","min_children":2,"query":{"match_all":{}}}}}

A has_child query with min_children or max_children performs about the same as a has_child query with scoring enabled.
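The effect of min_children and max_children can be modeled in memory with the sample employees from this article. The sketch below is plain Python, not an Elasticsearch client call; the function name `has_child` and the `matches` callback are assumptions made for illustration.

```python
from collections import Counter

# Sample child documents; "parent" holds the ID of the branch.
EMPLOYEES = [
    {"name": "Alice Smith", "parent": "london"},
    {"name": "Mark Thomas", "parent": "london"},
    {"name": "Barry Smith", "parent": "liverpool"},
    {"name": "Adrien Grand", "parent": "paris"},
]


def has_child(children, matches, min_children=1, max_children=None):
    """Return the sorted IDs of parents whose number of matching
    children lies between min_children and max_children."""
    counts = Counter(c["parent"] for c in children if matches(c))
    return sorted(
        parent
        for parent, n in counts.items()
        if n >= min_children and (max_children is None or n <= max_children)
    )


# Equivalent of match_all with min_children=2.
print(has_child(EMPLOYEES, lambda c: True, min_children=2))  # ['london']
```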

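The has_child filter works almost the same way as the query, except that it does not support the score_mode parameter.

4：finding children by their parents

A nested query can only ever return the root document as a result, whereas parent and child documents are independent, so each can be queried on its own. The has_child query returns parents based on their children; the has_parent query returns children based on their parents.

The following query returns the employees who work in the UK: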

GET /company/employee/_search
{"query":{"has_parent":{"type":"branch","query":{"match":{"country":"UK"}}}}}

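The has_parent query also supports score_mode, but with only two values: none and score. Because each child can have only one parent, there is no need to reduce multiple scores into one, so the choice is simply whether to use the parent's score (score) or not (none, the default).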
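The direction of a has_parent lookup can be shown with the sample data from this article: first select the matching parents, then return the children that point at them. This Python sketch is an in-memory model only; the function name `has_parent` and the predicate argument are hypothetical.

```python
# Parent documents keyed by _id.
BRANCHES = {
    "london": {"name": "London Westminster", "country": "UK"},
    "liverpool": {"name": "Liverpool Central", "country": "UK"},
    "paris": {"name": "Champs Élysées", "country": "France"},
}

# Child documents; "parent" holds the ID of the branch.
EMPLOYEES = [
    {"name": "Alice Smith", "parent": "london"},
    {"name": "Mark Thomas", "parent": "london"},
    {"name": "Barry Smith", "parent": "liverpool"},
    {"name": "Adrien Grand", "parent": "paris"},
]


def has_parent(parents, children, parent_matches):
    """Return the names of children whose single parent matches."""
    matching_ids = {pid for pid, doc in parents.items() if parent_matches(doc)}
    return [c["name"] for c in children if c["parent"] in matching_ids]


uk_employees = has_parent(BRANCHES, EMPLOYEES, lambda b: b["country"] == "UK")
print(uk_employees)  # ['Alice Smith', 'Mark Thomas', 'Barry Smith']
```

Each child contributes exactly one parent lookup, which is why no score reduction is needed in this direction.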

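The has_parent filter works the same way as the query, except that it does not support score_mode.

5：children aggregation

Parent-child supports the children aggregation, but there is no parent aggregation that would be the counterpart of reverse_nested.

The following request determines the favorite hobbies of employees, grouped by country: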

GET /company/branch/_search?search_type=count
{"aggs":{"country":{"terms":{"field":"country"},"aggs":{"employees":{"children":{"type":"employee"},"aggs":{"hobby":{"terms":{"field":"employee.hobby"}}}}}}}}

(1): bucket the branch documents by the country field

(2): the children aggregation joins the parent documents with their children of type employee

(3): bucket the employees by the employee.hobby field
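The three steps can be mimicked in Python on the sample data used in this article. The sketch only models what the buckets contain; it is not how Elasticsearch executes the aggregation, and the function name `hobbies_by_country` is made up for the example.

```python
from collections import Counter, defaultdict

BRANCHES = {
    "london": {"country": "UK"},
    "liverpool": {"country": "UK"},
    "paris": {"country": "France"},
}

EMPLOYEES = [
    {"name": "Alice Smith", "parent": "london", "hobby": "hiking"},
    {"name": "Mark Thomas", "parent": "london", "hobby": "diving"},
    {"name": "Barry Smith", "parent": "liverpool", "hobby": "hiking"},
    {"name": "Adrien Grand", "parent": "paris", "hobby": "horses"},
]


def hobbies_by_country(branches, employees):
    """Bucket by the parent's country, join to the children,
    then bucket the children by hobby."""
    buckets = defaultdict(Counter)
    for employee in employees:
        country = branches[employee["parent"]]["country"]  # parent-child join
        buckets[country][employee["hobby"]] += 1
    return {country: dict(hobbies) for country, hobbies in buckets.items()}


result = hobbies_by_country(BRANCHES, EMPLOYEES)
print(result["UK"])      # {'hiking': 2, 'diving': 1}
print(result["France"])  # {'horses': 1}
```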

6：grandparents and grandchildren

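The parent-child relationship can span more than one generation, for example grandparents and grandchildren. However, extra care is needed to make sure that all generations end up on the same shard.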

mapping：

PUT /company
{"mappings":{"country":{},"branch":{"_parent":{"type":"country"}},"employee":{"_parent":{"type":"branch"}}}}

indexing data：

POST /company/country/_bulk
{"index":{"_id":"uk"}}
{"name":"UK"}
{"index":{"_id":"france"}}
{"name":"France"}

POST /company/branch/_bulk
{"index":{"_id":"london","parent":"uk"}}
{"name":"London Westminster"}
{"index":{"_id":"liverpool","parent":"uk"}}
{"name":"Liverpool Central"}
{"index":{"_id":"paris","parent":"france"}}
{"name":"Champs Élysées"}

Because its parent is uk, the london branch is routed to the same shard as its parent country.

PUT /company/employee/1?parent=london
{"name":"Alice Smith","dob":"1970-10-24","hobby":"hiking"}

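Now a problem appears: the employee document is routed by its parent ID, london, while the london branch itself was routed by uk, so the employee may well end up on a different shard from its parent and grandparent!

So we need to pass an extra routing parameter to make sure the employee lands on the same shard as its parent and grandparent:

PUT /company/employee/1?parent=london&routing=uk
{"name":"Alice Smith","dob":"1970-10-24","hobby":"hiking"}

The parent parameter still links the employee to its branch, but the routing parameter overrides the parent ID as the routing value.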
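The precedence between the routing parameter, the parent parameter, and the document _id can be written down as a short Python sketch. As before, `zlib.crc32` is only a stand-in for the real hash function, and the shard count and the helper names `effective_routing` and `shard_for` are assumptions made for this example.

```python
import zlib
from typing import Optional

NUMBER_OF_PRIMARY_SHARDS = 5  # assumed value for this sketch


def effective_routing(doc_id: str,
                      parent: Optional[str] = None,
                      routing: Optional[str] = None) -> str:
    """An explicit routing value wins over the parent ID,
    which in turn wins over the document _id."""
    if routing is not None:
        return routing
    if parent is not None:
        return parent
    return doc_id


def shard_for(routing_key: str, shards: int = NUMBER_OF_PRIMARY_SHARDS) -> int:
    # crc32 is a stand-in hash, not the one Elasticsearch really uses.
    return zlib.crc32(routing_key.encode("utf-8")) % shards


country_shard = shard_for(effective_routing("uk"))
branch_shard = shard_for(effective_routing("london", parent="uk"))

# Without routing=uk the employee would be routed by "london".
default_key = effective_routing("1", parent="london")
# With routing=uk all three generations share the routing value "uk".
employee_shard = shard_for(effective_routing("1", parent="london", routing="uk"))

print(default_key)                                     # london
print(country_shard == branch_shard == employee_shard)  # True
```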

Queries work as usual. For example, to find the countries where employees enjoy hiking, we need to join country with branch, and branch with employee:

GET /company/country/_search
{"query":{"has_child":{"type":"branch","query":{"has_child":{"type":"employee","query":{"match":{"hobby":"hiking"}}}}}}}
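A multi-generation query is just has_child clauses nested one inside the other, so the request body can be generated from a list of types. The Python sketch below builds that body with the standard library only; the helper name `nested_has_child` is hypothetical and no request is sent.

```python
import json


def nested_has_child(child_types, leaf_query):
    """Wrap leaf_query in one has_child clause per generation.

    child_types is ordered from the outermost child type to the
    innermost, e.g. ["branch", "employee"] for a search on countries.
    """
    query = leaf_query
    for child_type in reversed(child_types):
        query = {"has_child": {"type": child_type, "query": query}}
    return {"query": query}


body = nested_has_child(["branch", "employee"], {"match": {"hobby": "hiking"}})
print(json.dumps(body))
```

Each additional entry in the list adds one more join, which is worth keeping in mind given the cost of parent-child joins.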

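7：practical considerations

Parent-child joins are a useful way to manage relationships when index-time performance matters more than search-time performance, but they come at a significant cost: a parent-child query can be 5 to 10 times slower than the equivalent nested query.

memory use：

At the time of writing, the parent-child ID map is still held in memory. There are plans to move the map to doc values, which would save a lot of memory, but that work is not yet finished. Until then, keep the following in mind:

The string _id of every parent document has to be held in memory, and every child document requires 8 bytes of memory (less in practice when the values are compressed).

You can check how much memory the parent-child cache is using with the indices-stats API (for an index-level summary) or the node-stats API (for a node-level summary):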

GET /_nodes/stats/indices/id_cache?human

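This returns the memory used by the ID cache on each node, in a human-readable format (human).

global ordinals and latency：

Parent-child uses global ordinals to speed up joins. Regardless of whether the parent-child map uses an in-memory cache or on-disk doc values, global ordinals still need to be rebuilt after any change to the index.

The more parent documents there are on a shard, the longer global ordinals take to build. Parent-child is best suited to situations where each parent has many children, rather than many parents with few children each.

Global ordinals are built lazily by default: the first parent-child query or aggregation after a refresh triggers the build, which can introduce a significant latency spike. You can use eager_global_ordinals to shift the cost of building global ordinals from query time to refresh time: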

PUT /company
{"mappings":{"branch":{},"employee":{"_parent":{"type":"branch","fielddata":{"loading":"eager_global_ordinals"}}}}}

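Global ordinals for the _parent field are then built before a new segment becomes visible to search.

With many parents, global ordinals may take a long time to build. In that case you can increase the refresh_interval so that refreshes happen less often and global ordinals remain valid for longer. This also reduces the CPU cost of rebuilding global ordinals every second.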

multi-generations and concluding thoughts：

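Joining multiple generations looks attractive, but keep the following costs in mind:

The more joins there are, the worse the performance.

Each generation of parents needs to have its string _id values held in memory, which can consume a lot of RAM.

Consider the relationships in your data and whether parent-child fits. If you do use it, consider the following advice:

Make sure there are few parents and many children.

Avoid multiple parent-child joins in a single query.

Avoid scoring by setting score_mode to none.

Keep parent IDs short to reduce memory usage.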
