【ElasticSearch】(6) A Brief Look at Scroll
【Background】
A typical DSL for fetching all the data in an index looks like this:
POST /fcar_city/city/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_all": { }
        }
      ]
    }
  }
}
My intent was to pull every document in the index. The query above returned the following:
{
  "_shards": { "total": 5, "failed": 0, "successful": 5 },
  "hits": {
    "hits": [
      { "_index": "fcar_city", "_type": "city", "_source": { "t_b_city|administrative_name": "揚州", "t_b_city|create_emp": "1", "t_b_city|create_time": "2016-06-28 11:59:58", "t_b_city|id": "60", "t_b_city|modify_time": "2016-06-28 11:59:58", "t_b_city|operate_range": "1", "t_b_city|channel_status": "2", "t_b_city|is_business": "1", "t_b_city|modify_emp": "1", "t_b_city|name": "揚州", "t_b_city|en_name": "yz" }, "_id": "60", "_score": 1 },
      { "_index": "fcar_city", "_type": "city", "_source": { "t_b_city|administrative_name": "通化", "t_b_city|create_emp": "1", "t_b_city|create_time": "2016-06-28 11:59:58", "t_b_city|id": "44", "t_b_city|modify_time": "2016-06-28 11:59:58", "t_b_city|operate_range": "1", "t_b_city|channel_status": "2", "t_b_city|is_business": "1", "t_b_city|modify_emp": "1", "t_b_city|name": "通化", "t_b_city|en_name": "th" }, "_id": "44", "_score": 1 },
      { "_index": "fcar_city", "_type": "city", "_source": { "t_b_city|create_emp": "1", "t_b_city|create_time": "2016-06-28 11:59:58", "t_b_city|modify_time": "2016-10-09 08:40:00", "t_b_city|center_lat": "28.656386", "t_b_city|is_business": "1", "t_b_city|modify_emp": "253", "t_b_city|name": "台州", "t_b_city|en_name": "tz", "t_b_city|administrative_name": "台州", "t_b_city|id": "48", "t_b_city|operate_range": "2", "t_b_city|channel_status": "2", "t_b_city|status": "2", "t_b_city|center_lon": "121.420757" }, "_id": "48", "_score": 1 },
      { "_index": "fcar_city", "_type": "city", "_source": { "t_b_city|administrative_name": "咸陽", "t_b_city|create_emp": "1", "t_b_city|create_time": "2016-06-28 11:59:58", "t_b_city|id": "52", "t_b_city|modify_time": "2016-06-28 11:59:58", "t_b_city|operate_range": "1", "t_b_city|channel_status": "2", "t_b_city|is_business": "1", "t_b_city|modify_emp": "1", "t_b_city|name": "咸陽", "t_b_city|en_name": "xiy" }, "_id": "52", "_score": 1 },
      { "_index": "fcar_city", "_type": "city", "_source": { "t_b_city|administrative_name": "煙臺", "t_b_city|create_emp": "1", "t_b_city|create_time": "2016-06-28 11:59:58", "t_b_city|id": "29", "t_b_city|modify_time": "2016-06-28 11:59:58", "t_b_city|operate_range": "1", "t_b_city|channel_status": "2", "t_b_city|is_business": "1", "t_b_city|modify_emp": "1", "t_b_city|name": "煙臺", "t_b_city|en_name": "yt" }, "_id": "29", "_score": 1 },
      { "_index": "fcar_city", "_type": "city", "_source": { "t_b_city|administrative_name": "晉城", "t_b_city|create_emp": "1", "t_b_city|create_time": "2016-06-28 11:59:58", "t_b_city|id": "40", "t_b_city|modify_time": "2016-06-28 11:59:58", "t_b_city|operate_range": "1", "t_b_city|channel_status": "2", "t_b_city|is_business": "1", "t_b_city|modify_emp": "1", "t_b_city|name": "晉城", "t_b_city|en_name": "jc" }, "_id": "40", "_score": 1 },
      { "_index": "fcar_city", "_type": "city", "_source": { "t_b_city|administrative_name": "聊城", "t_b_city|create_emp": "1", "t_b_city|create_time": "2016-06-28 11:59:58", "t_b_city|id": "41", "t_b_city|modify_time": "2016-06-28 11:59:58", "t_b_city|operate_range": "1", "t_b_city|channel_status": "2", "t_b_city|is_business": "1", "t_b_city|modify_emp": "1", "t_b_city|name": "聊城", "t_b_city|en_name": "lc" }, "_id": "41", "_score": 1 },
      { "_index": "fcar_city", "_type": "city", "_source": { "t_b_city|administrative_name": "柳州", "t_b_city|create_emp": "1", "t_b_city|create_time": "2016-06-28 11:59:58", "t_b_city|id": "22", "t_b_city|modify_time": "2016-06-28 11:59:58", "t_b_city|operate_range": "1", "t_b_city|channel_status": "2", "t_b_city|is_business": "1", "t_b_city|modify_emp": "1", "t_b_city|name": "柳州", "t_b_city|en_name": "lz" }, "_id": "22", "_score": 1 },
      { "_index": "fcar_city", "_type": "city", "_source": { "t_b_city|administrative_name": "萍鄉", "t_b_city|create_emp": "1", "t_b_city|create_time": "2016-06-28 11:59:58", "t_b_city|id": "24", "t_b_city|modify_time": "2016-06-28 11:59:58", "t_b_city|operate_range": "1", "t_b_city|channel_status": "2", "t_b_city|is_business": "1", "t_b_city|modify_emp": "1", "t_b_city|name": "萍鄉", "t_b_city|en_name": "px" }, "_id": "24", "_score": 1 },
      { "_index": "fcar_city", "_type": "city", "_source": { "t_b_city|administrative_name": "隨州", "t_b_city|create_emp": "1", "t_b_city|create_time": "2016-06-28 11:59:58", "t_b_city|id": "25", "t_b_city|modify_time": "2016-06-28 11:59:58", "t_b_city|operate_range": "1", "t_b_city|channel_status": "2", "t_b_city|is_business": "1", "t_b_city|modify_emp": "1", "t_b_city|name": "隨州", "t_b_city|en_name": "sz" }, "_id": "25", "_score": 1 }
    ],
    "total": 152,
    "max_score": 1
  },
  "took": 3,
  "timed_out": false
}
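It is easy to see that total reports 152 documents, yet only 10 came back, because a search returns 10 hits by default. That is the problem I ran into a few days ago.
Following on from my previous post, I tried rewriting the DSL with from and size: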
POST /fcar_city/city/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_all": { }
        }
      ]
    }
  },
  "from": 0,
  "size": 1000
}
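With this, all 152 records are returned. But as I said in my previous post, paging with from and size can be expensive, and once from + size goes past index.max_result_window (10000 by default) it fails with “Result window is too large”. After some digging on the official site, scroll came to my attention.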
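To make the cost of deep from/size paging concrete, here is a rough Python sketch of the bookkeeping. It assumes the usual behavior where every shard has to produce its own top from + size entries and the coordinating node merges all of them, and it assumes index.max_result_window has been left at its default of 10000. The function names are made up for illustration and are not part of any Elasticsearch client.

```python
# Assumption: index.max_result_window is left at its default value.
DEFAULT_MAX_RESULT_WINDOW = 10_000


def entries_to_merge(from_: int, size: int, shards: int) -> int:
    """Entries the coordinating node has to merge for one from/size page.

    Each shard returns its own top (from_ + size) entries, so the work
    grows with the page offset, not just with the page size.
    """
    return shards * (from_ + size)


def exceeds_result_window(
    from_: int, size: int, max_result_window: int = DEFAULT_MAX_RESULT_WINDOW
) -> bool:
    """True when a request would be rejected as "Result window is too large"."""
    return from_ + size > max_result_window


# The query above: from=0, size=1000 on a 5-shard index.
print(entries_to_merge(0, 1000, 5))       # 5000
# A deep page: only 10 hits wanted, but 50050 entries are merged.
print(entries_to_merge(10_000, 10, 5))    # 50050
print(exceeds_result_window(9_995, 10))   # True
```

The second call is the interesting one: the page is tiny, yet the merge work is driven by the offset.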
【Scroll】
The first sentence of the official Elasticsearch introduction to scroll reads:
A scroll query is used to retrieve large numbers of documents from Elasticsearch efficiently, without paying the penalty of deep pagination.
In other words, scroll is meant for retrieving large amounts of data, without the problems that deep pagination brings.
The basic form is:
GET /old_index/_search?scroll=1m
{
  "query": { "match_all": {}},
  "sort" : ["_doc"],
  "size": 1000
}
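Two things to note:
(1) scroll=1m means the scroll is kept open for 1 minute;
(2) “_doc” is the most efficient sort order.
Once “scroll” is added to “_search”, even a very large “size” did not trigger “Result window is too large” in my own tests, and the excessive CPU usage did not show up either. The reason lies in how scroll works, and the key is in these two passages: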
Scrolling allows us to do an initial search and to keep pulling batches of results from Elasticsearch until there are no more results left. It’s a bit like a cursor in a traditional database.
A scrolled search takes a snapshot in time. It doesn’t see any changes that are made to the index after the initial search request has been made. It does this by keeping the old data files around, so that it can preserve its “view” on what the index looked like at the time it started.
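So what scroll queries is precisely a “snapshot” taken at one moment, much like a view. Scenarios with very strict real-time requirements are therefore a poor fit for scroll; for ordinary list queries, from and size are fine. For fetching all the rows of a “dictionary table” (reference data), scroll is well worth using.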
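The snapshot behavior can be illustrated with a toy in-memory model in Python. This is only an analogy: per the passage quoted above, Elasticsearch keeps the old data files around rather than copying documents, and the SnapshotCursor class here is a hypothetical name invented for the sketch.

```python
class SnapshotCursor:
    """Toy cursor that iterates over a frozen copy of the data."""

    def __init__(self, docs, batch_size):
        self._frozen = list(docs)  # copy taken when the "search" starts
        self._batch_size = batch_size
        self._offset = 0

    def next_batch(self):
        """Return the next batch, or an empty list when nothing is left."""
        batch = self._frozen[self._offset:self._offset + self._batch_size]
        self._offset += len(batch)
        return batch


live_index = ["doc-1", "doc-2", "doc-3"]
cursor = SnapshotCursor(live_index, batch_size=2)
live_index.append("doc-4")  # written after the scroll started

seen = []
while True:
    batch = cursor.next_batch()
    if not batch:
        break
    seen.extend(batch)

print(seen)        # ['doc-1', 'doc-2', 'doc-3']
print(live_index)  # ['doc-1', 'doc-2', 'doc-3', 'doc-4']
```

The cursor never returns doc-4, which is exactly why scroll is a poor fit when readers must see writes immediately.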
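To scroll through the results, we run a search request and set the scroll value to how long we want the scroll window kept open. The expiry time is refreshed every time we run a scroll request, so it only needs to be long enough to process the current batch of results, not all of the documents matching the query. The timeout matters because keeping the scroll window open consumes resources, and we want them released as soon as they are no longer needed. Setting a timeout lets Elasticsearch release those resources automatically after a period of inactivity.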
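The loop described above can be sketched in Python as follows. To keep it runnable without a cluster, the three network calls are passed in as plain functions; the names scroll_all, start, next_page and clear, and the FakeCluster stand-in, are assumptions made for this sketch. Against a real cluster in recent Elasticsearch versions, start would be the initial _search?scroll=... request, next_page a request to _search/scroll carrying the scroll and scroll_id values, and clear a DELETE on _search/scroll.

```python
def scroll_all(start, next_page, clear, keep_alive="1m"):
    """Yield every hit of a scrolled search, one batch at a time.

    keep_alive is re-sent on every request, so it only has to cover the
    time needed to process a single batch.
    """
    page = start(keep_alive)
    scroll_id = page["_scroll_id"]
    try:
        while page["hits"]["hits"]:
            yield from page["hits"]["hits"]
            page = next_page(scroll_id, keep_alive)
            # The id may change between pages, so always use the latest one.
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Release server-side resources right away instead of waiting
        # for the keep-alive to expire.
        clear(scroll_id)


class FakeCluster:
    """In-memory stand-in for the three HTTP calls (illustration only)."""

    def __init__(self, docs, batch_size):
        self.docs = docs
        self.batch_size = batch_size
        self.offset = 0
        self.cleared = []

    def _page(self):
        hits = self.docs[self.offset:self.offset + self.batch_size]
        self.offset += len(hits)
        return {"_scroll_id": "fake-scroll-id", "hits": {"hits": hits}}

    def start(self, keep_alive):
        return self._page()

    def next_page(self, scroll_id, keep_alive):
        return self._page()

    def clear(self, scroll_id):
        self.cleared.append(scroll_id)


cluster = FakeCluster([{"_id": str(i)} for i in range(152)], batch_size=50)
hits = list(scroll_all(cluster.start, cluster.next_page, cluster.clear))
print(len(hits))        # 152
print(cluster.cleared)  # ['fake-scroll-id']
```

Clearing the scroll in a finally block means the resources are released even if the caller stops consuming results early or an error interrupts the loop.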
So, that's all. In a later post I will share a Java wrapper around scroll.