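How do you efficiently iterate over a very large MongoDB collection?
- GitHub repository: Fundebug/loop-mongodb-big-collection
This article uses Node.js, with mongoose as the MongoDB module. The approach it describes, however, applies equally to other languages and their MongoDB drivers.
The wrong way: find()
When iterating over a MongoDB collection, we might be tempted to write something like this: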
const Promise = require("bluebird");

// Member is assumed to be a mongoose model defined elsewhere.
function findAllMembers() {
    // Resolves to an array holding every document in the collection.
    return Member.find();
}

async function test() {
    const members = await findAllMembers();
    let N = 0;
    await Promise.mapSeries(members, member => {
        N++;
        console.log(`name of the ${N}th member: ${member.name}`);
    });
    console.log(`loop all ${N} members success`);
}

test();
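Note that we use Bluebird's mapSeries rather than map, so the elements of the members array are processed one at a time. Is that enough?
When the Member collection holds only a few documents, say 1,000, this works fine. But what happens when it holds 10 million documents? This: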
<--- Last few GCs --->
rt of marking 1770 ms) (average mu = 0.168, current mu = 0.025) finalize
[5887:0x43127d0] 33672 ms: Mark-sweep 1398.3 (1425.2) -> 1398.0 (1425.7) MB, 1772.0 / 0.0 ms (+ 0.1 ms in 12 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 1775 ms) (average mu = 0.088, current mu = 0.002) finalize
[5887:0x43127d0] 35172 ms: Mark-sweep 1398.5 (1425.7) -> 1398.4 (1428.7) MB, 1496.7 / 0.0 ms (average mu = 0.049, current mu = 0.002) allocation failure scavenge might not succeed
<--- JS stacktrace --->
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
1: 0x8c02c0 node::Abort() [node]
2: 0x8c030c [node]
3: 0xad15de v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
4: 0xad1814 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
5: 0xebe752 [node]
6: 0xebe858 v8::internal::Heap::CheckIneffectiveMarkCompact(unsigned long, double) [node]
7: 0xeca982 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [node]
8: 0xecb2b4 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
9: 0xecba8a v8::internal::Heap::FinalizeIncrementalMarkingIfComplete(v8::internal::GarbageCollectionReason) [node]
10: 0xecf1b7 v8::internal::IncrementalMarkingJob::Task::RunInternal() [node]
11: 0xbc1796 v8::internal::CancelableTask::Run() [node]
12: 0x935018 node::PerIsolatePlatformData::FlushForegroundTasksInternal() [node]
13: 0x9fccff [node]
14: 0xa0dbd8 [node]
15: 0x9fd63b uv_run [node]
16: 0x8ca6c5 node::Start(v8::Isolate*, node::IsolateData*, int, char const* const*, int, char const* const*) [node]
17: 0x8c945f node::Start(int, char**) [node]
18: 0x7f84b6263f45 __libc_start_main [/lib/x86_64-linux-gnu/libc.so.6]
19: 0x885c55 [node]
Aborted (core dumped)
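In other words, the process ran out of memory.
Printing the members array returned by find() shows that every document in the collection comes back at once. What array could possibly hold 10 million objects?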
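Processing the array sequentially does not help, because the array is fully built before the first element is handled. The sketch below is a rough native equivalent of what mapSeries does, written without Bluebird purely for illustration; the `mapSeries` function here is a hypothetical re-implementation, not Bluebird's actual source.

```javascript
// Rough native equivalent of Bluebird's Promise.mapSeries (illustrative only).
// Each handler call is awaited before the next item is touched.
async function mapSeries(items, handler) {
    const results = [];
    for (let i = 0; i < items.length; i++) {
        results.push(await handler(items[i], i));
    }
    return results;
}

async function demo() {
    // The whole array already exists before the first handler runs,
    // which mirrors the situation after `await Member.find()`.
    const members = [{ name: "a" }, { name: "b" }, { name: "c" }];
    const names = await mapSeries(members, async member => member.name.toUpperCase());
    console.log(names);
    return names;
}

demo();
```

Sequential handling limits how much work is in flight, but it cannot reduce the memory already taken by the input array.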
The right way: find().cursor() with eachAsync()
Returning an entire collection from find() should be avoided. The right approach looks like this:
function findAllMembersCursor() {
    return Member.find().cursor();
}

async function test() {
    const membersCursor = await findAllMembersCursor();
    let N = 0;
    await membersCursor.eachAsync(member => {
        N++;
        console.log(`name of the ${N}th member: ${member.name}`);
    });
    console.log(`loop all ${N} members success`);
}

test();
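The cursor() method returns a QueryCursor, and eachAsync() then iterates over the whole collection without any risk of running out of memory.
What is a QueryCursor? Here is what the mongoose documentation says: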
A QueryCursor is a concurrency primitive for processing query results one document at a time. A QueryCursor fulfills the Node.js streams3 API, in addition to several other mechanisms for loading documents from MongoDB one at a time.
In short, a QueryCursor hands documents to the application one at a time instead of materializing the whole result set, which greatly reduces memory use.
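To make the "one document at a time" idea concrete, here is a minimal sketch that needs no database. `fakeCursor` is a hypothetical stand-in for a cursor, written as an async generator, and `eachDocument` is an illustrative helper, not a mongoose API. Recent mongoose versions document QueryCursor as async-iterable, so under that assumption the same `for await` loop should also accept `Member.find().cursor()`.

```javascript
// Hypothetical stand-in for a database cursor: yields documents lazily,
// so only the current one needs to be held in memory.
async function* fakeCursor(total) {
    for (let i = 1; i <= total; i++) {
        yield { name: `member-${i}` };
    }
}

// Illustrative helper (not a mongoose API): visit each document in turn,
// waiting for the handler to finish before pulling the next one.
async function eachDocument(cursor, handler) {
    let count = 0;
    for await (const doc of cursor) {
        count++;
        await handler(doc, count);
    }
    return count;
}

async function demo() {
    const n = await eachDocument(fakeCursor(5), (doc, i) => {
        console.log(`name of the ${i}th member: ${doc.name}`);
    });
    console.log(`loop all ${n} members success`);
}

demo();
```

Because the generator produces each object only when the loop asks for it, memory use stays roughly constant no matter how large `total` is.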
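How to test it?
The idea in this post is simple, but it is also easy to overlook. Running the test yourself will make it stick.
The test code is short; see Fundebug/loop-mongodb-big-collection.
My test environment: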
- ubuntu 14.04
- mongodb 3.2
- nodejs 10.9.0
1. Run MongoDB with Docker
sudo docker run --net=host -d --name mongodb daocloud.io/library/mongo:3.2
2. Generate test data with mgodatagen
With mgodatagen, 10 million documents can be generated in a little over a minute.
Download mgodatagen: https://github.com/feliixx/mgodatagen/releases/download/0.7.3/mgodatagen_linux_x86_64.tar.gz
After extracting the archive, move the binary into /usr/local/bin:
sudo mv mgodatagen /usr/local/bin
The mgodatagen configuration file, mgodatagen-config.json, looks like this:
[
    {
        "database": "test",
        "collection": "members",
        "count": 10000000,
        "content": {
            "name": {
                "type": "string",
                "minLength": 2,
                "maxLength": 8
            },
            "city": {
                "type": "string",
                "minLength": 2,
                "maxLength": 8
            },
            "country": {
                "type": "string",
                "minLength": 2,
                "maxLength": 8
            },
            "company": {
                "type": "string",
                "minLength": 2,
                "maxLength": 8
            },
            "email": {
                "type": "string",
                "minLength": 2,
                "maxLength": 8
            }
        }
    }
]
Run mgodatagen -f mgodatagen-config.json to generate the 10 million test documents:
mgodatagen -f mgodatagen-config.json
Connecting to mongodb://127.0.0.1:27017
MongoDB server version 3.2.13
collection members: done [====================================================================] 100%
+------------+----------+-----------------+----------------+
| COLLECTION | COUNT | AVG OBJECT SIZE | INDEXES |
+------------+----------+-----------------+----------------+
| members | 10000000 | 108 | _id_ 95368 kB |
+------------+----------+-----------------+----------------+
run finished in 1m12.82s
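Checking MongoDB shows that the generated data takes up only 0.69 GB, which is quite small. Iterating over it with find() still fails, because every document has to sit in the Node.js heap at the same time.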
show dbs
local 0.000GB
test 0.690GB
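A rough back-of-the-envelope check helps explain why such a small dataset still exhausts the heap. The sketch below multiplies the document count by the average object size reported in the mgodatagen output and prints it next to the V8 heap limit of the current process. `estimateRawBytes` is a hypothetical helper name; the estimate covers only the raw data and ignores the per-object overhead that JavaScript and mongoose add, so real usage would be higher.

```javascript
const v8 = require("v8");

// Hypothetical helper: raw data size in bytes, ignoring JS object overhead.
function estimateRawBytes(documentCount, avgObjectSizeBytes) {
    return documentCount * avgObjectSizeBytes;
}

const MB = 1024 * 1024;

// Figures taken from the mgodatagen summary table: 10,000,000 documents, 108 bytes each.
const rawBytes = estimateRawBytes(10000000, 108);

// Heap limit of the Node.js process running this script.
const heapLimitBytes = v8.getHeapStatistics().heap_size_limit;

console.log(`raw data:   ${(rawBytes / MB).toFixed(0)} MB`);
console.log(`heap limit: ${(heapLimitBytes / MB).toFixed(0)} MB`);
```

The raw data alone comes to about 1 GB, which is already close to the roughly 1.4 GB at which the heap gave up in the crash log earlier in this article.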
3. Run the test code
The code for the two iteration methods is in test1.js and test2.js, respectively.
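To watch the difference between the two scripts while they run, one option is to print heap usage every so often. The helper below is a sketch and not part of the repository: `createHeapLogger` and its `every` parameter are hypothetical names, and it relies only on Node's built-in `process.memoryUsage()`.

```javascript
// Hypothetical helper: returns a function to call once per document.
// Every `every` calls, it reports the current heap usage in MB.
function createHeapLogger(every, log = console.log) {
    let seen = 0;
    return function onDocument() {
        seen++;
        if (seen % every === 0) {
            const heapMB = process.memoryUsage().heapUsed / 1024 / 1024;
            log(`${seen} documents, heapUsed = ${heapMB.toFixed(1)} MB`);
        }
        return seen;
    };
}

// Usage sketch: a plain loop stands in for the real iteration.
const onDocument = createHeapLogger(2);
for (let i = 0; i < 5; i++) {
    onDocument();
}
```

Calling the returned function once per document inside either loop would show whether the heap stays flat or keeps growing.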
Copyright notice
When reposting, please credit the author, Fundebug, and include the address of this article: https://blog.fundebug.com/2019/03/21/how-to-visit-all-documents-in-a-big-colle