Repost: Syncing MongoDB Data to Elasticsearch in Real Time with Monstache

阿新 • Published: 2022-05-07

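If your business data is stored in MongoDB and you need semantic analysis and large-screen dashboards, Alibaba Cloud Elasticsearch can provide full-text search, semantic analysis, and visualization. This article describes how to use Monstache to sync MongoDB data to Alibaba Cloud Elasticsearch in real time, and then analyze and display that data.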

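Background
This article uses parsing and aggregating popular-movie data as its example. The solution helps you to:
Sync and subscribe to full or incremental data quickly with Monstache.
Sync MongoDB data in real time to a recent version of Elasticsearch.
Understand the commonly used Monstache configuration parameters so that you can apply them to other business scenarios.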

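Advantages

Monstache syncs and subscribes to data in real time based on the MongoDB oplog. It supports syncing between MongoDB and recent Elasticsearch versions, supports MongoDB change streams and aggregation pipelines, and offers a rich feature set.
Monstache supports both soft and hard deletes, as well as database and collection drops, so the Elasticsearch side stays consistent with the source in real time.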

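Procedure
Step 1: Prepare the environment
Deploy MongoDB with replication enabled (a replica set). This is required: Monstache reads changes from the oplog, which only exists when replication is enabled.
Deploy Elasticsearch.
Deploy Kibana as the front end for interacting with Elasticsearch. Use the same version as Elasticsearch.
Install Monstache and edit its configuration file to start syncing.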

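Step 2: Set up the Monstache environment

Install Go and configure the environment variables.
Note: Monstache is written in Go and is built from source in this walkthrough, so prepare a Go environment on the ECS instance first.

Download the Go package and extract it.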

wget https://dl.google.com/go/go1.14.4.linux-amd64.tar.gz
tar -C /usr/local -xzf go1.14.4.linux-amd64.tar.gz

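Configure the environment variables.

Run vim /etc/profile to open the environment configuration file and append the following lines. GOPROXY points to the Alibaba Cloud Go module proxy.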

export GOROOT=/usr/local/go
export GOPATH=/home/go/
export PATH=$PATH:$GOROOT/bin:$GOPATH/bin
export GOPROXY=https://mirrors.aliyun.com/goproxy/

Apply the environment variable configuration.

source /etc/profile

Install Monstache.

Change to the installation path.

cd /usr/local/

Clone the source from the Git repository.

git clone https://github.com/rwynn/monstache.git

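Note: If you see the error git: command not found, run yum install -y git to install Git first.

Change to the monstache directory.

cd monstache

Switch to the target version.
This article uses rel5 as the example.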

git checkout rel5

Build and install Monstache.

go install

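This step may fail with: running gcc failed: exec: "gcc": executable file not found in $PATH

Fix: check whether gcc is installed.

gcc -v

If the command is not found, install gcc (for example, yum install -y gcc on a yum-based system) and run go install again.

Reference: https://blog.csdn.net/benben_2015/article/details/80565676

Check the Monstache version.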

monstache -v

If the installation succeeded, the output looks like this:

5.5.7

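Step 3: Configure the real-time sync task
Monstache is configured in TOML. By default it connects to Elasticsearch and MongoDB on localhost using the default ports and tails the MongoDB oplog. While Monstache is running, every change in MongoDB is synced to Elasticsearch.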
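
To make the sync model concrete, the following Python sketch shows how a single MongoDB change event could be turned into Elasticsearch bulk API lines. It is an illustration only, not Monstache's actual implementation: the helper name to_bulk_actions is hypothetical, the event layout follows MongoDB change stream documents (operationType, ns, documentKey, fullDocument), and the index name is assumed to be the database.collection default.

```python
import json


def to_bulk_actions(event):
    """Translate one change event into Elasticsearch bulk API lines (hypothetical helper)."""
    # Assumed default index name: "<database>.<collection>", lowercased.
    index = "{db}.{coll}".format(**event["ns"]).lower()
    doc_id = str(event["documentKey"]["_id"])
    op = event["operationType"]
    if op in ("insert", "update", "replace"):
        # Assumes the event carries the full document (for updates this
        # requires the change stream to be opened with updateLookup).
        # The _id goes into the action line, not into the document body.
        body = {k: v for k, v in event["fullDocument"].items() if k != "_id"}
        return [
            json.dumps({"index": {"_index": index, "_id": doc_id}}),
            json.dumps(body),
        ]
    if op == "delete":
        return [json.dumps({"delete": {"_index": index, "_id": doc_id}})]
    return []  # other event types are ignored in this sketch


event = {
    "operationType": "insert",
    "ns": {"db": "mydb", "coll": "hotmovies"},
    "documentKey": {"_id": 1},
    "fullDocument": {"_id": 1, "title": "Example Movie"},
}
for line in to_bulk_actions(event):
    print(line)
```

An insert yields an action line plus a document line, while a delete yields only an action line, which is how a removal in MongoDB can be mirrored on the Elasticsearch side.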

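Because this article uses Alibaba Cloud MongoDB and Elasticsearch and needs to specify what to sync (the hotmovies and col collections in the mydb database), the default Monstache configuration must be changed as follows:

Change to the Monstache installation directory, then create and edit the configuration file.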

cd /usr/local/monstache/
vim config.toml

Edit the configuration file based on the following example.
This is a simple configuration. For the full set of options, see Monstache Usage.

# connection settings

# connect to MongoDB using the following URL
mongo-url = 'mongodb://aaa:[email protected]:8010'
# connect to the Elasticsearch REST API at the following node URLs
elasticsearch-urls = ["http://140.179.9.1:8009"]

# frequently required settings

# if you need to seed an index from a collection and not just listen and sync changes events
# you can copy entire collections or views from MongoDB to Elasticsearch
# the collections listed here are copied to Elasticsearch in full
direct-read-namespaces = ["mydb.hotmovies","mydb.col"]

# if you want to use MongoDB change streams instead of legacy oplog tailing use change-stream-namespaces
# change streams require at least MongoDB API 3.6+
# if you have MongoDB 4+ you can listen for changes to an entire database or entire deployment
# in this case you usually don't need regexes in your config to filter collections unless you target the deployment.
# to listen to an entire db use only the database name.  For a deployment use an empty string.
#change-stream-namespaces = ["mydb.col"]

# additional settings

# if you don't want to listen for changes to all collections in MongoDB but only a few
# e.g. only listen for inserts, updates, deletes, and drops from mydb.mycollection
# this setting does not initiate a copy, it is only a filter on the change event listener
#namespace-regex = '^mydb\.col$'
# compress requests to Elasticsearch
#gzip = true
# generate indexing statistics
#stats = true
# index statistics into Elasticsearch
#index-stats = true
# use the following PEM file for connections to MongoDB
#mongo-pem-file = "/path/to/mongoCert.pem"
# disable PEM validation
#mongo-validate-pem-file = false
# use the following user name for Elasticsearch basic auth
# elasticsearch-user = "elastic"
# use the following password for Elasticsearch basic auth
# elasticsearch-password = "<your_es_password>"
# use 4 go routines concurrently pushing documents to Elasticsearch
elasticsearch-max-conns = 4
# use the following PEM file for connections to Elasticsearch
#elasticsearch-pem-file = "/path/to/elasticCert.pem"
# validate connections to Elasticsearch
#elasticsearch-validate-pem-file = true
# propagate dropped collections in MongoDB as index deletes in Elasticsearch
dropped-collections = true
# propagate dropped databases in MongoDB as index deletes in Elasticsearch
dropped-databases = true
# do not start processing at the beginning of the MongoDB oplog
# if you set the replay to true you may see version conflict messages
# in the log if you had synced previously. This just means that you are replaying old docs which are already
# in Elasticsearch with a newer version. Elasticsearch is preventing the old docs from overwriting new ones.
#replay = false
# resume processing from a timestamp saved in a previous run
resume = true
# do not validate that progress timestamps have been saved
#resume-write-unsafe = false
# override the name under which resume state is saved
#resume-name = "default"
# resume strategy: 0 (the default) resumes from timestamps, 1 resumes from tokens
# tokens work with MongoDB API 3.6+ while timestamps work only with MongoDB API 4.0+
resume-strategy = 0
# exclude documents whose namespace matches the following pattern
#namespace-exclude-regex = '^mydb\.ignorecollection$'
# turn on indexing of GridFS file content
#index-files = true
# turn on search result highlighting of GridFS content
#file-highlighting = true
# index GridFS files inserted into the following collections
#file-namespaces = ["users.fs.files"]
# print detailed information including request traces
verbose = true
# enable clustering mode
# cluster-name = 'es-cn-mp91kzb8m00******'
# do not exit after full-sync, rather continue tailing the oplog
#exit-after-direct-reads = false
[[mapping]]
namespace = "mydb.hotmovies"
index = "hotmovies"
type = "movies"

[[mapping]]
namespace = "mydb.col"
index = "mydbcol"
type = "collection"

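Parameters

Parameter	Description
mongo-url	Connection address of the primary node of the MongoDB instance.
elasticsearch-urls	Endpoint of the Alibaba Cloud Elasticsearch instance, in the format http://<internal endpoint of the Alibaba Cloud Elasticsearch instance>:9200.
direct-read-namespaces	Collections to sync in full; see direct-read-namespaces in the Monstache documentation. This article syncs the hotmovies and col collections in the mydb database.
change-stream-namespaces	Required if you want to use MongoDB change streams. When it is set, oplog tailing is disabled.
namespace-regex	A regular expression that selects the collections to listen to. Only changes in collections whose namespace matches the expression are monitored.
elasticsearch-user	Username for the Alibaba Cloud Elasticsearch instance. The default is elastic.
Caution: Avoid using the elastic user in production because it weakens security. Create a dedicated user and grant it only the roles and permissions it needs; see the Alibaba Cloud documentation on managing user permissions with Elasticsearch X-Pack roles.
elasticsearch-password	Password of that user. The password of the elastic user is set when the instance is created. If you forget it, you can reset it; see the Alibaba Cloud documentation on resetting the instance access password for precautions and steps.
elasticsearch-max-conns	Number of concurrent connections to Elasticsearch. The default is 4, meaning four goroutines push data to Elasticsearch at the same time.
dropped-collections	Defaults to true, which means that dropping a MongoDB collection also deletes the corresponding Elasticsearch index.
dropped-databases	Defaults to true, which means that dropping a MongoDB database also deletes the corresponding Elasticsearch indexes.
resume	Defaults to false. When set to true, Monstache writes the timestamp of the MongoDB operations that were successfully synced to Elasticsearch into the monstache.monstache collection. If Monstache stops unexpectedly, the sync task can resume from that timestamp, which avoids data loss. This parameter is enabled automatically when cluster-name is set; see resume.
resume-strategy	Resume strategy. Takes effect only when resume is true; see resume-strategy.
verbose	Defaults to false, which means debug logging is disabled.
cluster-name	Cluster name. When set, Monstache runs in high-availability mode and processes with the same cluster name coordinate with each other; see cluster-name.
mapping	Elasticsearch index mapping. By default, data synced from MongoDB is indexed under the name database.collection. Use this parameter to change the index name; see Index Mapping.
Note: Monstache supports many more parameters. The configuration above uses only a subset of them to achieve real-time sync. For more complex requirements, see Monstache config and Advanced.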
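
As a rough illustration of how namespace-regex and mapping interact, the Python sketch below filters namespaces with the pattern from the sample configuration and resolves the target index name. The helpers is_watched and resolve_index and the MAPPINGS table are hypothetical and only mirror the sample configuration. Monstache itself is written in Go, whose regex engine differs from Python's, although this simple pattern behaves the same in both.

```python
import re

# Pattern taken from the commented-out namespace-regex line in the sample config.
NAMESPACE_REGEX = re.compile(r"^mydb\.col$")

# Namespace -> index overrides, mirroring the two [[mapping]] tables above.
MAPPINGS = {"mydb.hotmovies": "hotmovies", "mydb.col": "mydbcol"}


def is_watched(namespace):
    """Return True if change events for this namespace pass the filter."""
    return NAMESPACE_REGEX.search(namespace) is not None


def resolve_index(namespace):
    """Return the mapped index name, or the database.collection default."""
    # Lowercased because Elasticsearch index names must be lowercase.
    return MAPPINGS.get(namespace, namespace.lower())


for ns in ["mydb.col", "mydb.col2", "mydbXcol", "mydb.hotmovies"]:
    print(ns, is_watched(ns), resolve_index(ns))
```

The escaped dot and the ^ and $ anchors matter: without them the pattern would also match namespaces such as mydb.col2 or mydbXcol. Note that the filter only affects the change listener; it does not trigger a copy of existing data.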

Original article: https://help.aliyun.com/document_detail/171650.html
Reference: https://www.cnblogs.com/agopher/p/15704630.html
