Syncing MongoDB data to Elasticsearch with Monstache, with high availability
Requirements & problem statement
- We need to sync MongoDB data to Elasticsearch in real time (including data changes). After evaluating AWS DMS and Monstache, we tentatively chose the Monstache plugin for data synchronization.
What is Monstache?
Monstache is a plugin written in Go that performs real-time data synchronization and subscription based on the MongoDB oplog, supporting data sync between MongoDB and Elasticsearch. MongoDB must be deployed as a replica set.
Practice
Monstache is driven by a configuration file with a rich set of options. Start Monstache by pointing it at a config file, conventionally named config.toml, such as the following:
```toml
# connection settings

# print detailed information including request traces
# enable debug logging; this must be near the top of the file,
# otherwise logs will not reach the log files
verbose = true

# connect to MongoDB using the following URL
# MongoDB connection address; a replica set is required
mongo-url = "mongodb://192.168.7.51:27021"
#"mongodb://root:<your_mongodb_password>@dds-bp1aadcc629******.mongodb.rds.aliyuncs.com:3717"

# connect to the Elasticsearch REST API at the following node URLs
elasticsearch-urls = ["http://localhost:9200"]

# frequently required settings

# if you need to seed an index from a collection and not just listen and sync change events
# you can copy entire collections or views from MongoDB to Elasticsearch
# namespaces to watch, in the form <database>.<collection>
direct-read-namespaces = ["mssiot_forum_merossbeta.f_posts"]

# if you want to use MongoDB change streams instead of legacy oplog tailing use change-stream-namespaces
# change streams require at least MongoDB API 3.6+
# if you have MongoDB 4+ you can listen for changes to an entire database or entire deployment
# in this case you usually don't need regexes in your config to filter collections unless you target the deployment.
# to listen to an entire db use only the database name. For a deployment use an empty string.
#change-stream-namespaces = ["mydb.col"]

# additional settings

# if you don't want to listen for changes to all collections in MongoDB but only a few
# e.g. only listen for inserts, updates, deletes, and drops from mydb.mycollection
# this setting does not initiate a copy, it is only a filter on the change event listener
#namespace-regex = '^mssiot_forum_merossbeta\.f_posts$'

# compress requests to Elasticsearch
#gzip = true

# generate indexing statistics
#stats = true
# index statistics into Elasticsearch
#index-stats = true

# use the following PEM file for connections to MongoDB
#mongo-pem-file = "/path/to/mongoCert.pem"
# disable PEM validation
#mongo-validate-pem-file = false

# use the following user name for Elasticsearch basic auth
elasticsearch-user = "elastic"
# use the following password for Elasticsearch basic auth
#elasticsearch-password = "<your_es_password>"

# use 8 go routines concurrently pushing documents to Elasticsearch
# maximum number of concurrent connections Monstache opens to Elasticsearch (default 4)
elasticsearch-max-conns = 8

# use the following PEM file for connections to Elasticsearch
#elasticsearch-pem-file = "/path/to/elasticCert.pem"
# validate connections to Elasticsearch
#elastic-validate-pem-file = true

# propagate dropped collections in MongoDB as index deletes in Elasticsearch
dropped-collections = false
# propagate dropped databases in MongoDB as index deletes in Elasticsearch
dropped-databases = false

# do not start processing at the beginning of the MongoDB oplog
# if you set replay to true you may see version conflict messages
# in the log if you had synced previously. This just means that you are replaying old docs which are already
# in Elasticsearch with a newer version. Elasticsearch is preventing the old docs from overwriting new ones.
#replay = false

# resume processing from a timestamp saved in a previous run
resume = true
# do not validate that progress timestamps have been saved
#resume-write-unsafe = false
# override the name under which resume state is saved
#resume-name = "default"
# use a custom resume strategy (tokens) instead of the default strategy (timestamps)
# tokens work with MongoDB API 3.6+ while timestamps work only with MongoDB API 4.0+
resume-strategy = 0

# exclude documents whose namespace matches the following pattern
#namespace-exclude-regex = '^mydb\.ignorecollection$'

# turn on indexing of GridFS file content
#index-files = true
# turn on search result highlighting of GridFS content
#file-highlighting = true
# index GridFS files inserted into the following collections
#file-namespaces = ["users.fs.files"]

# enable clustering mode
# the Monstache cluster name; essential for high-availability mode
cluster-name = 'merossdev'

# worker mode
#workers = ["Tom", "Dick", "Harry"]

# do not exit after full-sync, rather continue tailing the oplog
#exit-after-direct-reads = false

namespace-regex = '^mssiot_forum_merossbeta\.(f_posts|\$cmd)$'

# map a MongoDB namespace (<database>.<collection>) to a custom index name
[[mapping]]
namespace = "mssiot_forum_merossbeta.f_posts"
index = "f_posts"

# logging to files is essential in production; Monstache logs to stdout by
# default, so point it at log files here (learned the hard way!)
#[logs]
#info = "/var/monstache/log/info.log"
#warn = "/var/monstache/log/warn.log"
#error = "/var/monstache/log/error.log"
#trace = "/var/monstache/log/trace.log"
```
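The namespace-regex filter in the config is easy to sanity-check on its own, since it is an ordinary regular expression applied to the event namespace (database.collection). A minimal Python sketch, using the exact pattern from this config:

```python
import re

# The change-event filter from the config: only events whose namespace
# (database.collection) matches this pattern are processed.
NAMESPACE_REGEX = re.compile(r'^mssiot_forum_merossbeta\.(f_posts|\$cmd)$')

def is_watched(namespace: str) -> bool:
    """Return True if a change event for this namespace would be kept."""
    return NAMESPACE_REGEX.match(namespace) is not None

print(is_watched("mssiot_forum_merossbeta.f_posts"))   # True: the watched collection
print(is_watched("mssiot_forum_merossbeta.$cmd"))      # True: command namespace (e.g. drops)
print(is_watched("mssiot_forum_merossbeta.comments"))  # False: filtered out
print(is_watched("otherdb.f_posts"))                   # False: wrong database
```

Including `\$cmd` in the alternation is what lets collection-level commands (such as drops) in this database pass the filter.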
Start it with monstache -cluster-name merossdev -f config.toml; monstache is a precompiled binary, as shown in the screenshot below.
Now write a document into MongoDB and then query Elasticsearch, as shown in the screenshot below.
In practice, other document operations in MongoDB (updates, deletes, and so on) are synchronized to Elasticsearch in the same way.
Monstache high availability: normal mode and multi-worker mode. In both cases cluster-name must be set in the configuration file: cluster-name = "your custom Monstache cluster name".
1. Normal mode
Principle (from the official docs): When cluster-name is given monstache will enter a high availability mode. Processes with cluster name set to the same value will coordinate. Only one of the processes in a cluster will sync changes. The other processes will be in a paused state. If the process which is syncing changes goes down for some reason, one of the processes in paused state will take control and start syncing.
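The coordination described above can be illustrated with a small simulation. This is only a sketch, not Monstache's actual implementation (the real plugin coordinates through state stored in MongoDB): processes sharing a cluster name compete for a single lease; the holder syncs, the rest stay paused, and when the holder dies a paused process takes over.

```python
class Cluster:
    """Toy stand-in for the shared cluster state Monstache keeps in MongoDB."""
    def __init__(self, name):
        self.name = name
        self.leader = None  # the one process currently allowed to sync

class Process:
    def __init__(self, pid, cluster):
        self.pid, self.cluster, self.alive = pid, cluster, True

    @property
    def state(self):
        return "syncing" if self.cluster.leader is self else "paused"

    def try_acquire(self):
        # Only one process in the cluster may hold the lease at a time.
        if self.cluster.leader is None and self.alive:
            self.cluster.leader = self

    def crash(self):
        self.alive = False
        if self.cluster.leader is self:
            self.cluster.leader = None  # the lease is released

cluster = Cluster("merossdev")
p1, p2 = Process(1, cluster), Process(2, cluster)
for p in (p1, p2):
    p.try_acquire()
print(p1.state, p2.state)   # prints "syncing paused"

p1.crash()                  # the active process goes down...
p2.try_acquire()            # ...and a paused process takes control
print(p2.state)             # prints "syncing"
```

The key property, which the verification steps below demonstrate against the real binary, is that at most one process per cluster name is ever in the syncing state.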
Run monstache -cluster-name merossdev -f config.toml twice in a terminal to start two Monstache processes: one process actively syncs while the other is paused, as shown in the screenshots below.
The screenshot above shows the active (syncing) process.
The screenshot above shows the paused process.
Now kill the active process to verify that the paused process takes over and starts syncing, as shown below.
The screenshot above confirms that the formerly paused process has switched to the active state.
2. Multi-worker mode
Principle (from the official docs): workers — You can run multiple monstache processes and distribute the work between them. First configure the names of all the workers in a shared config.toml file. You can run monstache in high availability mode by starting multiple processes with the same value for cluster-name. Each process will join a cluster which works together to ensure that a monstache process is always syncing to Elasticsearch. In other words, multiple workers cooperate: all workers under the same cluster name sync data concurrently, and none of them is paused. If two processes share both the cluster name and the worker name, one of them syncs while the other is paused; when the active one dies, the paused process with the same name is promoted to syncing. You cannot start a process with a worker name that is not in the workers list. See the documentation: https://rwynn.github.io/monstache-site/advanced/#high-availability
Prerequisite: declare the workers in the configuration file: workers = ["Tom", "Dick", "Harry"]
Run the following commands, one per worker:

```shell
monstache -cluster-name HA -worker Tom -f config.toml
monstache -cluster-name HA -worker Dick -f config.toml
monstache -cluster-name HA -worker Harry -f config.toml
```
Verification: write 10,000 documents into MongoDB at once. Monstache hashes each document id across the set of workers and hands the document to exactly one of them, as shown below.
MongoDB now holds 10,000 documents, waiting to be synced to Elasticsearch.
After starting the three workers, each worker syncs a comparable share of the data.
Querying Elasticsearch confirms that all 10,000 documents have been synced.
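The distribution step above can be sketched as follows. This is an illustrative approximation rather than Monstache's exact algorithm: each document id is hashed deterministically and mapped to one worker, so the 10,000 documents land roughly evenly across Tom, Dick and Harry, and the same id always goes to the same worker.

```python
import hashlib

WORKERS = ["Tom", "Dick", "Harry"]  # must match `workers` in config.toml

def assign_worker(doc_id: str) -> str:
    """Deterministically map a document id to one worker.
    (Illustrative only; Monstache's actual hash function may differ.)"""
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return WORKERS[int(digest, 16) % len(WORKERS)]

# Simulate the 10,000-document test: count how many ids each worker receives.
counts = {w: 0 for w in WORKERS}
for i in range(10000):
    counts[assign_worker(f"doc-{i}")] += 1

print(counts)  # each worker receives roughly a third of the documents
assert assign_worker("doc-42") == assign_worker("doc-42")  # assignment is stable
```

Determinism is the important property: because the assignment depends only on the document id, no two workers ever process the same change, and no coordination per document is needed.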
Comparing normal mode and multi-worker mode
1. Normal mode
Advantage: relatively simple to deploy.
Disadvantage: slower processing, because only one worker does the syncing; you can, however, configure more goroutines to push documents (elasticsearch-max-conns), which partly compensates for not having multiple workers.
2. Multi-worker mode
Advantage: near-real-time sync, because multiple workers run in parallel and each worker can additionally use multiple goroutines to push documents, giving higher concurrency.
Disadvantage: more complicated to deploy.
Conclusion: since the two modes differ little in sync time in practice (normal mode synced 10,000 documents in about 1.5 seconds in our local test) and normal mode is simpler to deploy, we ultimately chose normal mode.
For deploying Monstache on EKS, see: https://www.cnblogs.com/agopher/p/15704633.html
References:
Official documentation: https://rwynn.github.io/monstache-site/advanced/#high-availability
Hands-on guide (Alibaba Cloud): https://help.aliyun.com/document_detail/171650.html#title-8gf-qh2-3qj