Tick or Tock? Keeping Time and Order in Distributed Databases
https://en.pingcap.com/blog/Time-in-Distributed-Systems
Preface
At re:Invent 2017, Amazon Web Services (AWS) announcedAmazon Time Sync Service, a highly accurate and reliable time reference that is natively accessible from Amazon EC2 instances. It is much like the GoogleTrueTimepublished in 2012. Why do Google and AWS both want to make efforts to provide global time service? Is there any inspiration for building distributed database? This topic is important to think about.
Time synchronization remains a hard nut to crack in distributed systems, especially for distributed databases such asTiDBwhere time is used to confirm the order of the transaction to guarantee the ACID compliance.
In this post, I will introduce the existing solutions to tackle the time synchronization issue in distributed systems, as well as their pros and cons. I will also share why we chose to use the timestamp oracle (TSO) from
Order of the events
Linearizabilityis important for distributed systems, especially for distributed databases. We can't allow reading stale value after the update. For example, if account A has $100, and transfers $10 to B. After the transaction finishes, we can't read $100 again from account A.
Another simple explanation is: if we write a data at timeT1
likeset a = 10
, afterT1
, we must always read10
as the value ofa
, not11
or any other values.
But how can we ensure we can read the newest data? How can we know the order of two events?
We usually use "happened before"(hb
or->
) to describe the relationship of two causal events. For two eventse1
ande2
,e1 -> e2
, we can say thate1
happened beforee2
, ore1
causese2
.
If we have only one process, determining the order is easy, because all the events can only happen in the process in sequence. However, in a distributed system, things become more complex. The events may happen in different places, and it becomes hard to determine the order of all the events.
At first, we may consider using wall time, but the time is not the same in all the processes -- one process may run faster, and the other process may walk slower. So we can't use the time directly to check the order of the events. Luckily, we have other ways to do it.
Logical clock
A simple way is to uselogical clockwhich was proposed by Lamport in 1978 for timestamping and ordering events in a distributed system. The algorithm is:
-
Every process starts with an initialized clock counter.
-
A process increments the value of its counter before each event in that process.
-
When a process sends a message, it includes the counter value with the message.
-
When a process receives a message, it increases its clock counter value to be bigger than both its current clock counter value and the value in the received message.
-
If two events have the same clock value, we may think they happen simultaneously so we should use process ID or any other unique ID to differentiate them.
Using this can easily determine the order of events. For example, assuming we have processesP1
,P2
, both with an initialized counter value0
.
-
P1
increases the counter value to1
and executes eventA
. -
P2
wants to do something but needs to know the result of the eventA
, so it increases its counter to1
and sends a message with counter1
toP1
. -
P1
receives the message and increases its counter to2
, then increases the counter to3
and replies toP2
. -
P2
receives the message and increases the counter to4
then executes eventB
with counter5
.
We useC(A)
andC(B)
as the counter value of the events, if eventsA
andB
happen in one process, andA
happens beforeB
, we can know thatC(A) < C(B)
. IfA
andB
happen in different processes, we can also knowC(A) < C(B)
based on the message, so if A happens before B, we can inferC(A) < C(B)
. But ifC(A) < C(B)
, it doesn't necessarily mean that A happens before B.
If the two events are not causally related (no communication between the processes), we can't determine the order of the events. We can use vector clock to fix this. But whether it's logical clock or vector clock, they both have a disadvantage: we can't know what time the event happens because both of the two clocks merely record the order of the events instead of the time.
To get the chronological order of the events, we have to go back to square one and use real time. But we can't depend on Network Time Protocol (NTP) directly because it has some errors and the time is not accurate, so what should we do?
TrueTime
In GoogleSpanner, it uses TrueTime API to fix the problem of time. Spanner uses GPS and Atomic Clock to correct the time and can guarantee clock uncertainty bound (ε) is very small. The ε value is less than 7ms in 2012; it may be less now.
The TrueTime API is very simple:
Method | Return |
TT.now() | TTinterval: [earliest, latest] |
TT.after(t) | true if t has definitely passed |
TT.before(t) | true if t has definitely not arrived |
Spanner can't get an accurate time point but only an interval. We can use functionnow
to get an interval. Assuming eventA
happens at the time pointtt.a
, andtt.b
is for eventB
. We can know thattt.a.earliest <= tt.a <= tt.a.latest
. For eventA
andB
, ifA
happens beforeB
, we can infertt.a.latest < tt.b.earliest
.
Because TrueTime has the clock uncertainty boundε
, so for every transaction commit, it must wait2ε
time to guarantee linearizability, butε
is so small that the performance is still high.
The biggest hurdle to adopting TrueTime is that it depends on special hardware, such as GPS clocks and atomic clocks, which many companies do not have.
Download TiDBSubscribe to BlogHybrid Logical Clock
Hybrid logical clock(HLC) is another way for timekeeping and timestamping in distributed systems.
Based on NTP, HLC can only read time from NTP, but it won't change it. HLC contains two parts: physical clock + logical clock. For example, assuming:
-
pt
: the physical time -
l
: the maximum ofpt
information learned so far -
c
: the logical clock
To compare the order of two events, we can first check theirl
time, if equal, we can checkc
time, for any two eventse
andf
, ife
happened beforef
, we can know(l.e, c.e) < (l.f, c.f)
.
The HLC algorithm for nodej
:
-
Initialize
l.j = 0
andc.j = 0
when nodej
starts up. -
Send a message to another node, or a local event happens:
l'.j = l.j l.j = max(l'.j, pt.j) if (l.j = l'.j) then c.j = c.j + 1 else c.j = 0 Timestamp with l.j, c.j
-
Receive a message from node
m
.l'.j = l.j l.j = max(l'.j, l.m, pt.j) if (l.j = l'.j = l.m) then c.j = max(c.j, c.m) + 1 else if (l.j = l'.j) then c.j = c.j + 1 else if (l.j = l.m) then c.j = c.m + 1 else c.j = 0 Timestamp with l.j, c.j
As we can see, HLC is very easy to implement and doesn't depend on hardware. But HLC is not the silver bullet to solve the time synchronization problem of distributed systems. HLC still needs to guarantee|l.e - pt.e| <= ε
to make HLC bounded, because sometimes the user wants to use the physical timestamp to query the events directly, and if the HLC is not unbounded, we can't know whether the event happens or not at this time.
HLC still has a bound valueε
, so for the transaction, we still need to handle the wait time problem which exists in Spanner with TrueTime.
To tolerate the NTP synchronization error to the greatest extent, we may use a big value forε
. Some system uses 250 ms or 500 ms by default. Obviously, these default values are big and can cause a high latency for the transaction. The largeε
value has little impact when supporting multiple data centers because the network latency is high as well and might be even higher than the value of the clock offset. When we send a message to the remote node, we don't need to wait for too much time after we subtract the network transfer time.
But what can we do if the NTP is not working as expected? Start panicking? Or just ignore this error?
Why we choose TSO?
As a distributed relational database,TiDBsupports cross-instance transactions by using an optimized two-phase commit protocol (2PC) fromGoogle Percolator. In the practical implementation, we adopt a centralized control service --Placement Driver (PD)-- to allocate the monotonically increasing timestamps, same as Percolator which uses a timestamp oracle (TSO) service to do so. We decide to use TSO from the beginning and use PD to allocate the timestamp. The main reason is that it is very easy to implement a correct and high-performance TSO service.
Using TSO to allocate timestamp is simple, but it has the following disadvantages:
- Single point of failure
- Network latency
For problem 1, we embedetcdand use theRaftconsensus algorithm to make the service highly available and consistent.
For problem 2, because we have no way to break the law of physics, the network latency cannot be avoided.
There might be some concern that the network latency might impact performance. This concern is valid, but it depends on the user's specific deployment scenario:
-
If we deploy our entire TiDB cluster within one data center, the network latency is not a problem. In most cases, one network query only takes 0.1 or 0.2 ms. This is very fast and also means that using TSO can have lower latency than HLC or TrueTime.
-
If we deploy a TiDB cluster across multiple data centers, we still have some ways to overcome this problem. We can deploy TiDB servers and PD servers together so that getting timestamp is still fast with minimal impact on performance between the client and the TiDB server.
Summary
We don't rule out TrueTime or even HLC altogether, but given all the factors considered, we believe TSO is the best existing solution for us to solve the time synchronization problem right now even if it is not optimal.
Of course, if we could break the laws of physics, like transferring data at the superluminal speed, the network latency of TSO would no longer be a problem. But by the time we can do that, we may be racking our minds around how to synchronize time across multiple planets!