
 The challenge of data growth at Criteo

As you can see in the graph above, the block count suddenly increased by 24M. The pressure it added on the Namenode's 150GB JVM heap became too high and caused the outage.

We discovered that the incident was caused by the activity of a single user: their MapReduce job's partitioner was not properly configured and created millions of blocks.
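
The article does not say exactly how the partitioner was misconfigured, so the following is an illustration only: a hypothetical job whose partitioning spreads the data over far too many reduce tasks. Each reduce task writes its own part-r-* file, and every file costs the Namenode at least one inode and one block in its heap, which is how a single job can add millions of blocks at once. The class name and the reducer count below are made up for the example.

```java
// Hypothetical sketch, not the actual Criteo job: an identity MapReduce job
// whose partitioning spreads the input over millions of reduce tasks. Each
// reduce task produces its own small output file, so the Namenode suddenly
// has to track millions of new files and blocks in its heap.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class ExplodingPartitionsJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "exploding-partitions");
        job.setJarByClass(ExplodingPartitionsJob.class);

        // Default identity mapper/reducer: TextInputFormat keys are byte
        // offsets (LongWritable), values are lines (Text).
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // The damaging part: a hash partitioner combined with millions of
        // reduce tasks means millions of tiny part-r-* files, i.e. millions
        // of new blocks for the Namenode to keep in memory.
        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(5_000_000);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```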

However, identifying and cleaning it up first required restarting the Namenode…

But how do you restart a Namenode when its JVM heap is full and you don't have any more memory available?

Our FSImages were about 13GB gzipped at that time, and a Namenode startup took ~45 minutes.

On top of that, the Namenodes were constantly busy handling thousands of RPC requests.

Therefore, we first isolated the Namenode at the network level and fine-tuned the JVM to optimize its behaviour at startup time. It was a long process, with about an hour of waiting between each startup attempt, but we eventually succeeded in restarting the Namenodes.

While restarting the Namenodes, we also discovered that at least one transaction in the edit logs was corrupted. Fortunately, with 5 journal nodes and 2 Namenodes, we were able to find a valid FSImage and restart from a point after that corrupted transaction.

The next step was to identify the root cause, fix it, then restore the production JVM settings and gradually remove the network isolation.

That whole process took us 36 hours of work in rotating shifts.

Following that incident, we focused our efforts on ensuring that we would never have to cope with such a big incident again.

The impact of logical RAID on Namenodes

Not long after that incident, we experienced another issue on our PA4 Namenodes.

At that time, we used logical RAID on spinning disks to store the FSImage.

At some point, the logical RAID's sanity checks started affecting the cluster, even though they were supposed to consume idle IOPS only.

That incident convinced us to improve our Namenode hardware.

The road to a 600M-object Namenode

We started to think about ways to secure our production to avoid such incidents in the future.

Our first decision was to upgrade the Namenode hardware: SSDs, faster CPUs (a lot of single-threaded code had caused minor outages) and much more memory, so as to always keep around twenty GB available in case the JVM ever needed it.

Beyond that, we made two important changes in our daily work:

  • looking more closely at the Hadoop source code and gradually backporting more and more patches to keep our production in good shape
  • switching our attention to JVM performance, both for our infrastructure and to ensure our clients use our resources efficiently

Moving to a specialized JVM

Focusing more on JVM performance led us to further optimize the Namenode JVM. With FSImages of our size, this was critical to secure failovers and restarts, but also to keep RPC response times low at all times.

Tuning Parallel and CMS GCs

Well tuned, the Parallel GC allowed us to achieve an 18-minute startup, 10-second GC pauses with a 20-second standard deviation, and a rate of one GC every 6 minutes.

The CMS GC allowed us to achieve a 10-minute startup, 2-second GC pauses with a 3-second standard deviation, and a rate of one GC every 3 minutes.
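
The article doesn't list the exact flags we used; the snippet below is a minimal hadoop-env.sh sketch of the kind of CMS tuning involved, assuming a Java 8 HotSpot JVM, with illustrative heap size and values only.

```sh
# Minimal sketch of a CMS-tuned Namenode (hadoop-env.sh); values are illustrative.
export HADOOP_NAMENODE_OPTS="-Xms150g -Xmx150g \
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+CMSParallelRemarkEnabled \
  -Xloggc:/var/log/hadoop/namenode-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  ${HADOOP_NAMENODE_OPTS}"
```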

Tuning G1 GC

The last one was the G1 GC. It allowed us to achieve an 11-minute startup, ~630ms GC pauses with a 2-second standard deviation, and a rate of one GC every 30 seconds, plus one long mixed GC (30 seconds to 1 minute) once per day.
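
Again, the exact configuration isn't given in the article; this is a minimal hadoop-env.sh sketch of the style of G1 setup implied, assuming a Java 8 HotSpot JVM, with illustrative values only.

```sh
# Minimal sketch of a G1-tuned Namenode (hadoop-env.sh); values are illustrative.
export HADOOP_NAMENODE_OPTS="-Xms150g -Xmx150g \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=500 \
  -XX:InitiatingHeapOccupancyPercent=65 \
  -XX:G1HeapRegionSize=32m \
  -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5 \
  ${HADOOP_NAMENODE_OPTS}"
```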

Clearly, that was the best deal for our low RPC response time target.

Later, we realized that the G1 GC had a large native memory footprint with the Namenode, which prevented us from using more than 330GB of the machine's 512GB of memory.

We also observed bad behaviour from the G1 GC when the number of blocks increased, which required new tuning work, and again when the block count decreased.

Once or twice a day, there was also a series of mixed GCs taking up to 2–3 minutes. Further tuning to reduce these mixed GCs led to an increase in young GCs.

We decided that this recurrent tuning work, driven by the cluster's activity, was not sustainable over time.

We needed a better solution.

Moving to Azul Zing GC

The Zing GC requires little configuration. Our tests showed increased CPU usage and slower response times compared to G1, but on average the performance is similar enough.

The upside is a lower native memory footprint, allowing us to vertically scale our Namenode heap a bit more if needed, and a much more stable GC duration.