Why status codes matter in data delivery · Segment Blog

Segment is a hub for a tremendous amount of data. It processes peaks of 230,000 events per second inbound, and 280,000 events per second outbound between more than 200 integration partners. You may think of Segment as a black box for delivering all this data. You send data once to its tracking API, and it coordinates translating data and delivering it to many destinations.

When everything works perfectly, you don’t need to open the black box. Unfortunately, the world of data delivery at scale is far from perfect. Think of all the software, networks, databases and engineers behind Segment and our partners. You can imagine at any given time a database is failing, a network is unreachable, etc.

Segment engineering has gone to great lengths to operate reliably in this environment. Our latest efforts have been around visibility into the HTTP response codes from destinations. We spent the last few months adding hooks to measure everything from the volume of events and how quickly they are sent to destinations to the HTTP status code and error response body, if any, returned for every request.

This instrumentation is ultimately for Segment users to see into the black box to answer one question: how do delivery challenges affect my data? To this end, we built an event delivery dashboard around the data.

It turns out the data in aggregate is also tremendously useful to cloud service engineers at Segment and our destination partners alike. Looking at HTTP status codes alone has unveiled lots of insights on how data flows between services and how we can maximize delivery rates.

I’d like to share some of the things we have found in a day of HTTP responses that we see at Segment.

Success!

First, the good news: 92.6% of events (24.4 billion on the sample day) are delivered on the first attempt. In this happy path, Segment makes an HTTP request to a destination and receives an HTTP 2XX success status code response.

Terminal problems

Next, the bad news. 5.5% of events — 1.4 billion on the sample day — never make it to their destination. In this path, Segment makes an HTTP request to a destination and generally receives an HTTP 4XX client error status code response. These codes indicate the client — either Segment or the user it represents — made an error that the server can’t reconcile.

What’s the password?

The most common client errors Segment sees are HTTP 401 Unauthorized and HTTP 403 Forbidden on 3.8% of requests. In this case, the server doesn’t recognize the given username, password or token, and can’t accept any data. Neither Segment nor the destination server can resolve this automatically for a given request.

This is either due to wrong credentials configured in Segment in the first place or credentials that expired on a destination. Segment always attempts to send the latest events just in case the problem was resolved on either side.

No comprende

The next most common client error is HTTP 400 Bad Request on 0.51% of requests. In this case, the server received the request payload but couldn’t make sense of it. These are generally validation errors. Again, Segment and the destination can’t do anything about it automatically, except show instructive error messages to the user.

{
  "error": {
    "message": "(#100) Value of param custom_events[0][_eventName] must match regex pattern: /^[0-9a-zA-Z_][0-9a-zA-Z _-]{0,39}$/",
    "code": 100
  }
}

Next steps…

These errors are considered fatal, but the qualitative data can inform ways to improve delivery over time. The first big step here was building the event delivery dashboard to surface these issues to users.

For authentication errors, a logical next step would be to send notifications when delivery begins to encounter 401 errors. We can also imagine a mechanism to disable event delivery after a threshold to spare partners the request overhead.
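As a rough illustration of that idea, here is a minimal Python sketch. It is not how Segment works today; the class name `AuthFailureTracker`, the threshold of 100, and the action strings it returns are all assumptions made for the example.

```python
class AuthFailureTracker:
    """Hypothetical tracker: notify on the first auth error, disable after a threshold."""

    AUTH_ERRORS = (401, 403)

    def __init__(self, threshold: int = 100):  # the threshold is an assumed value
        self.threshold = threshold
        self.consecutive_failures = 0
        self.disabled = False

    def record(self, status: int) -> str:
        """Record one response status and return the suggested action."""
        if status in self.AUTH_ERRORS:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.disabled = True
                return "disable"
            if self.consecutive_failures == 1:
                return "notify"
            return "none"
        if 200 <= status <= 299:
            # A success means the credentials work again, so start counting over.
            self.consecutive_failures = 0
        return "none"

    def reset(self) -> None:
        """Re-enable delivery, for example after the user updates credentials."""
        self.consecutive_failures = 0
        self.disabled = False
```

Once delivery is disabled no further requests are made, so the sketch relies on an explicit `reset()` to start sending again.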

For validation errors, visibility into requests per-customer and per-destination can inform improvements to the Segment integration code. Segment can review partner API requirements and not attempt to deliver data it can determine is bad ahead of time, or automatically massage data to conform to the destination API.
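To make that concrete, here is a small Python sketch of checking an event name before sending it. The pattern is copied from the sample error response above and applies only to that destination; the function names and the cleanup rules in `massage_event_name` are hypothetical choices for illustration, not Segment's integration code.

```python
import re
from typing import Optional

# Pattern from the sample error message above. fullmatch() is used instead of
# ^...$ because "$" in Python also matches just before a trailing newline.
EVENT_NAME_RE = re.compile(r"[0-9a-zA-Z_][0-9a-zA-Z _-]{0,39}")


def is_valid_event_name(name: str) -> bool:
    """Return True if the destination would accept this event name."""
    return EVENT_NAME_RE.fullmatch(name) is not None


def massage_event_name(name: str) -> Optional[str]:
    """Try to make a name conform; return None if it cannot be salvaged."""
    # Assumed cleanup rules: replace disallowed characters with "_",
    # drop leading spaces and hyphens, and truncate to 40 characters.
    cleaned = re.sub(r"[^0-9a-zA-Z _-]", "_", name).lstrip(" -")[:40]
    return cleaned if is_valid_event_name(cleaned) else None
```

A caller could skip delivery when `massage_event_name` returns `None` and surface the reason to the user instead.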

Ephemeral problems

Now the interesting challenge… a large class of HTTP problems on the internet are not fatal. In fact, most of the HTTP 5XX server error status codes reflect an unexpected error and imply that the system may accept data at a later time, as does one critical 4XX status code.

Volume

The largest class of temporary problems seen by Segment is HTTP 429 Too Many Requests. It’s not hard to imagine why…

Segment itself has very high rate limits with the aim of accepting all of the data a customer throws at it. Not every downstream destination has the same capabilities, particularly those that are systems of record. Intercom, Zendesk, and Mailchimp, for example, all have well-designed but lower API rate limits.

Segment has to mediate between the customer data volume and the destination rate limits. A combination of internal metering, request batching, and retry with backoff gets most of the data through.
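The post doesn't describe how the internal metering works. One common way to stay under a destination's rate limit is a token bucket, sketched below in Python. The class name, the parameters, and the injectable clock are assumptions for the example rather than anything Segment has published.

```python
import time


class TokenBucket:
    """Allow roughly rate_per_sec requests per second, with bursts up to capacity."""

    def __init__(self, rate_per_sec: float, capacity: float, now=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._now = now            # injectable clock, handy for testing
        self.tokens = capacity     # start with a full bucket
        self.updated = now()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Return True if a request may be sent now, False if it should wait."""
        current = self._now()
        elapsed = current - self.updated
        self.updated = current
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A sender would call `try_acquire()` before each request and queue the event when it returns `False`, which avoids spending a request on a likely 429.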

But about 7.3% of requests — 2.1 billion a day — encounter a 429 response along the way. Retries help a lot, but if a customer is simply over their limits consistently over a long enough time frame, Segment has no choice but to drop some messages. At least we can quantify how much this is happening with the delivery data and report this to a customer.

Out of service

The next largest class of error — 1.3% of requests — is from destination servers. Segment often sees servers respond with an error like:

  • HTTP 502 Bad Gateway

  • HTTP 504 Gateway Timeout

  • HTTP 500 Internal Server Error

  • HTTP 503 Service Unavailable

Perhaps it’s a temporary glitch for a single request, or perhaps the destination service is experiencing an outage. But every day Segment encounters 371 million of these error responses.

Unreliable channel

Finally, 1.1% of requests error out because of the network layer. At scale, Segment sees a noticeable number of network errors, such as:

  • ENOTFOUND — hostname not found

  • ECONNREFUSED — connection refused

  • ECONNRESET — connection reset

  • ECONNABORTED — connection aborted

  • EHOSTUNREACH — host unreachable

  • EAI_AGAIN — DNS lookup timeout

Maybe it’s due to a bad host, a flaky network, or a DNS error.
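A delivery client has to decide what to do with each of these codes. The Python sketch below keeps the list above in a lookup table. Treating every listed error as transient is an assumption for the example; the function names are hypothetical, and a real client might handle a persistent ENOTFOUND differently.

```python
# Error codes and descriptions taken from the list above.
NETWORK_ERRORS = {
    "ENOTFOUND": "hostname not found",
    "ECONNREFUSED": "connection refused",
    "ECONNRESET": "connection reset",
    "ECONNABORTED": "connection aborted",
    "EHOSTUNREACH": "host unreachable",
    "EAI_AGAIN": "DNS lookup timeout",
}


def is_transient_network_error(code: str) -> bool:
    """Assumption: every error in the table is worth retrying."""
    return code in NETWORK_ERRORS


def describe_network_error(code: str) -> str:
    """Return a human-readable description for logs or dashboards."""
    return NETWORK_ERRORS.get(code, "unknown error")
```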

If at first you don’t succeed…

As seen above, a significant number of HTTP or network status codes indicate transient problems. When Segment encounters these, it retries delivery over a 4-hour window with exponential backoff. We can see that this retry strategy is successful. We go from 92.6% success on the first attempt to 93.9% success after ten attempts, an extra 163 million events delivered, all thanks to the destination server sending proper HTTP status codes.
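The post gives the retry window (4 hours) and the attempt count (ten) but not the exact schedule. The Python sketch below shows one way an exponential backoff schedule could be computed within those bounds. The 30-second base delay, the doubling factor, and the function name are assumptions, not Segment's actual settings.

```python
def backoff_delays(max_attempts=10, base_seconds=30.0, factor=2.0,
                   window_seconds=4 * 60 * 60):
    """Return the wait before each retry, keeping the total inside the window."""
    delays = []
    elapsed = 0.0
    # The first attempt happens immediately, so there are max_attempts - 1 retries.
    for retry in range(max_attempts - 1):
        delay = base_seconds * factor ** retry
        if elapsed + delay > window_seconds:
            break  # the next retry would land outside the window
        delays.append(delay)
        elapsed += delay
    return delays
```

With these assumed defaults the window, not the attempt count, is the limiting factor, so a real schedule would need a smaller base or factor to fit all ten attempts. Production clients also commonly add random jitter to each delay so that retries from many senders don't arrive at the destination in lockstep.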

WTF webhooks

Finally, we see some bizarre errors. A very popular destination is webhooks — arbitrary HTTP addresses to POST events. The error codes we see from these destinations imply webhooks might not always follow best practices.

We see every number from 1 to 101 as an HTTP status code, nearly all of which fall outside the HTTP status code specification. Perhaps this is someone testing Segment delivery rates themselves?

We see HTTP 418 I'm a teapot, which originated in an April Fools' joke RFC.

We see normal HTTP SSL errors like CERT_HAS_EXPIRED and more esoteric ones like UNABLE_TO_VERIFY_LEAF_SIGNATURE and DEPTH_ZERO_SELF_SIGNED_CERT.

Unfortunately, all of these strange responses are considered terminal errors by Segment. Sorry, webhooks!

Conclusion

It’s literally impossible to achieve 100% delivery on the first attempt over the internet. Transient network errors, unexpected server errors, and rate limiting all present challenges that add up to significant problems at scale. On top of that, encryption, authentication and data validation add another layer of challenges for perfect machine-to-machine delivery.

Retries are the primary strategy to improve delivery, and a retry strategy can only be as good as the destination service response codes.

As a service provider, returning status codes like 400, 403 or 501 is a powerful signal that Segment has no choice but to drop data. Inversely, returning status codes like 500, 502, and 504 is a strong hint that Segment should try again. And 429 (rate limit exceeded) is an explicit sign that Segment needs to retry later.
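The guidance above can be summarized as a small decision function. This Python sketch follows the rules described in this post (2XX delivered, 429 and most 5XX retried, everything else dropped); the names `Outcome` and `classify_status` are hypothetical, and Segment's real logic may include cases not covered here.

```python
from enum import Enum


class Outcome(Enum):
    DELIVERED = "delivered"
    RETRY = "retry"
    DROP = "drop"


def classify_status(status: int) -> Outcome:
    """Map an HTTP status code to what a delivery client should do next."""
    if 200 <= status <= 299:
        return Outcome.DELIVERED
    if status == 429:
        # Rate limited: the destination may accept the data later.
        return Outcome.RETRY
    if 500 <= status <= 599 and status != 501:
        # Most server errors are unexpected and possibly temporary.
        # 501 is excluded because the post lists it as a signal to drop data.
        return Outcome.RETRY
    # Other 4XX codes and anything unrecognized are treated as terminal.
    return Outcome.DROP
```

For example, `classify_status(503)` returns `Outcome.RETRY`, while `classify_status(401)` and an out-of-spec code such as `classify_status(42)` both return `Outcome.DROP`.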

If you’re running cloud service APIs or writing webhooks, think carefully about HTTP status codes. User data depends on it!

For more information about cloud service APIs, visit Segment’s Destinations catalog at https://segment.com/catalog#integrations/all