We use the retry pattern when an operation fails because of a temporary failure (also called a transient failure) that retrying is likely to fix.

Transient failures are temporary errors that occur in distributed systems. They can be caused by a variety of factors, such as:

  1. Network problems: packet loss, routing errors, and timeouts.
  2. Software failures: crashes and memory leaks.
  3. Resource failures: temporary unavailability of a VM or disk, and cold-start problems in serverless instances.
  4. Infrastructure failures: power outages and hardware issues.

We should handle transient failures; otherwise they can cause decreased throughput, increased latency, outages, and pipeline failures.

So our code should be resilient enough to handle transient failures, using the strategies listed below.

Retry Strategies:

  • retry the naive way
  • retry a specific number of times, then fail
  • retry with backoff (sometimes used after the above fails)

Let's talk about these patterns in detail:

1. Retry:

You keep hammering the server, service, or API until the issue resolves, with no delay between attempts.

Examples include database connections and failed requests between two services.

[Figure: how the naive retry works]
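The naive retry described above can be sketched in a few lines of Python. This is only an illustration: the function name `retry_forever` and the choice of `ConnectionError` and `TimeoutError` as the "transient" exception types are assumptions made for this example, not part of any particular library.

```python
def retry_forever(operation, transient_errors=(ConnectionError, TimeoutError)):
    """Call `operation` until it succeeds, retrying immediately on failure.

    `operation` is any zero-argument callable (hypothetical, for illustration).
    Only the exception types in `transient_errors` trigger a retry; any other
    exception propagates to the caller.
    """
    while True:
        try:
            return operation()
        except transient_errors:
            # No delay and no attempt limit: try again right away.
            continue
```

Note that this loop never gives up, so a failure that is not actually transient will keep it spinning forever, which is the main reason to prefer the bounded variants below.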

2. Retry a Specific Number of Times:

We implement this strategy when we don't want to hammer the server. For example, we retry 4 times and then stop.

[Figure: how retrying a specific number of times works]
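A minimal Python sketch of the bounded retry might look like the following. The name `retry_with_limit` and the exception types are assumptions for illustration, and "retry 4 times" is interpreted here as one initial call plus up to four retries.

```python
def retry_with_limit(operation, max_retries=4,
                     transient_errors=(ConnectionError, TimeoutError)):
    """Call `operation`, retrying at most `max_retries` times on failure.

    `operation` is any zero-argument callable (hypothetical, for illustration).
    When the retries are exhausted, the last transient error is re-raised.
    """
    retries_used = 0
    while True:
        try:
            return operation()
        except transient_errors:
            if retries_used >= max_retries:
                # Out of retries: stop and let the caller see the failure.
                raise
            retries_used += 1
```

Re-raising the last error, rather than returning a sentinel such as `None`, keeps the failure visible to the caller, who can then decide whether to fall back to another strategy.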

3. Retry with Backoff:

We usually fall back to Retry with Backoff when retrying a specific number of times fails.

Whenever you hit an issue, you keep retrying, with a longer sleep delay before each new attempt.

Increasing the delay every time reduces the load that we put on the server.

[Figure: how retry with backoff works]
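As a rough sketch, retry with backoff could be written as below. The post only says the delays increase; doubling the delay each time (exponential backoff), the default values, the cap `max_delay`, and the name `retry_with_backoff` are all assumptions chosen for this example. The `sleep` parameter is injectable so the function can be exercised without actually waiting.

```python
import time


def retry_with_backoff(operation, max_retries=4, base_delay=0.5, factor=2.0,
                       max_delay=30.0,
                       transient_errors=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Call `operation`, sleeping longer before each successive retry.

    `operation` is any zero-argument callable (hypothetical, for illustration).
    The delay starts at `base_delay` seconds, is multiplied by `factor` after
    every failed attempt, and is never allowed to exceed `max_delay`.
    """
    delay = base_delay
    retries_used = 0
    while True:
        try:
            return operation()
        except transient_errors:
            if retries_used >= max_retries:
                # Out of retries: surface the last error to the caller.
                raise
            sleep(min(delay, max_delay))  # give the server time to recover
            delay *= factor
            retries_used += 1
```

With the defaults above, an operation that keeps failing is attempted five times in total, with waits of 0.5, 1, 2, and 4 seconds between attempts.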

In the next blog post, we will discuss the Circuit Breaker pattern, which can be used to handle non-transient failures.


Category: Distributed Systems