Non-transient (non-temporary) failures in distributed systems are issues that persist for a long time, take longer to recover from, and can even make the system unavailable.
We discussed the retry pattern in the previous blog post, but the retry pattern cannot handle non-transient failures.
The problem with non-transient failures is that they can bring down many services when those services depend on each other.
For example, look at the image below, where Services 2 and 3 require Service 1 to succeed and Service 4 requires Service 3 to succeed.
If Service 1 fails, every service that depends on it fails too. This is deadly and can cause the entire system to break down.
It is not just the database that might take a long time to recover. The service that has to respond may also be hammered with a lot of requests.
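To make the cascade concrete, here is a minimal sketch of the dependency chain from the example. The service names and the `affected_by` helper are hypothetical, chosen only for illustration; the sketch simply walks the dependency map to find every service that goes down with the failed one.

```python
# Hypothetical dependency map mirroring the example: each service lists what it needs.
DEPENDS_ON = {
    "service_1": [],
    "service_2": ["service_1"],
    "service_3": ["service_1"],
    "service_4": ["service_3"],
}


def affected_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    affected: set[str] = set()
    changed = True
    while changed:  # keep sweeping until no new service is marked as affected
        changed = False
        for service, deps in depends_on.items():
            if service == failed or service in affected:
                continue
            if any(dep == failed or dep in affected for dep in deps):
                affected.add(service)
                changed = True
    return affected


print(sorted(affected_by("service_1", DEPENDS_ON)))
# ['service_2', 'service_3', 'service_4']
```

A failure in Service 1 takes all three other services with it, while a failure in Service 4 affects nothing else.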
Some common causes of non-transient failures include:
- Network congestion
- Reboot failures
- Resource blockage or exhaustion: when other services are heavily using the service we are calling, resources such as CPU, memory, and storage may no longer be sufficient
- Outages that last very long
- Dependency failures
- Hardware issues, such as a severed network cable
Retrying during a non-transient failure is not advised: it puts a lot of load on the servers and wastes computing resources while we simply wait for the issue to resolve.
So how do we handle non-transient failures?
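A rough sketch of that wasted load, under assumed numbers: 100 clients, each allowed 5 retries, all calling a dependency that never recovers during the test. The `always_down` and `call_with_retries` functions are hypothetical stand-ins, not code from the previous post.

```python
def always_down():
    """Hypothetical dependency in a non-transient failure: every call fails."""
    raise ConnectionError("service unavailable")


def call_with_retries(operation, max_retries: int) -> int:
    """Call `operation`, retrying on failure; return how many attempts were made."""
    attempts = 0
    for _ in range(1 + max_retries):  # one initial call plus the retries
        attempts += 1
        try:
            operation()
            return attempts  # success: stop retrying
        except ConnectionError:
            continue  # failure: try again immediately (no backoff, for simplicity)
    return attempts


# 100 clients, each allowed 5 retries, all hitting the same dead dependency.
wasted_calls = sum(call_with_retries(always_down, max_retries=5) for _ in range(100))
print(wasted_calls)  # 600
```

Every one of those 600 calls fails, and each one still costs the struggling server a connection and some work.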
- We stop sending requests to the failing server for some time, until it recovers.
This is the idea behind the Circuit Breaker pattern. Unlike the retry pattern, a Circuit Breaker sits between the client and the server.
The Circuit Breaker gives the failing component some time to recover, and it conserves resources in the meantime.
A Circuit Breaker can be in one of the following states:
- Closed circuit:
  - This is the initial state of the Circuit Breaker. The circuit is closed, so requests pass from one service to the other and the responses are successful.
- Open circuit:
  - This is the second state, in which the two services are unable to communicate because the responses received are failures. The circuit is open, so requests from one service to the other are blocked.
- Half-open circuit:
  - This is the third state, and it can only be entered from the Open state.
  - It is used for testing: the service sends a test request to the other service to check the connection. If the request succeeds, the circuit moves to the Closed state; otherwise, it returns to the Open state.
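The three states can be sketched as a small state machine. This is a minimal illustration, not a production implementation: the `CircuitBreaker` class, the `failure_threshold` and `recovery_timeout` parameters, and the injectable `clock` are all assumptions made for this example. In particular, the text above does not say how many failures open the circuit or how long it stays open, so both are left configurable.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed: failures before opening
        self.recovery_timeout = recovery_timeout    # assumed: seconds to stay open
        self.clock = clock                          # injectable so it can be faked in tests
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            self.state = "half_open"  # timeout elapsed: let one test request through

        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success (including the half-open test request) closes the circuit.
        self.state = "closed"
        self.failures = 0

    def _on_failure(self):
        self.failures += 1
        # A failed test request, or too many failures while closed, opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

A caller would wrap each outgoing request, for example `breaker.call(lambda: fetch_profile(user_id))`, where `fetch_profile` is a hypothetical function that contacts the other service. While the circuit is open, the call raises `CircuitOpenError` immediately instead of reaching the failing server.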