Blink Identity - High throughput, privacy preserving identification service.

Developing Fault Tolerant Systems

The range of environments that we deploy software to at Blink Identity is wide and continues to grow. From bare metal systems to mobile devices to various levels of virtualization and cloud environments, this flexibility allows us to rapidly build and deploy solutions for a broad range of problems. As our deployment landscape evolves, however, we encounter an increasing number of ways in which both systems and the environments they are deployed to can fail.

We’d like to present a high-level summary of the ideas we follow in our development process to contain and adapt to various modes of failure. These ideas guide the tools we select, the way we structure our software, and the processes by which we deploy and maintain our solutions.

Separation Of Concerns

The solutions that we build are generally composed of a set of independent components, which are themselves often composed of smaller components. When deciding how to scope each component, a primary driver is ensuring a separation of concerns. This means that a component should have access to only the information and functionality it needs to perform its assigned task.

Separation of concerns contributes to fault tolerance by isolating the effect of a component’s failure from other, unrelated components, which reduces the risk that a “single point of failure” cascades into a complete system outage.

To give an example that we observed, a recent DNS outage at Cloudflare interrupted our ability to communicate with a ticketing provider that we had integrated with on behalf of a customer. While this outage caused an explicit failure in our component that was pulling ticket information, it had no effect on the components responsible for enrolling or matching users.
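The kind of isolation described above can be sketched in a few lines. This is a minimal illustration, not our production code: the component names (`fetch_tickets`, `enroll_user`, `match_user`) and the `run_components` helper are hypothetical, and the ticketing failure is simulated by raising an exception.

```python
from typing import Callable, Dict


def fetch_tickets() -> str:
    # Hypothetical component: simulates the ticketing provider being unreachable.
    raise ConnectionError("DNS lookup for ticketing provider failed")


def enroll_user() -> str:
    # Hypothetical component with no dependency on the ticketing provider.
    return "enrolled"


def match_user() -> str:
    # Hypothetical component with no dependency on the ticketing provider.
    return "matched"


def run_components(components: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Run each component independently and record its outcome."""
    results: Dict[str, str] = {}
    for name, component in components.items():
        try:
            results[name] = component()
        except Exception as exc:
            # The failure is reported explicitly but stays inside this component.
            results[name] = f"failed: {exc}"
    return results


results = run_components(
    {"tickets": fetch_tickets, "enroll": enroll_user, "match": match_user}
)
for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

The ticket component reports its failure, while enrollment and matching complete normally because neither depends on anything the ticket component owns.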

Idempotency

“The Two Generals Problem” is commonly cited to illustrate the difficulty of ensuring reliable communication over an unreliable channel. When two systems need to communicate with each other across a network, it quickly becomes difficult to verify that a message was received once, and only once.

The idea of idempotency is that systems should be able to gracefully handle the same message being delivered multiple times. Suppose you publish a tweet on Twitter, but your cellular network drops out shortly after you hit the send button. The Twitter server may have received your tweet, but you aren’t sure, so you hit send again. You now see two identical tweets. This is an example of a non-idempotent process: the overall outcome changes based on how many times you hit that send button.

Let’s revisit a slightly modified example where, instead of publishing a tweet, you are attempting to delete a tweet that you previously made. You hit the delete button, but due to an unreliable connection you can’t be sure that the tweet was actually deleted, so you hit it again. No matter how many times that delete button is hit, the outcome is the same as long as the Twitter server received at least one of the delete messages. This is an example of an idempotent process.
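The tweet example can be modeled with a toy in-memory server. This is only a sketch to make the distinction concrete: the `TweetServer` class and its methods are hypothetical and say nothing about how Twitter is actually implemented.

```python
from typing import Dict


class TweetServer:
    """Toy in-memory stand-in for a tweet service (hypothetical)."""

    def __init__(self) -> None:
        self.tweets: Dict[int, str] = {}
        self._next_id = 1

    def publish(self, text: str) -> int:
        # Not idempotent: every call stores a new tweet, even for identical text.
        tweet_id = self._next_id
        self._next_id += 1
        self.tweets[tweet_id] = text
        return tweet_id

    def delete(self, tweet_id: int) -> None:
        # Idempotent: deleting a tweet that is already gone changes nothing.
        self.tweets.pop(tweet_id, None)


server = TweetServer()

# The user hits "send" twice because the first response never arrived.
first = server.publish("hello")
second = server.publish("hello")
count_after_publish = len(server.tweets)  # two identical tweets now exist

# The user hits "delete" twice for the same tweet.
server.delete(first)
server.delete(first)
count_after_delete = len(server.tweets)  # only the first tweet was removed

print(count_after_publish, count_after_delete)
```

Repeating `publish` changes the outcome, while repeating `delete` does not, which is the property the two scenarios above describe.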

At Blink Identity we regularly have to communicate over unreliable channels, so we build our systems with idempotency in mind whenever possible. We use various network communication frameworks that allow us to guarantee “at least once” delivery of messages. While this introduces the possibility of messages being delivered more than once, it ensures that messages are not lost due to network unreliability. Idempotency, in turn, ensures that receiving a message multiple times does not put the system into a failure state.
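One common way to pair “at least once” delivery with idempotent handling is for the sender to retry until it sees an acknowledgement and for the receiver to ignore message IDs it has already processed. The sketch below illustrates that pattern under stated assumptions: the `Receiver` class, the `send_at_least_once` function, and the scripted channel outcomes are all hypothetical, and this is not a description of the frameworks we use.

```python
from typing import List, Set


class Receiver:
    """Processes each message ID at most once, however often it is delivered."""

    def __init__(self) -> None:
        self.seen_ids: Set[str] = set()
        self.processed: List[str] = []
        self.deliveries = 0  # counts every delivery, including duplicates

    def handle(self, message_id: str, payload: str) -> None:
        self.deliveries += 1
        if message_id in self.seen_ids:
            return  # duplicate delivery: safely ignored
        self.seen_ids.add(message_id)
        self.processed.append(payload)


def send_at_least_once(
    receiver: Receiver, message_id: str, payload: str, channel_outcomes: List[str]
) -> int:
    """Resend until an acknowledgement arrives; return the number of attempts.

    channel_outcomes scripts the unreliable channel, one entry per attempt:
    "request_lost", "ack_lost", or "delivered".
    """
    attempts = 0
    for outcome in channel_outcomes:
        attempts += 1
        if outcome == "request_lost":
            continue  # the receiver never saw the message, so retry
        receiver.handle(message_id, payload)
        if outcome == "delivered":
            return attempts  # acknowledgement received, stop retrying
        # "ack_lost": the receiver processed the message, but the sender
        # cannot tell, so it retries and causes a duplicate delivery.
    raise TimeoutError("no acknowledgement received")


receiver = Receiver()
attempts = send_at_least_once(
    receiver,
    "msg-1",
    "enroll user 42",
    ["request_lost", "ack_lost", "delivered"],
)
print(attempts, receiver.deliveries, receiver.processed)
```

In this run the message is sent three times and reaches the receiver twice, yet it is processed only once. The trade-off is that the receiver has to remember which IDs it has seen.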

The ability to tolerate both expected and unexpected failures in a clear and controlled manner is a cornerstone of our software development process. Not only does it minimize the maintenance and support that the product requires, it also allows us to keep pushing the boundaries of the platforms and environmental constraints that our product can operate under.