Challenges
Designing cloud-native applications following a microservices approach requires thinking differently about how to build, deploy, and operate them. We can’t just build our application thinking we know all the ways it will fail and then just prevent those. In complex systems like those built with microservices, we must be able to deal with uncertainty. This section will identify five main things to keep in mind when developing microservices.
Design for Faults
In complex systems, things fail. Hard drives crash, network cables get unplugged, we do maintenance on the live database instead of the backups, and VMs disappear. Single faults can be propagated to other parts of the system and result in cascading failures that take an entire system down.
Traditionally, when building applications, we’ve tried to predict what pieces of our app (e.g., n-tier) might fail and build up a wall big enough to keep things from failing. This mindset is problematic at scale because we cannot always predict what things can go wrong in complex systems. Things will fail, so we must develop our applications to be resilient and handle failure, not just prevent it. We should be able to deal with faults gracefully and not let faults propagate to total failure of the system.
Building distributed systems is different from building shared-memory, single-process, monolithic applications. One glaring difference is that communication over a network is not the same as a local call with shared memory. Networks are inherently unreliable. Calls over the network can fail for any number of reasons (e.g., signal strength, bad cables/routers/switches, and firewalls), and this can be a major source of bottlenecks. Not only does network unreliability affect response times to clients of your service, but it can also contribute to failures in upstream systems.
Latent network calls can be very difficult to debug; ideally, if your network calls cannot complete successfully, they fail immediately, and your application notices quickly (e.g., through IOException). In this case we can quickly take corrective action, provide degraded functionality, or just respond with a message stating the request could not be completed properly and that users should try again later. But errors in network requests or distributed applications aren’t always that easy. What if the downstream application you must call takes longer than normal to respond? This is killer because now your application must take into account this slowness by throttling requests, timing out downstream requests, and potentially stalling all calls through your service. This backup can cause upstream services to experience slowdown and grind to a halt. And it can cause cascading failures.
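One way to keep a slow downstream call from stalling our own service is to bound how long we wait and fall back to a degraded answer. The following is a minimal sketch, assuming Java 9 or later; the class name TimeBoundedCall and the method callWithFallback are hypothetical names chosen for illustration, and the "downstream call" is just a Supplier standing in for a real network request.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper (not from any library): bounds the wait on a
// downstream call and degrades to a fallback value instead of stalling.
final class TimeBoundedCall {
    private TimeBoundedCall() {}

    static <T> T callWithFallback(Supplier<T> downstream, long timeoutMillis, T fallback) {
        return CompletableFuture.supplyAsync(downstream)
                // Slow call: stop waiting after the timeout and use the fallback.
                .completeOnTimeout(fallback, timeoutMillis, TimeUnit.MILLISECONDS)
                // Failed call: the exception is swallowed and the fallback is used.
                .exceptionally(error -> fallback)
                .join();
    }
}
```

A caller might write TimeBoundedCall.callWithFallback(() -> fetchPrice(id), 200, cachedPrice), where fetchPrice and cachedPrice are again placeholders. Note that this sketch only stops waiting; the slow call itself keeps running in the background, so a production service would also need to cancel or limit such calls.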
Design with Dependencies in Mind
To be able to move fast and be agile from an organization or distributed-systems standpoint, we have to design systems with dependency thinking in mind; we need loose coupling in our teams, in our technology, and our governance. One of the goals with microservices is to take advantage of autonomous teams and autonomous services. This means being able to change things as quickly as the business needs without impacting those services around you or the system at large. This also means we should be able to depend on services, but if they’re not available or are degraded, we need to be able to handle this gracefully.
In his book Dependency Oriented Thinking (InfoQ Enterprise Software Development Series), Ganesh Prasad hits it on the head when he says, “One of the principles of creativity is to drop a constraint. In other words, you can come up with creative solutions to problems if you mentally eliminate one or more dependencies.” The problem is our organizations were built with efficiency in mind, and that brings a lot of tangled dependencies along.
For example, when you need to consult with three other teams to make a change to your service (DBA, QA, and Security), this is not very agile; each one of these synchronization points can cause delays. It’s a brittle process. If you can shed those dependencies or build them into your team (we definitely can’t sacrifice safety or security, so build those components into your team), you’re free to be creative and more quickly solve problems that customers face or the business foresees without costly people bottlenecks.
Another angle to the dependency management story is what to do with legacy systems. Exposing details of backend legacy systems (COBOL copybook structures, XML serialization formats used by a specific system, etc.) to downstream systems is a recipe for disaster. Making one small change (customer ID is now 20 numeric characters instead of 16) now ripples across the system and invalidates assumptions made by those downstream systems, potentially breaking them. We need to think carefully about how to insulate the rest of the system from these types of dependencies.
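One common way to insulate the rest of the system is to confine knowledge of the legacy format to a single translation point. The sketch below assumes a made-up fixed-width record (ID field followed by a name field); Customer and LegacyCustomerTranslator are hypothetical names, and the layout is illustrative rather than a real copybook.

```java
// Hypothetical domain type that other services see. The ID is an opaque
// string, so its length can change without breaking consumers.
final class Customer {
    final String id;
    final String name;

    Customer(String id, String name) {
        this.id = id;
        this.name = name;
    }
}

// Hypothetical translator: the only code that knows the legacy record layout.
final class LegacyCustomerTranslator {
    private final int idWidth; // width of the ID field in the legacy record

    LegacyCustomerTranslator(int idWidth) {
        this.idWidth = idWidth;
    }

    Customer fromLegacyRecord(String record) {
        if (record.length() < idWidth) {
            throw new IllegalArgumentException("record shorter than ID field");
        }
        String id = record.substring(0, idWidth).trim();
        String name = record.substring(idWidth).trim();
        return new Customer(id, name);
    }
}
```

With this shape, a change from 16 to 20 ID characters becomes new LegacyCustomerTranslator(20) in one place, while everything that consumes Customer is untouched.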
Design with the Domain in Mind
Models have been used for centuries to simplify and understand a problem through a certain lens. For example, the GPS maps on our phones are great models for navigating a city while walking or driving. This model would be completely useless to someone flying a commercial airplane. The models they use are more appropriate to describe waypoints, landmarks, and jet streams. Different models make more or less sense depending on the context from which they’re viewed. Eric Evans’s seminal book Domain-Driven Design (Addison-Wesley, 2004) helps us build models for complex business processes that can also be implemented in software. Ultimately the real complexity in software is not the technology but rather the ambiguous, circular, contradicting models that business folks sort out in their heads on the fly. Humans can understand models given some context, but computers need a little more help; these models and the context must be baked into the software. If we can achieve this level of modeling that is bound to the implementation (and vice versa), anytime the business changes, we can more clearly understand how that change is reflected in the software. The process we embark upon to build these models and the language surrounding it take time and require fast feedback loops.
One of the tools Evans presents is identifying and explicitly separating the different models and ensuring they’re cohesive and unambiguous within their own bounded context.
A bounded context is a set of domain objects that implement a model that tries to simplify and communicate a part of the business, code, and organization. For example, we strive for efficiency when designing our systems when we really need flexibility (sound familiar?). In a simple auto-part application, we try to come up with a unified “canonical model” of the entire domain, and we end up with objects like Part, Price, and Address. If the inventory application used the “Part” object, it would be referring to a type of part, like a type of “brake” or “wheel.” In an automotive quality assurance system, Part might refer to a very specific part with a serial number and unique identifier to track certain quality test results and so forth. We tried diligently to efficiently reuse the same canonical model, but inventory tracking and quality assurance are different business concerns that use the Part object in semantically different ways. With bounded contexts, the inventory application’s Part would explicitly be modeled as PartType and be understood within that context to represent a “type of part,” not a specific instance of a part. With two separate bounded contexts, these Part objects can evolve consistently within their own models without depending on one another in weird ways, and thus we’ve achieved a level of agility or flexibility.
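The inventory and quality assurance example can be sketched in code. This is only an illustration: the outer classes InventoryContext and QualityContext stand in for what would normally be separate packages or services, and every field and method name here is an assumption rather than part of any real system.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for the inventory bounded context.
final class InventoryContext {
    private InventoryContext() {}

    // A *type* of part (a brake, a wheel) with a stock count.
    static final class PartType {
        final String sku;
        final String description;
        final int quantityOnHand;

        PartType(String sku, String description, int quantityOnHand) {
            this.sku = sku;
            this.description = description;
            this.quantityOnHand = quantityOnHand;
        }
    }
}

// Stand-in for the quality assurance bounded context.
final class QualityContext {
    private QualityContext() {}

    // One *specific* physical part, tracked by serial number.
    static final class Part {
        final String serialNumber;
        final String typeSku; // refers to the inventory concept by identifier only
        private final Map<String, Boolean> testResults = new LinkedHashMap<>();

        Part(String serialNumber, String typeSku) {
            this.serialNumber = serialNumber;
            this.typeSku = typeSku;
        }

        void recordTestResult(String testName, boolean passed) {
            testResults.put(testName, passed);
        }

        // True only if at least one test was recorded and none failed.
        boolean passedAllTests() {
            return !testResults.isEmpty() && !testResults.containsValue(Boolean.FALSE);
        }
    }
}
```

Neither class references the other's type; the quality context holds only the SKU string, so each model can change (say, inventory adding warehouse locations) without forcing a change in the other.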
This deep understanding of the domain takes time. It may take a few iterations to fully understand the ambiguities that exist in business models and properly separate them out and allow them to change independently. This is at least one reason starting off building microservices is difficult. Carving up a monolith is no easy task, but a lot of the concepts are already baked into the monolith; your job is to identify them and carve them up. With a greenfield project, you cannot carve up anything until you deeply understand it. In fact, the microservice success stories we hear about (like Amazon and Netflix) all started out going down the path of the monolith before they successfully made the transition to microservices.
Design with Promises in Mind
In a microservice environment with autonomous teams and services, it’s very important to keep in mind the relationship between service provider and service consumer. As an autonomous service team, you cannot place obligations on other teams and services because you do not own them; they’re autonomous by definition. All you can do is choose whether or not to accept their promises of functionality or behavior. As a provider of a service to others, all you can do is promise them a certain behavior. They are free to trust you or not. Promise theory, a model first proposed by Mark Burgess in 2004 and covered in his book In Search of Certainty (O’Reilly, 2015), is a study of autonomous systems including people, computers, and organizations providing service to each other.
In terms of distributed systems, promises help articulate what a service may provide and make clear what assumptions can and cannot be made. For example, our team owns the book-recommendation service, and we promise a personalized set of book recommendations for a specific user you may ask about. What happens when you call our service, and one of our backends (the database that stores that user’s current view of recommendations) is unavailable? We could throw exceptions and stack traces back to you, but that would not be a very good experience and could potentially blow up other parts of the system. Because we made a promise, we can try to do everything we can to keep it, including returning a default list of books, or a subset of every book. There are times when promises cannot be kept, and identifying the best course of action should be driven by the experience or outcome we want to preserve for our users. The key here is that the onus is on our service to try to keep its promise (return some recommendations), even if our dependent services cannot keep theirs (the database was down). In the course of trying to keep a promise, it helps to have empathy for the rest of the system and the service quality we’re trying to uphold.
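The book-recommendation promise might look like the following in code. This is a sketch under stated assumptions: RecommendationService, its constructor argument, and the default list are all hypothetical, and the backing database is represented by a plain Function so the example stays self-contained.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical service that tries to keep its promise ("you always get
// some recommendations") even when its own backend breaks its promise.
final class RecommendationService {
    // Placeholder default list, used when personalization is unavailable.
    static final List<String> DEFAULT_BOOKS =
            List.of("Domain-Driven Design", "In Search of Certainty");

    // Stand-in for the database lookup: user ID -> personalized titles.
    private final Function<String, List<String>> personalizedStore;

    RecommendationService(Function<String, List<String>> personalizedStore) {
        this.personalizedStore = personalizedStore;
    }

    List<String> recommendationsFor(String userId) {
        try {
            List<String> personalized = personalizedStore.apply(userId);
            if (personalized == null || personalized.isEmpty()) {
                return DEFAULT_BOOKS; // nothing stored for this user
            }
            return personalized;
        } catch (RuntimeException backendFailure) {
            // Backend is down: degrade to defaults rather than leaking
            // exceptions and stack traces to the caller.
            return DEFAULT_BOOKS;
        }
    }
}
```

The caller never sees the backend failure; whether a default list is the right degraded behavior is a product decision, and a real service would also log or count these fallbacks so the outage is visible to operators.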
Another way to look at a promise is as an agreed-upon exchange that provides value for both parties (like a producer and a consumer). But how do we go about deciding between two parties what is valuable and what promises we’d like to agree upon? If nobody calls our service or gets value from our promises, how useful is the service? One way of articulating the promise between consumers and providers is driving promises with consumer-driven contracts. With consumer-driven contracts, we are able to capture the value of our promises with code or assertions and as a provider, we can use this knowledge to test whether we’re upholding our promises.
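To make the idea of capturing promises "with code or assertions" concrete, here is a deliberately tiny, hand-rolled sketch. It is not the API of any contract-testing tool; ConsumerContract, expect, and verify are hypothetical names, and the provider response is modeled as a simple Map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical, minimal consumer-driven contract: the consumer states
// what it relies on, and the provider runs those checks in its own tests.
final class ConsumerContract {
    private final String consumer;
    private final Map<String, Predicate<Map<String, Object>>> expectations =
            new LinkedHashMap<>();

    ConsumerContract(String consumer) {
        this.consumer = consumer;
    }

    // Registers one expectation the consumer has of the provider's response.
    ConsumerContract expect(String description, Predicate<Map<String, Object>> check) {
        expectations.put(description, check);
        return this;
    }

    // Returns a description of every expectation the response breaks;
    // an empty list means the promise to this consumer is upheld.
    List<String> verify(Map<String, Object> providerResponse) {
        List<String> broken = new ArrayList<>();
        for (Map.Entry<String, Predicate<Map<String, Object>>> e : expectations.entrySet()) {
            if (!e.getValue().test(providerResponse)) {
                broken.add(consumer + ": " + e.getKey());
            }
        }
        return broken;
    }
}
```

A consumer team would publish a contract such as "the response carries at least one book," and the provider would run verify against a real response in its build. Because only the fields consumers actually declare are checked, the provider stays free to change everything else.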
Distributed Systems Management
At the end of the day, managing a single system is easier than a distributed one. If there’s just one machine, and one application server, and there are problems with the system, we know where to look. If you need to make a configuration change, upgrade to a specific version, or secure it, it’s still all in one physical and logical location. Managing, debugging, and changing it is easier. A single system may work for some use cases; but for ones where scale is required, we may look to leverage microservices. As we discussed earlier, however, microservices are not free; the trade-off for having flexibility and scalability is having to manage a complicated system.
Some quick questions about the manageability of a microservices deployment:
- How do we start and stop a fleet of services?
- How do we aggregate logs/metrics/SLAs across microservices?
- How do we discover services in an elastic environment where they can be coming, going, moving, etc.?
- How do we do load balancing?
- How do we learn about the health of our cluster or individual services?
- How do we restart services that have fallen over?
- How do we do fine-grained API routing?
- How do we secure our services?
- How do we throttle or disconnect parts of a cluster if it starts to crash or act unexpectedly?
- How do we deploy multiple versions of a service and route to them appropriately?
- How do we make configuration changes across a large fleet of services?
- How do we make changes to our application code and configuration in a safe, auditable, repeatable manner?
These are not easy problems to solve. The rest of the book will be devoted to getting Java developers up and running with microservices and able to solve some of the problems listed. A complete set of how-tos for the preceding questions (and many others) should be addressed in a second edition of this book.