This post covers the parameters of a resilient application, which are essential goals of many modern architecture exercises. A well-designed app provides high availability and reliability, scales up and down as demand increases and decreases, and is resilient enough to withstand service disruptions. Building and operating apps that meet these requirements takes careful planning and design.
- Availability
- Scalability
- Reliability
- Performance
- Reduce Communication Between Micro Services
- Use Appropriate DB and Storage Technology
- Caching
The sections below detail what each of these parameters requires.
"Before reading further, note that several factors contribute to 'Resiliency of Microservices':
- Timely Timeouts - Do not wait indefinitely for a response. Set a timeout, otherwise stalled calls will degrade the system.
- Circuit Breakers - Stop making requests after a certain threshold of failures.
- Bulkheads - Do not share a single thread pool across all outbound endpoints. Use a separate thread pool per endpoint instead.
- Steady State - Adhere to designs that allow your system to run in a steady state for a long time. This could be through automated deployments, clearing log files so they do not grow indefinitely, clearing caches before they grow too large, etc.
- Fail Fast - Decide to fail early if you know the request is going to fail or be rejected (e.g. a failed node). Circuit Breakers can also be used to implement a Fail Fast strategy.
- Let it Crash - This strategy abandons a broken sub-system to preserve the overall stability of the system (e.g. remove a failed node, remove a failed endpoint).
- Load Shedding - Drops some proportion of traffic as the server approaches overload conditions (e.g. reduce queue size, introduce caching).
- Fallback - Sometimes a request is going to fail no matter how many times you retry. The Fallback policy lets you return some default or perform an action, like paging an admin, scaling a system or restarting a service.
- Disaster Recovery - Replicate services across multiple AZs (Availability Zones) to handle site failures.
I'll write up a separate post, Designing Microservices, to cover all of these in more detail."
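To make a few of the factors above concrete, here is a minimal Python sketch (not a production implementation) that combines Bulkheads, Timely Timeouts and Fallback. The `Bulkhead` class, the endpoint names and the `slow_inventory` / `fast_pricing` functions are all hypothetical, made up for illustration. It uses only the standard `concurrent.futures` module.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class Bulkhead:
    """Hypothetical helper: one small thread pool per outbound endpoint."""

    def __init__(self, endpoints, workers_per_endpoint=2):
        # A separate pool per endpoint, so one slow dependency cannot
        # use up the threads needed to call the others.
        self._pools = {
            name: ThreadPoolExecutor(max_workers=workers_per_endpoint)
            for name in endpoints
        }

    def call(self, endpoint, fn, *args, timeout=1.0, fallback=None):
        future = self._pools[endpoint].submit(fn, *args)
        try:
            # Timely timeout: never wait indefinitely for a response.
            return future.result(timeout=timeout)
        except FutureTimeout:
            # cancel() only helps if the task has not started yet; a running
            # call keeps its worker thread busy until it returns.
            future.cancel()
            # Fallback: return a default instead of propagating the failure.
            return fallback

    def shutdown(self):
        for pool in self._pools.values():
            pool.shutdown(wait=False)


def slow_inventory(sku):
    time.sleep(0.3)  # simulates a dependency that is responding slowly
    return {"sku": sku, "in_stock": True}


def fast_pricing(sku):
    return {"sku": sku, "price": 9.99}


bulkhead = Bulkhead(["inventory", "pricing"])
inventory = bulkhead.call("inventory", slow_inventory, "A1",
                          timeout=0.05, fallback={"sku": "A1", "in_stock": None})
pricing = bulkhead.call("pricing", fast_pricing, "A1")
bulkhead.shutdown()
```

Here the slow inventory call times out and the caller gets the fallback value, while the pricing call is served from its own pool and is unaffected.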
Availability:
- Eliminate SPOF (Single Point of Failure)
- Implement Geographic Redundancy - At least two copies of every key component
- Fault Isolation Zone
- Data Replication
- Ensure Automatic Failover
  - Retry Logic
  - Circuit Breakers (stop making requests after a certain threshold of failures)
- Load Balancing - With health checks to avoid sending traffic to unhealthy nodes
- Increase Monitoring - To detect failures early, e.g. New Relic, site24x7.com
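The circuit breaker mentioned above can be sketched roughly as follows. This is a simplified illustration, not a reference implementation: the class name `CircuitBreaker`, its parameters, the `CircuitOpenError` exception and the injectable `clock` are all assumptions made for the example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so the sketch is testable
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"       # allow one trial request through
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            # Fail fast instead of sending traffic to a failing service.
            raise CircuitOpenError("circuit open; request rejected")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open if the
            # half-open trial request failed.
            if (self._failures >= self.failure_threshold
                    or self._opened_at is not None):
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound request, e.g. `breaker.call(fetch_orders, customer_id)` (where `fetch_orders` is whatever function makes the remote call), and treat `CircuitOpenError` as a signal to use a fallback.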
Reliability:
- Largely the same as Availability, since higher availability improves reliability
- In addition:
  - Use Micro Services
  - Prefer horizontal scaling over vertical scaling to reduce SPOF
Scalability:
- Layered Architecture
- Loosely Coupled Design/Components
- Micro Services
- Load Balancer - To route traffic to more available nodes.
- Sharding
  - Horizontal Sharding - Splits rows across nodes by a shard key; built into many No-SQL DBs
  - Vertical Sharding - Splits tables or columns across databases; commonly applied to SQL DBs
- Caching
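As a rough illustration of horizontal sharding, the sketch below routes each record to a shard by hashing its key. The function name `shard_for`, the shard count and the sample key are assumptions for illustration. Real databases typically use more elaborate schemes (for example consistent hashing or range-based partitioning) so that adding shards moves less data.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a record key to a shard index in the range [0, shard_count)."""
    # A stable hash (unlike Python's built-in hash(), which is salted per
    # process for strings) so every app server picks the same shard.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# The same key always lands on the same shard.
shard = shard_for("customer-42", shard_count=4)
```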
Performance:
- Asynchronous Programming
- CDN (Content Delivery Network)
- Caching
- Prefer Static Content From Cache
- Load Balancer - To Distribute Traffic Equally
- Compress Data
- Event Driven Designs
- Message Bus
- Reduce Image Size
- Review SPs (Stored Procedures) for the Best Execution Plan
- Layered Architecture - Allows any particular layer to be scaled independently
- Keep Instrumenting and Work on Weak Areas
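To illustrate the Asynchronous Programming point, here is a small sketch using Python's standard `asyncio`. The `fetch` coroutine and the three service names are hypothetical, and `asyncio.sleep` stands in for real network I/O. The idea is that independent calls run concurrently, so the total wait is close to the slowest call rather than the sum of all calls.

```python
import asyncio


async def fetch(name: str, delay: float) -> str:
    # Stand-in for a non-blocking network call to another service.
    await asyncio.sleep(delay)
    return f"{name}: done"


async def load_dashboard() -> list:
    # The three calls are independent, so issue them concurrently.
    return await asyncio.gather(
        fetch("profile", 0.1),
        fetch("orders", 0.1),
        fetch("recommendations", 0.1),
    )
```

Calling `asyncio.run(load_dashboard())` returns the three results in the order the calls were listed.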
Reduce Over-Communication Between Micro Services:
- This is required to stop traffic to failed services and avoid cascading failures
- This can be done using:
  - Circuit Breakers - Stop making requests after a certain threshold of failures
  - Fallback - Sometimes a request is going to fail no matter how many times you retry. The Fallback policy lets you return some default or perform an action, like paging an admin, scaling a system or restarting a service.
  - Graceful Degradation
    - Load shedding - Drops some proportion of load by dropping traffic as the server approaches overload conditions (e.g. reduce queue size). The goal is to keep the server from running out of RAM, failing health checks, etc.
    - Graceful degradation - Takes the concept of load shedding one step further by reducing the amount of work that needs to be performed. In some applications, it’s possible to significantly decrease the amount of work or time needed by decreasing the quality of responses. For instance, a search application might only search a subset of data stored in an in-memory cache rather than the full on-disk database, or use a less accurate (but faster) ranking algorithm when overloaded.
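Load shedding and graceful degradation, as described above, can be sketched like this. The `SearchService` class, its thresholds and the in-memory "index" lists are hypothetical and exist only to show the decision logic. A real service would measure load differently (queue depth, CPU, latency) and would track in-flight requests in a thread-safe way.

```python
from dataclasses import dataclass


@dataclass
class SearchService:
    full_index: list        # stand-in for the full on-disk database
    hot_cache: list         # stand-in for the smaller in-memory subset
    degrade_at: int = 70    # in-flight requests at which quality is reduced
    shed_at: int = 100      # in-flight requests at which traffic is dropped
    in_flight: int = 0

    def mode(self) -> str:
        if self.in_flight >= self.shed_at:
            return "shed"
        if self.in_flight >= self.degrade_at:
            return "degraded"
        return "full"

    def search(self, term: str):
        mode = self.mode()
        if mode == "shed":
            # Load shedding: reject cheaply rather than risk falling over.
            return 503, []
        # Graceful degradation: do less work by searching only the cache.
        corpus = self.hot_cache if mode == "degraded" else self.full_index
        self.in_flight += 1
        try:
            return 200, [doc for doc in corpus if term in doc]
        finally:
            self.in_flight -= 1
```

Under normal load the service searches everything; as load climbs it first returns cheaper, lower-quality results and only then starts rejecting requests.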
Use Appropriate DB and Storage Technology:
- SQL provides more Data Consistency
- No-SQL provides more Scalability - Horizontal Scaling
- If the application does not require all the features of an RDBMS and can live with 'Eventual Consistency', No-SQL is recommended for better Availability & Scalability
Caching Implementation:
- Caching reduces load and thus increases Scalability and Availability by reducing reliance on Disk-Based Storage
- Reduces Load on down-stream services, specifically DB calls
- Increases Resiliency by Supporting Techniques like Graceful Degradation
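The points above can be sketched as a small read-through cache that also serves stale data when the downstream call fails, which is one simple form of graceful degradation. The `TTLCache` class, its `loader` callback and the injectable `clock` are assumptions made for this example, not the API of any particular caching library.

```python
import time


class TTLCache:
    def __init__(self, loader, ttl=60.0, clock=time.monotonic):
        self._loader = loader     # function that hits the DB or service
        self.ttl = ttl
        self._clock = clock       # injectable so the sketch is testable
        self._entries = {}        # key -> (value, time stored)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]       # fresh hit: no downstream call at all
        try:
            value = self._loader(key)
        except Exception:
            if entry is not None:
                # Downstream is failing: degrade to the stale value.
                return entry[0]
            raise                 # nothing cached, so surface the error
        self._entries[key] = (value, now)
        return value
```

Repeated reads within the TTL never reach the database, and an expired entry is still returned if the database is unreachable at refresh time.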
A Few References:
- Resilient App Patterns - https://cloud.google.com/solutions/scalable-and-resilient-apps
- Graceful Degradation - https://sre.google/sre-book/addressing-cascading-failures/#xref_cascading-failure_load-shed-graceful-degredation
Application Design Practices:
Below are a few parameters worth considering while designing an application:
- Single Responsibility
- Open/Closed Principle
- Interface Segregation
- Deep Error Catching
- Retry Logic - See the Polly framework - https://www.pluralsight.com/blog/software-development/intro-to-polly
- Evaluate Dependencies and Dependency Failures
- Evaluate Scalability
  - Within Limit
  - Beyond Limit - Archival, Horizontal/Vertical Fragmentation
- Content Delivery Network
- Implement Automatic Failover
  - Retry Logic
  - Circuit Breakers (stop making requests after a certain threshold of failures)
- High Cohesion & Loose Coupling
- Layered Design & Separation of Concerns
- Implement Angular SPA & Avoid Round Trips
- Avoid Long Running Threads
- Avoid Unnecessary Exceptions
- Prefer Stateless Services - Better Performance, No Server Affinity, More Scalable
- Keep Session Size Low
- Prefer No-SQL - To Enable Horizontal Scaling
- Build Instrumentation
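For the retry logic item in the list above, here is a minimal Python sketch of retrying with exponential backoff and jitter. Polly itself is a .NET library, so this is not its API; the `retry` function and its parameters are hypothetical and only illustrate the general idea.

```python
import random
import time


def retry(fn, attempts=3, base_delay=0.1, max_delay=2.0,
          retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying failures up to `attempts` times in total."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller (or a fallback) decide
            # Exponential backoff: 0.1s, 0.2s, 0.4s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter so many clients do not all retry at the same moment.
            sleep(random.uniform(0, delay))
```

In practice `retry_on` should be narrowed to errors that are likely to be transient (for example `ConnectionError`), since retrying a request that can never succeed only adds load.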
Hope this helps.
Arun Manglick