Arun Manglick - Architect View: December 2020

This post covers the parameters of a resilient application, which are essential goals of many modern architecture exercises. A well-designed app provides high availability and reliability, scales up and down as demand increases and decreases, and is resilient enough to withstand service disruptions. Building and operating apps that meet these requirements takes careful planning and design.

Availability
Scalability
Reliability
Performance
Reduce Communication Between Micro Services
Use Appropriate DB and Storage Technology
Caching

The sections below detail what each of these parameters requires.

"Before reading through the details below, note that multiple factors contribute to 'Resiliency of Microservices' -

Timely Timeouts - Do not wait indefinitely for a response. Time out promptly when none arrives, or the waiting callers will degrade the system.
Circuit Breakers - Stop making requests after a certain threshold of failures.
Bulkheads - Do not share a single thread pool across all outbound endpoints. Rather, use one thread pool per endpoint.
Steady State - Adhere to designs that allow your system to run in a steady state for a long time. This could be through Automated Deployments, Clearing Log Files so they do not grow indefinitely, Clearing Caches before they grow enormous etc.
Fail Fast - Decide early to fail if you know the request is going to fail or be rejected. E.g. Failed Node. Circuit Breakers can also be used to implement a Fail Fast Strategy.
Let it Crash - This strategy abandons a broken sub-system to preserve the overall stability of the system. E.g. Remove Failed Node, Remove Failed Endpoint etc.
Load Shedding - Drops some proportion of traffic as the server approaches overload conditions (e.g. Reduce Queue size, Introduce Caching etc).
Fallback - Sometimes a request is going to fail no matter how many times you retry. The Fallback policy lets you return some default or perform an action - like paging an admin, scaling a system or restarting a service.
Disaster Recovery - Replicate Services across multiple AZs (Availability Zones) to handle site failures.

I'll write up a separate post - Designing Microservices - to cover all these details with a bit more color."
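
As a rough illustration of Timely Timeouts and Bulkheads together, here is a minimal Python sketch. It assumes each outbound endpoint gets its own small thread pool and that the caller gives up after a fixed wait. The endpoint names, pool sizes and the helper call_with_timeout are hypothetical and exist only for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Bulkhead: one small pool per outbound endpoint (hypothetical names), so a
# slow endpoint can only exhaust its own workers.
POOLS = {
    "payments": ThreadPoolExecutor(max_workers=2),
    "inventory": ThreadPoolExecutor(max_workers=2),
}

def call_with_timeout(endpoint, func, timeout_s, fallback):
    """Run func on the endpoint's own pool and stop waiting after timeout_s."""
    future = POOLS[endpoint].submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The caller stops waiting; the worker thread itself is not killed.
        return fallback

def slow_payments():
    time.sleep(0.5)  # simulates a dependency that responds too slowly
    return "paid"

def fast_inventory():
    return "in stock"

slow_result = call_with_timeout("payments", slow_payments, 0.05, "payments unavailable")
fast_result = call_with_timeout("inventory", fast_inventory, 1.0, "inventory unavailable")
print(slow_result, "|", fast_result)
```

The slow payments call only ties up a worker in the payments pool, so inventory calls keep running even while payments is struggling.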

Availability:

Eliminate SPOF (Single Point of Failure)

Implement Geographic Redundancy - At least two copies of every key component
Fault Isolation Zone

Data Replication
Ensure Automatic Failover -

Retry Logic,
Circuit Breakers (stop making requests after a certain threshold of failures)

Load Balancing - With Health checks to avoid routing traffic to unhealthy nodes
Increase Monitoring - To detect failures early - Like New Relic, site24x7.com
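
The circuit breaker mentioned under Automatic Failover can be sketched in Python as below. This is a minimal version under stated assumptions, not a production implementation: the class name, the thresholds and the injectable clock are all invented for the example, and a real service would more likely use an existing resilience library.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so the demo needs no real waiting
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None                   # success closes the circuit
        return result

def flaky():
    raise ConnectionError("node down")

now = [0.0]  # fake clock for the demo
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0, clock=lambda: now[0])
outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")

now[0] = 11.0                           # cool-down has passed
recovered = breaker.call(lambda: "ok")  # trial call succeeds and closes the circuit
print(outcomes, recovered)
```

After two real failures the third call is rejected without touching the failed node, which is also how a breaker gives you Fail Fast behaviour.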

Reliability:

More or less similar to Availability, as higher Availability increases Reliability
In Addition

Use Micro Services
Prefer Horizontal Scaling over Vertical Scaling to reduce SPOF

Scalability:

Layered Architecture
Loosely Coupled Design/Components
Micro Services
Load Balancer - To route traffic to more available nodes.
Sharding

Horizontal Sharding - Splits rows across nodes by a shard key. Commonly built into No-SQL DBs
Vertical Sharding - Splits tables or columns across nodes. Commonly used with SQL DBs

Caching
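
To make Horizontal Sharding concrete, here is a small Python sketch that routes each row key to a shard with a stable hash. The shard names and the shard_for helper are hypothetical, and hash-modulo routing is just one simple scheme chosen for illustration.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]  # hypothetical shard names

def shard_for(key: str, shards=SHARDS) -> str:
    """Map a row key to a shard using a stable hash.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and would route the same key differently on restart.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]

placement = {user: shard_for(user) for user in ("alice", "bob", "carol", "dave")}
for user, shard in placement.items():
    print(user, "->", shard)
```

One design caveat: with plain modulo routing, changing the number of shards remaps most keys. Consistent hashing is the usual alternative when shards are added or removed often.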

Performance:

Asynchronous Programming
CDN (Content Delivery Network)
Caching
Prefer Static Content From Cache
Load Balancer - To Distribute Traffic Equally
Compress Data
Event Driven Designs
Message Bus
Reduced Image Size
Review SPs (Stored Procedures) for Best Execution Plan
Layered Architecture - Allows any particular layer to be boosted and scaled independently
Keep Instrumenting and Work on Weak Areas
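
As one example from the list above, Asynchronous Programming lets independent I/O calls overlap instead of running back to back. The Python sketch below uses asyncio with simulated delays. The three call names and the 0.1 second latency are made up for the example.

```python
import asyncio
import time

async def fetch(name: str, delay_s: float) -> str:
    await asyncio.sleep(delay_s)  # stands in for a non-blocking network call
    return f"{name} done"

async def fetch_all():
    # The three calls run concurrently, so total time is roughly the slowest
    # call rather than the sum of all three.
    return await asyncio.gather(
        fetch("profile", 0.1),
        fetch("orders", 0.1),
        fetch("recommendations", 0.1),
    )

start = time.perf_counter()
results = asyncio.run(fetch_all())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

asyncio.gather returns the results in the order the calls were passed in, regardless of which one finishes first.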

Reduce Over Communication b/w Micro-Services:

This is required to stop traffic to failed services and avoid cascading failures.
This can be done using

Circuit Breakers - Stop making requests after a certain threshold of failures
Fallback - Sometimes a request is going to fail no matter how many times you retry. The Fallback policy lets you return some default or perform an action - Like paging an admin, scaling a system or restarting a service.
Graceful Degradations -

Load shedding drops some proportion of load by dropping traffic as the server approaches overload conditions (e.g. Reduce Queue size). The goal is to keep the server from running out of RAM, failing health checks etc.
Graceful degradation - Takes the concept of load shedding one step further by reducing the amount of work that needs to be performed. In some applications, it’s possible to significantly decrease the amount of work or time needed by decreasing the quality of responses. For instance, a search application might only search a subset of data stored in an in-memory cache rather than the full on-disk database or use a less-accurate (but faster) ranking algorithm when overloaded.
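
The two ideas above can be sketched together in Python. This is a toy model under assumptions of my own: a bounded in-memory queue stands in for the server's backlog, and the Server class, its limits and the "cache-only" mode name are hypothetical.

```python
import queue

class Server:
    def __init__(self, max_pending=3, degrade_at=2):
        self.pending = queue.Queue(maxsize=max_pending)  # bounded backlog
        self.degrade_at = degrade_at

    def submit(self, request):
        """Load shedding: reject immediately when the bounded queue is full."""
        try:
            self.pending.put_nowait(request)
            return "accepted"
        except queue.Full:
            return "shed"

    def search_mode(self):
        """Graceful degradation: switch to the cheaper path as backlog builds."""
        return "cache-only" if self.pending.qsize() >= self.degrade_at else "full"

server = Server()
modes = []
statuses = []
for i in range(5):
    modes.append(server.search_mode())
    statuses.append(server.submit(f"req-{i}"))

print(statuses)
print(modes)
```

In this run nothing drains the queue, so the server first switches to the cheaper search mode and then starts shedding requests once the backlog limit is reached.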

Appropriate DB and Storage Technology

SQL Provides more Data Consistency
No-SQL provides more Scalability - Horizontal Scaling
If the application does not require all features of an RDBMS and can live with 'Eventual Consistency', No-SQL is recommended for Better Availability & Scalability

Caching Implementation:

Caching reduces Load and thus increases Scalability and Availability - By reducing reliance on Disk-Based Storage
Reduces Load on down-stream services, specifically DB calls
Increases Resiliency by Supporting Techniques like Graceful Degradation
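
A small Python sketch of the points above: a cache-aside helper with a TTL that skips the DB on a fresh hit and, as a form of graceful degradation, serves a stale entry when the DB call fails. The TTLCache class, the loader functions and the key format are hypothetical, and the clock is injectable only so the demo does not need to wait.

```python
import time

class TTLCache:
    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self.store = {}  # key -> (value, stored_at)

    def get_or_load(self, key, loader):
        entry = self.store.get(key)
        if entry is not None and self.clock() - entry[1] < self.ttl_s:
            return entry[0]          # fresh hit: no downstream call
        try:
            value = loader(key)
        except Exception:
            if entry is not None:
                return entry[0]      # degrade: serve stale data rather than fail
            raise
        self.store[key] = (value, self.clock())
        return value

db_calls = []

def load_from_db(key):
    db_calls.append(key)             # records each simulated DB round trip
    return f"row-for-{key}"

def broken_db(key):
    raise ConnectionError("database unavailable")

now = [0.0]  # fake clock for the demo
cache = TTLCache(ttl_s=60.0, clock=lambda: now[0])
first = cache.get_or_load("user:1", load_from_db)
second = cache.get_or_load("user:1", load_from_db)  # served from cache
now[0] = 120.0                                       # entry has expired
stale = cache.get_or_load("user:1", broken_db)       # DB down, stale value served
print(first, second, stale, db_calls)
```

Whether serving stale data is acceptable depends on the data. It suits catalog or profile reads far better than balances or inventory counts.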

A Few References:

Resilient App Patterns - https://cloud.google.com/solutions/scalable-and-resilient-apps
Graceful Degradation - https://sre.google/sre-book/addressing-cascading-failures/#xref_cascading-failure_load-shed-graceful-degredation

Application Design Practices:

Below are a few parameters worth considering while designing an application:

Single Responsibility
Open Closed Principles
Interface Segregation
Deep Error Catching
Retry Logic - Reference Polly Framework - https://www.pluralsight.com/blog/software-development/intro-to-polly
Evaluate Dependencies and Dependency Failures
Evaluate Scalability

Within Limit
Beyond Limit - Archival, Horizontal/Vertical Fragmentation

Content Delivery Network
Implement Automatic Failover -

Retry Logic,
Circuit Breakers (stop making requests after a certain threshold of failures)

High Cohesion & Loose Coupling
Layered Design & Separation of Concern
Implement Angular SPA & Avoid Round Trips
Avoid Long Running Threads
Avoid Unnecessary Exceptions
Prefer Stateless Services - Better Performance, No Server Affinity, More Scalable
Keep Session Size Low
Prefer No-SQL - To Enable Horizontal Scaling
Build Instrumentation
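
For the Retry Logic item in the list above, here is a minimal Python sketch of retry with exponential backoff plus a fallback value. It is not Polly's API (Polly is a .NET library); the retry_with_fallback helper, its parameters and the flaky_service function are hypothetical and only illustrate the idea.

```python
import time

def retry_with_fallback(func, attempts=3, base_delay_s=0.1, fallback=None, sleep=time.sleep):
    """Retry func with exponential backoff; return fallback if every attempt fails.

    Catching Exception keeps the sketch short. Real code would normally retry
    only on errors known to be transient (timeouts, connection resets).
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt < attempts - 1:
                sleep(base_delay_s * (2 ** attempt))  # 0.1s, 0.2s, ...
    return fallback

calls = {"n": 0}

def flaky_service():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("no response")
    return "response"

delays = []  # the demo records the waits instead of actually sleeping
result = retry_with_fallback(flaky_service, sleep=delays.append)
always_down = retry_with_fallback(lambda: 1 / 0, fallback="default", sleep=delays.append)
print(result, always_down, delays)
```

Adding random jitter to the delays is a common refinement, so that many clients do not all retry at the same instant.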

Hope this helps.

Arun Manglick

Tuesday, December 1, 2020

Resilient Application & Design Aspects