Day 05: What Are Azure Data Centers and Service-Level Agreements (SLAs)?

"I'm a 3rd-year Computer Engineering student at Marwadi University with skills in C++, web development (MERN stack), and DevOps tools like Kubernetes. I contribute to open-source projects and share tech knowledge on GitHub and LinkedIn. I'm learning cloud technologies and app deployment. As an Internshala Student Partner, I help others find jobs and courses." now currently focusing on #90DaysOfDevops
Azure Data Centers
Azure provides more than 100 redundant & secure facilities worldwide linked with a network.
Allows you to
gain global reach with local presence
keep your data secure and compliant with local laws
You can pick the region and sometimes availability zone you want resources deployed into.
- ❗You can't select a specific datacenter or location within a datacenter.
Regions
Region = A geographical area containing at least one, but often multiple, datacenters that are nearby and networked together with a low-latency network.
Azure assigns and controls the resources within each region to ensure workloads are appropriately balanced.
E.g. West US, Canada Central, West Europe, Australia East, and Japan West.
❗Some services or virtual machine features are only available in certain regions, such as specific virtual machine sizes or storage types.
[Figure: Azure regions as of February 2020]
💡Regions provide better scalability and redundancy, and preserve data residency for your services.
Read more: Azure regions
Special regions
For compliance or legal purposes.
Azure Government
US DoD Central, US Gov Virginia, US Gov Iowa and more
📝 Physical and logical network-isolated instances of Azure for US government agencies and partners.
Azure China
China East, China North and more
Unique partnership between Microsoft and 21Vianet
Microsoft does not directly maintain the datacenters.
Geographies
Each region belongs to a single geography
Defined by geopolitical boundaries or country borders.
Has specific service availability, compliance, and data residency/sovereignty rules applied to it
Fault-tolerant to withstand complete region failure through their connection to dedicated networking infrastructure
- 📝 Fault tolerance: An app's ability to self-detect and correct all types of problems in its environment
Data residency
Defines the legal or regulatory requirements imposed on data
Based on the country or region in which it resides
💡 An important consideration when planning out your application data storage.
Geographies are broken up into the following areas
Americas
Europe
Asia Pacific
Middle East and Africa
Read more: Azure geographies
Availability Zones
📝 Physically separate datacenters within an Azure region.
💡 Allows you to make applications highly available through redundancy.
Replicate your compute, storage, networking, and data resources in other zones.
Costs more
Primarily for VMs, managed disks, load balancers, and SQL databases
Zonal services: Pin resource to a specific zone.
Zone-redundant services: Replicates automatically across zones.
Have independent power, cooling, and networking
Set up to be an isolation boundary
- If one zone goes down, the other continues working
Identified by the numbers 1, 2, and 3
Logically mapped to the actual physical zones for each subscription independently.
Availability Zone 1 in a given subscription might refer to a different physical zone than Availability Zone 1 in a different subscription.
Connected through high-speed, private fiber-optic networks.
❗There are regions that do not support (multiple) availability zones
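The logical-to-physical zone mapping described above can be pictured with a small lookup. This is only an illustrative sketch: the subscription names, the physical zone labels, and the mapping itself are made up for the example and do not reflect how Azure actually assigns zones.

```python
# Hypothetical mapping: logical zone number -> physical zone label, per subscription.
# Real mappings are assigned by Azure; these values are invented for illustration.
ZONE_MAPPINGS = {
    "subscription-a": {1: "physical-zone-x", 2: "physical-zone-y", 3: "physical-zone-z"},
    "subscription-b": {1: "physical-zone-z", 2: "physical-zone-x", 3: "physical-zone-y"},
}


def physical_zone(subscription: str, logical_zone: int) -> str:
    """Resolve a logical zone number to the physical zone for one subscription."""
    return ZONE_MAPPINGS[subscription][logical_zone]


def same_physical_zone(sub_a: str, zone_a: int, sub_b: str, zone_b: int) -> bool:
    """True if two (subscription, logical zone) pairs land in the same physical zone."""
    return physical_zone(sub_a, zone_a) == physical_zone(sub_b, zone_b)


# "Zone 1" in two subscriptions is not necessarily the same physical zone.
print(same_physical_zone("subscription-a", 1, "subscription-b", 1))
# Different logical numbers can point at the same physical zone.
print(same_physical_zone("subscription-a", 1, "subscription-b", 2))
```

The takeaway is that a zone number alone is not enough to reason about co-location across subscriptions.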
Region Pairs
Each Azure region is typically paired with another region within the same geography
- E.g. West US paired with East US, and Southeast Asia paired with East Asia
Paired regions are at least 300 miles (≈ 480 km) apart.
Allows for the replication of resources, e.g. virtual machine storage
- Some services offer automatic geo-redundant storage using region pairs.
Reduce the likelihood of interruptions to both regions
- E.g. natural disasters, civil unrest, power outages, or physical network outages
If one region fails, services automatically fail over to the other region in its region pair.
Data continues to reside within the same geography as its pair (except for Brazil South) for tax and law enforcement jurisdiction purposes.
If there's an extensive Azure outage =>
- One region out of every pair is prioritized to make sure at least one is restored as quickly as possible.
Planned Azure updates are rolled out to paired regions one region at a time to minimize downtime and risk of application outage.
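As a rough sketch of how an application might look up its failover target, here is a tiny pair table. It only contains the two example pairs mentioned in these notes; `REGION_PAIRS` and `failover_region` are hypothetical names, not part of any Azure SDK.

```python
from typing import Optional

# Only the two pairs named in the notes above; this is not a complete list.
_PAIRS = [("West US", "East US"), ("Southeast Asia", "East Asia")]

# Build a symmetric lookup so each region points at its partner.
REGION_PAIRS = {}
for first, second in _PAIRS:
    REGION_PAIRS[first] = second
    REGION_PAIRS[second] = first


def failover_region(primary: str) -> Optional[str]:
    """Return the paired region for `primary`, or None if it is not in this table."""
    return REGION_PAIRS.get(primary)


print(failover_region("West US"))
print(failover_region("East Asia"))
print(failover_region("Brazil South"))  # not in this small table
```

In a real setup you would take the pair list from the official Azure documentation rather than hard-coding it like this.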
Service-level Agreements (SLA)
Formal documents that define the performance standards that apply to Azure.
They also specify what happens if a service or product fails to perform to its governing SLA's specification.
There are SLAs for individual Azure products and services.
❗ Azure does not provide SLAs for most services under the Free or Shared tiers
- e.g. Azure Advisor
Three key characteristics of SLAs for Azure products and services:
Performance Targets
Specific to each Azure product and service.
E.g. uptime guarantees or connectivity rates
Uptime and Connectivity Guarantees
📝 Monthly Uptime % = (Maximum Available Minutes − Downtime) / Maximum Available Minutes × 100
📝 Range from 99.9% ("three nines") to 99.999% ("five nines") for any paid tier service.
- In other words, the minimum SLA for non-free Azure services is typically 99.9%
E.g. Azure Cosmos DB (Database) service SLA offers 99.999 percent uptime
meaning it allows for about 5 minutes of total downtime per year.
also includes low-latency commitments of less than 10 milliseconds on DB read + write operations.
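To make the uptime formula and the "nines" concrete, here is a small sketch that computes the monthly uptime percentage and the downtime budget an SLA implies. The function names and the 30-day month are assumptions made for this example.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 (assumes a 30-day month)
MINUTES_PER_YEAR = 365 * 24 * 60         # 525,600 (ignores leap years)


def monthly_uptime_percent(max_available_minutes: float, downtime_minutes: float) -> float:
    """Monthly Uptime % = (max available minutes - downtime) / max available minutes * 100."""
    if max_available_minutes <= 0:
        raise ValueError("max_available_minutes must be positive")
    return (max_available_minutes - downtime_minutes) / max_available_minutes * 100


def allowed_downtime_minutes(sla_percent: float, period_minutes: float) -> float:
    """Downtime budget (in minutes) implied by an SLA percentage over a period."""
    return period_minutes * (1 - sla_percent / 100)


# 43.2 minutes of downtime in a 30-day month is roughly 99.9% uptime.
print(round(monthly_uptime_percent(MINUTES_PER_30_DAY_MONTH, 43.2), 3))

# Yearly downtime budget for three, four, and five nines.
for sla in (99.9, 99.99, 99.999):
    print(sla, round(allowed_downtime_minutes(sla, MINUTES_PER_YEAR), 2), "min/year")
```

For 99.999% this gives about 5.26 minutes per year, which matches the "about 5 minutes" figure above.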
📝 Service credits
Given to paying Azure customers if uptime percentage is lower than given in SLA.
Describe how Microsoft will respond if an Azure product or service fails to perform to its governing SLA's specification.
E.g. customers may have a discount applied to their Azure bill, as compensation for an under-performing Azure product or service.
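A service credit schedule can be thought of as a tiered lookup on the measured uptime. The sketch below uses made-up tiers purely for illustration; the real thresholds and credit percentages differ per service and are listed in each service's SLA document.

```python
# Hypothetical tiers: (uptime threshold in %, credit as % of the monthly bill).
# These numbers are assumptions for the example, not taken from any specific Azure SLA.
HYPOTHETICAL_CREDIT_TIERS = [(99.0, 25.0), (99.9, 10.0)]


def service_credit_percent(monthly_uptime_percent: float,
                           tiers=HYPOTHETICAL_CREDIT_TIERS) -> float:
    """Return the credit for the lowest threshold the measured uptime falls below."""
    for threshold, credit in sorted(tiers):
        if monthly_uptime_percent < threshold:
            return credit
    return 0.0  # SLA met, no credit


for uptime in (99.95, 99.5, 98.0):
    print(uptime, "->", service_credit_percent(uptime), "% credit")
```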
Read more: SLA Summary for Azure Services
Composite SLA
Result of combining SLAs across different service offerings.
📝 Calculating downtime
E.g. web app (99.95% SLA from Azure) writes to SQL database (99.99% SLA from Azure)
Composite SLA = 99.95 percent × 99.99 percent ≈ 99.94 percent
(0.9995 × 0.9999 ≈ 0.9994)
Means combined probability of failure is higher than the individual SLA values
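The multiplication above can be sketched as a small helper for services that all have to be up at the same time. `composite_sla_serial` is a name chosen for this example, and the calculation assumes the services fail independently.

```python
from math import prod


def composite_sla_serial(*slas_percent: float) -> float:
    """Composite SLA (in percent) when every listed service must be available.

    Assumes failures are independent, so availabilities multiply.
    """
    return prod(s / 100 for s in slas_percent) * 100


web_app = 99.95
sql_db = 99.99
composite = composite_sla_serial(web_app, sql_db)

print(f"Composite SLA: {composite:.2f}%")
# The composite is lower than either individual SLA.
print(composite < min(web_app, sql_db))
```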
You can improve the composite SLA by creating independent fallback paths.
E.g. if the SQL Database is unavailable, you can put transactions into a queue for processing at a later time.
Web app (99.95%) writes to either SQL Database (99.99%) or queue (99.9%)
Application is still available even if it can't connect to the database.
- ❗But it fails if both the database and the queue fail simultaneously.
If the expected percentage of time for a simultaneous failure is 0.0001 × 0.001
the composite SLA for this combined path of a database or queue would be:
1.0 − (0.0001 × 0.001) = 99.99999 percent
If we add the queue to our web app, the total composite SLA is:
99.95 percent × 99.99999 percent = ~99.95 percent
Improves SLA but application logic gets more complicated
- You are paying more to add the queue support and there may be data-consistency issues you'll have to deal with due to retry behavior.
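Here is a minimal sketch of the fallback calculation, assuming the database and the queue fail independently. The helper names are hypothetical, and the SLA values are the ones used in the example above.

```python
def fallback_sla(*slas_percent: float) -> float:
    """SLA (in percent) of a path that works if ANY of the services is available.

    The path fails only when all services fail at once (independence assumed).
    """
    p_all_fail = 1.0
    for sla in slas_percent:
        p_all_fail *= 1 - sla / 100
    return (1 - p_all_fail) * 100


def serial_sla(*slas_percent: float) -> float:
    """SLA (in percent) of a chain where EVERY service must be available."""
    result = 1.0
    for sla in slas_percent:
        result *= sla / 100
    return result * 100


db_or_queue = fallback_sla(99.99, 99.9)          # database OR queue
with_queue = serial_sla(99.95, db_or_queue)      # web app AND (database OR queue)
without_queue = serial_sla(99.95, 99.99)         # web app AND database

print(f"Database or queue: {db_or_queue:.5f}%")
print(f"With queue:        {with_queue:.3f}%")
print(f"Without queue:     {without_queue:.3f}%")
```

The gain is small in absolute terms, so it has to be weighed against the extra cost and complexity noted above.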
Application SLA
By creating your own SLAs, you can set performance targets to suit your specific Azure application.
💡 >= four 9's (99.99%) SLA performance targets =>
manual intervention to recover from failures may not be quick enough
should have self-diagnosing & self-healing solutions.
Resiliency
Resiliency is the ability of a system to recover from failures and continue to function.
High availability and disaster recovery are two crucial components of resiliency
- 📝 Disaster recovery: When Godzilla destroys your data center, you do have alternative locations to keep providing your service and protocols/means for the other location to know how to keep delivering the service.
Failure Mode Analysis (FMA)
Goal:
Identify possible points of failure.
Define how the application will respond to those failures.
Read more: Designing resilient applications for Azure
High availability
📝 Availability is often given as percentage uptime
Refers to the time that a system is functional and working.
Most providers prefer to maximize the availability of their Azure solutions by minimizing downtime.
❗ As you increase availability, you also increase the cost and complexity of your solution.
As your solution grows in complexity, you will have more services depending on each other.
You might overlook possible failure points in your solution if you have any interdependent services.
💡E.g. a workload that requires 99.99 percent uptime shouldn't depend upon a service with a 99.9 percent SLA.
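The rule of thumb above can be turned into a quick sanity check over a workload's dependencies. This is just a sketch: the dependency names and SLA values are hypothetical.

```python
def weak_dependencies(target_sla_percent: float, dependency_slas: dict) -> dict:
    """Return the dependencies whose SLA is below the workload's target."""
    return {
        name: sla
        for name, sla in dependency_slas.items()
        if sla < target_sla_percent
    }


# Hypothetical dependencies of a workload that targets 99.99% uptime.
dependencies = {"sql-database": 99.99, "queue": 99.9}
weak = weak_dependencies(99.99, dependencies)

print(weak)  # dependencies that cannot back a 99.99% target on their own
```

Anything this flags either needs a fallback path or a more realistic target for the workload.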
Read more: Availability choices for Azure compute




