How to improve Reliability in the cloud?

Reliability is a derived concept. At its core, you want your application to be available to your customers. You should also ask what latency is acceptable: if the application is slower than normal, users should not have to keep retrying to finish a transaction. And how much fidelity should the application keep while it is slow and not rendering properly? To define reliability for your application, you therefore have to answer two questions: how available should it be, and what latency do you want from it?

So reliability is something the business has to define. You must talk to your customer and understand what kind of business they run, what problem they want to solve, and who the target users of the application are.

Reliability of your workload is a shared responsibility between you and your cloud provider. Platform reliability, covering the datacenter, network and other hardware infrastructure, is the cloud provider's responsibility. Application reliability, however, is the customer's responsibility.

To increase application reliability, you must do the following:

Make sure you choose the right infrastructure building blocks in the cloud to protect against the failures that affect reliability. Isolate the concerns and identify them as follows:

To protect against VM disk and OS issues, use Premium Storage in Azure. To protect against hardware failure, create your VMs inside an availability set. To protect against the failure of an entire datacenter, deploy your workload across multiple Availability Zones. To protect against natural disasters, deploy your workload in another region as well. In summary, choose the building blocks that match your requirements to improve reliability.

Start backing up your virtual machines and databases; in Azure you can enable backup for existing resources. Availability sets are different: an existing VM that is not in an availability set cannot simply be added to one. As a workaround, you have to rebuild those VMs and place them in an availability set when you recreate them.

To improve the reliability of an existing workload, follow the steps below for each use case:

First, talk to the customer to find out which business capabilities they want to be up, and for how much of the time. For example, the purchasing application, written in ASP.NET, needs to be up 99.99% of the time, which allows about 4 minutes of downtime per month. Once the business has defined this, study the design of the existing workload to see what it takes to achieve a 99.99% SLA.
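To make the availability target concrete, the downtime budget can be worked out with a few lines of Python. This is a minimal sketch; the function name `allowed_downtime_minutes` and the 30-day month are assumptions made for illustration.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Return the downtime budget in minutes for an availability target.

    Assumes a month of `days` days; the function name is illustrative.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# Compare a few common targets over a 30-day month.
for target in (99.9, 99.95, 99.99):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% -> {budget:.2f} minutes of downtime per month")
```

For 99.99% this comes to a little over 4 minutes per month, which matches the figure used in the example above.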

Designing for failure is the philosophy to adopt when designing applications in the cloud. Ask what happens if something fails. Take the architecture, look at the application piece by piece and apply failure mode analysis. Break your solution down into a series of key processes, for example the one where a customer searches for an item to buy.

So now let's ask: which components have to be working to make search possible?

For the architecture example below, these components must be up and running for search to work:

Write down each component's failure modes along with their effect, impact and likelihood, and work with your customer to learn more. This is a collaborative effort: the solutions architect and the customer must do it together to reach the right decisions. For the search use case, check what happens if AKS is unavailable, if the website is slow, and if Cosmos DB is unavailable for reads or for updates.
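One way to keep a failure mode analysis reviewable is to record it as data and rank it. The sketch below is only an illustration: the `FailureMode` class, the 1 to 5 scoring scale and every effect and score are placeholder assumptions that would have to be agreed with the customer.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    component: str
    failure: str
    effect: str      # placeholder wording, to be confirmed with the customer
    impact: int      # assumed scale: 1 (minor) to 5 (severe)
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)

    @property
    def risk(self) -> int:
        # Simple risk score: impact multiplied by likelihood.
        return self.impact * self.likelihood


# Failure modes for the search use case; scores are illustrative only.
search_failure_modes = [
    FailureMode("AKS", "cluster unavailable", "search requests fail", 5, 2),
    FailureMode("Website", "slow responses", "users retry or give up", 3, 3),
    FailureMode("Cosmos DB", "unavailable for reads", "no results returned", 5, 1),
    FailureMode("Cosmos DB", "unavailable for updates", "changes not saved", 3, 1),
]

# Highest risk first, so the discussion starts with what matters most.
ranked = sorted(search_failure_modes, key=lambda m: m.risk, reverse=True)
for mode in ranked:
    print(f"{mode.risk:>2}  {mode.component}: {mode.failure} -> {mode.effect}")
```

Sorting by a simple impact times likelihood score is a design choice, not the only option; the point is that the table gives you and the customer a shared list to argue about.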

The AKS SLA is 99.95% and the Cosmos DB SLA is 99.99%.

So the overall composite SLA of AKS and Cosmos DB is 99.99% x 99.95% = 99.94%.

Composite SLA = 0.99999975 x 0.9999 x 0.99999 = 0.99989, or roughly 99.99%.
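The arithmetic above can be checked with a short Python sketch. It assumes component failures are independent, and the helper names `serial_sla` and `redundant_sla` are made up for illustration. The factor 0.99999975 is what you get for two independent deployments of a 99.95% service where either one is enough; reading it as a two-region AKS setup is an assumption, since the article does not spell it out.

```python
from math import prod


def serial_sla(*slas: float) -> float:
    """Composite availability when every component must be up."""
    return prod(slas)


def redundant_sla(sla: float, copies: int) -> float:
    """Availability when any one of `copies` independent deployments suffices."""
    return 1 - (1 - sla) ** copies


aks = 0.9995     # AKS SLA from the text
cosmos = 0.9999  # Cosmos DB SLA from the text

# Both components are required for search to work.
single_region = serial_sla(aks, cosmos)
print(f"AKS + Cosmos DB: {single_region:.4%}")

# Two independent 99.95% deployments, either of which can serve traffic.
aks_redundant = redundant_sla(aks, 2)
print(f"Redundant AKS: {aks_redundant:.8f}")

# The three factors quoted in the text, multiplied together.
composite = serial_sla(0.99999975, 0.9999, 0.99999)
print(f"Composite: {composite:.4%}")
```

Chaining components in series always lowers the composite SLA, while adding redundancy raises it, which is why the second figure ends up higher than the first.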

Create alert rules that specify conditions. For actions, you can notify people by SMS, or create action groups that execute Azure Functions.
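The relationship between a condition and its actions can be sketched in plain Python. This is not the Azure Monitor API: `AlertRule`, `notify_sms`, `run_remediation` and the `availability_percent` metric are hypothetical names used only to show the shape of a rule.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AlertRule:
    """Hypothetical alert rule: a condition plus the actions to run when it fires."""
    name: str
    condition: Callable[[Dict[str, float]], bool]
    actions: List[Callable[[str], None]] = field(default_factory=list)

    def evaluate(self, metrics: Dict[str, float]) -> bool:
        fired = self.condition(metrics)
        if fired:
            for action in self.actions:
                action(self.name)
        return fired


sent: List[str] = []  # stands in for real SMS and function calls


def notify_sms(rule_name: str) -> None:
    sent.append(f"SMS: {rule_name} fired")


def run_remediation(rule_name: str) -> None:
    sent.append(f"function: handling {rule_name}")


rule = AlertRule(
    name="search-availability",
    condition=lambda m: m["availability_percent"] < 99.99,
    actions=[notify_sms, run_remediation],
)

rule.evaluate({"availability_percent": 99.5})
print(sent)
```

In a real setup the condition would be evaluated by the monitoring service, and the actions would be configured there rather than written as local functions.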

In addition to backup, many customers focus on disaster recovery and on making sure workloads are redundant, not just backed up. Outages can be planned or unplanned. A planned outage might be an application upgrade or a disaster recovery test; an unplanned one might be a natural disaster. So introduce resilience by keeping backups accessible in another region and bringing them online as quickly as possible, so that downtime is reduced as far as possible.

For production workloads, especially big data, SQL and other transactional databases, upgrade to premium disks instead of the standard disk offering. Premium disks also offer bursting capabilities.

Bursting means the disk can deliver higher performance under heavy load.

Unmanaged disks are disks where you have to set the size and throughput limit yourself and add more resources to scale.

With managed disks, however, the following processes are automated for you by your cloud provider:

You need a long-term learning process. One option is to wait for the next disaster or unexpected event and learn from it. The other is to simulate or force failures and see what happens: for example, stop AKS and check whether your redundancy works. Take an experimental approach, introduce real-world scenarios and see how the workload behaves. On a game day, set up a scenario in which one team generates unpredictable load in unusual ways while another team observes: are alerts coming in, and what is the status of the redundant machines?

To increase the reliability of your customer's workload, speak with them and understand the critical use cases they care about most. Follow the steps suggested in this article to reach the reliability state the business has defined, doing an in-depth failure mode analysis for each important use case, one by one. You may have to teach your customer about premium disks for VMs and the SLA improvement that comes with upgrading to them. Explain the difference between managed and unmanaged disks as well: migrating to managed disks can increase resilience, since a managed disk is automatically replicated across three hardware devices. Consider replicating virtual machines to protect them from regional outages. Set up Azure Virtual Machine backup for improved reliability and to protect against human error or database corruption. Also make sure you set up alerts for your applications that send you an email or create Zendesk tickets whenever an application is down. I hope you now have a good understanding of how to improve reliability.

Thanks for reading to the end. I hope you learned something today. If you enjoyed this article, please share it with your friends, and if you have suggestions or thoughts, please leave them in the comment box.
