Saturday, August 26, 2017

Companies waste so much money in the cloud

Cloud services are what made DevOps possible: they let you provision infrastructure and services dynamically, via scripts, and that is what makes continuous delivery possible. Cloud services also make it possible to scale dynamically, and that is what lets organizations economically deploy Internet-scale applications without having to operate their own data centers.

Still, any tool can be misused. Cloud services enable one to scale economically, adjusting application size based on demand; but cloud services can also be overused, resulting in huge unnecessary costs. We have identified some of the ways this typically happens, with the goal of helping people right-size their cloud deployments.

Pay close attention to your development services and instances

Software development teams generally do not pay for their cloud accounts, and so they have no incentive to manage the cost of those accounts. For example, it is not unusual for development teams to create VMs and leave them running, resulting in a large set of abandoned VMs that quietly run up the bill.

To avoid misuse of cloud services by development teams, you need someone who proactively manages those resources. Don’t take away the ability of teams to create resources in their cloud accounts; instead, have someone who keeps track of what the programmers are doing, and what they are using the cloud resources for. That person should continually review resource usage and ask, “What is this for? Do we need this?” Also, set up standards that make that job easier: e.g., establish a rule that all VMs must be tagged with their purpose and, if the VMs exist in a multi-project environment, also with their project code - which makes cost accounting much easier. (Scaled Markets includes setting up such a tagging scheme as part of its Continuous Delivery Accelerator offering - see Accelerators.)
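
To make this concrete, here is a minimal Python sketch of such a tag audit, assuming AWS as the provider and using its boto3 SDK; the tag keys Purpose and ProjectCode are our own illustrative choices, not a standard:

    import boto3

    # Tag keys we require on every VM (illustrative names).
    REQUIRED_TAGS = {"Purpose", "ProjectCode"}

    ec2 = boto3.client("ec2")
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}])

    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tags = {t["Key"] for t in instance.get("Tags", [])}
                missing = REQUIRED_TAGS - tags
                if missing:
                    # Flag the VM for the resource manager to follow up on.
                    print(instance["InstanceId"], "is missing tags:",
                          ", ".join(sorted(missing)))

Run nightly, a report like this tells the resource manager exactly which running VMs nobody has accounted for.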

Beware third-party services

Today, it is possible to outsource much of an application’s functionality using so-called platform-as-a-service (PaaS) offerings, provided either by the cloud provider or by a third party. For example, instead of building your own monitoring, you can use a third-party monitoring service to verify that your app is still up. While this is convenient, it generally costs money for each check that gets performed, based on the frequency - so your app now has another taxi meter running, eating into your profits.

As another example, most cloud providers now offer scalable messaging services that can feed your applications with event messages from IoT devices or from other applications. While the scalability of those services is a powerful value proposition, they are generally paid for per message - and so the bill can be quite large.

Some PaaS services are priced based on their footprint rather than on their usage. Azure’s HDInsight service is an example: you pay a fixed amount per hour for an HDInsight cluster, regardless of whether you are using it or not. Also, since it takes quite a while to start up an HDInsight cluster, it is not something that you can just create on demand.
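
To see how fast per-message pricing adds up, consider a back-of-the-envelope calculation (the traffic volume and unit price below are made-up numbers, for illustration only):

    # Hypothetical traffic and pricing - illustration only.
    msgs_per_sec = 5_000
    price_per_million_msgs = 0.50         # USD, assumed unit price
    seconds_per_month = 60 * 60 * 24 * 30

    monthly_msgs = msgs_per_sec * seconds_per_month   # ~13 billion messages
    monthly_cost = (monthly_msgs / 1_000_000) * price_per_million_msgs
    print("Monthly messaging bill: $%.0f" % monthly_cost)   # ~$6,480

And that is for one environment: multiply by every test environment that exercises the same service.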

Design your apps for local testability

In a recent talk about Netflix’s move to a PaaS model for the services it provides to Netflix integrators such as Roku, Yunong Xiao of Netflix mentioned that they invest heavily in the ability of developers to “mock” components, so that developers can perform integration testing (using mocks) before committing their code. This is something that is often lost on corporate teams that try to adopt DevOps: they find that their applications cannot be stood up in a test mode in local integration test environments. In other words, their applications have not been “designed for testability”.
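
The talk does not prescribe an implementation, but the underlying pattern is simple: give the application a seam where a real cloud dependency can be swapped for a local stand-in. A minimal Python sketch, assuming a hypothetical APP_TEST_MODE environment variable and an AWS SQS queue as the real dependency:

    import os

    class InMemoryQueue:
        # Local stand-in for the cloud messaging service, used in test mode.
        # Tests can also drain it to make assertions about what was sent.
        def __init__(self):
            self._messages = []
        def send(self, body):
            self._messages.append(body)
        def receive(self):
            return self._messages.pop(0) if self._messages else None

    class SqsQueue:
        # Thin wrapper so the real service exposes the same send() interface.
        def __init__(self, name):
            import boto3  # only imported when the real service is used
            self._queue = boto3.resource("sqs").get_queue_by_name(QueueName=name)
        def send(self, body):
            self._queue.send_message(MessageBody=body)

    def make_queue():
        # The application never constructs the dependency directly;
        # this factory decides based on the environment.
        if os.environ.get("APP_TEST_MODE") == "local":  # hypothetical flag
            return InMemoryQueue()
        return SqsQueue(os.environ["QUEUE_NAME"])

With a seam like this, a developer’s laptop never touches - or pays for - the real messaging service.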

In order to “shift left” one’s integration testing, so that it can be done upstream - ideally on developer laptops - prior to full-scale integration testing by a QA team, it is necessary to design an application so that it can be stood up in a small-footprint test mode. That’s what Yunong really means by “mocks”. For example, consider Apache Spark. Spark is used by Web services that need to implement extremely high-volume stream processing. However, in order to test Spark code, a programmer does not have to stand up a full-scale Spark cluster: they can run Spark on their laptop in local mode - even in a Docker container (that is how we normally do it). One can also set the memory used by Spark: for production, one would set it to multiple gigabytes, but for local integration testing, one can set it to something like 300MB.
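
For instance, a minimal PySpark integration test might look like the following (a sketch, assuming the Spark 2.x API):

    # Run with: spark-submit --driver-memory 300m test_job.py
    # (The driver heap must be set at launch time, which is why the
    # small memory footprint is given on the command line.)
    from pyspark.sql import SparkSession

    # local[2] runs Spark in-process with two worker threads -
    # no cluster required.
    spark = (SparkSession.builder
             .master("local[2]")
             .appName("local-integration-test")
             .getOrCreate())

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    assert df.count() == 2   # the kind of check an integration test makes

    spark.stop()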

Use a cloud framework that can run locally

It is absolutely critical that, when the architecture of a cloud-based application is conceived, the architects and planners think through how it will be tested, and the costs involved. Again, this is design for testability.

For example, if your application uses PaaS services that are native to your cloud provider, can you run those services locally (on laptops or on-premises VMs)? Or are those PaaS services tied to the cloud - and therefore require you to pay for their use in each and every test environment?

This is a critical consideration, because in a continuous delivery process, testing is not only continuous, but it is also done with a lot of parallelism. The parallelism occurs in two ways: (1) an integration test environment is often spun up whenever code is committed - often at a team or even single-developer level - and (2) regression test suites are often broken up into subsets that are run in parallel, in order to reduce run time to a few hours or less. If you are paying for the services used by each test environment, you are going to be paying an awful lot of money.
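
The second kind of parallelism is straightforward to implement. Here is a sketch of one deterministic way to split a regression suite across N parallel CI jobs (the function name and the NUM_SHARDS/SHARD_INDEX environment variables are our own, for illustration):

    import os

    def shard(tests, num_shards, shard_index):
        # Deterministic round-robin split: each parallel job gets a
        # disjoint subset, and together the subsets cover every test.
        return [t for i, t in enumerate(sorted(tests))
                if i % num_shards == shard_index]

    # Each CI job sets NUM_SHARDS and SHARD_INDEX, then runs only
    # its own slice of the suite.
    all_tests = ["test_auth.py", "test_billing.py",
                 "test_search.py", "test_signup.py"]
    mine = shard(all_tests,
                 int(os.environ.get("NUM_SHARDS", "1")),
                 int(os.environ.get("SHARD_INDEX", "0")))
    print(mine)

The point is that every one of those parallel jobs needs its own environment - which is exactly why per-environment service charges multiply so quickly.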

That’s why frameworks such as Kubernetes, OpenShift, and Cloud Foundry let you create small local container clusters on a laptop, and there are also local “mini cloud” tools that can run on a laptop, so that you can first debug your application locally and then deploy to a real cloud using, say, Terraform. Lattice is one such “local cloud” system: you write your infrastructure code using a cloud-agnostic tool such as Terraform (which has a driver for Lattice), test it against your local cloud, and then deploy to a real cloud.

Why do this? So that individual developers can perform integration testing - before they commit their code changes. This is true “shift left” testing, and the result is very high developer throughput and a very short feature lead time. This is becoming standard DevOps practice today. However, for your developers to be able to do this, they need to be able to run the cloud services locally, and ideally for free. If you think that your application is too complex to ever run locally, think again: a typical laptop can run 20 containers without any problem. If you can’t run your integration tests locally, it is very likely because your apps were not designed to run locally in a test mode - and you can fix that with some re-engineering, and save a lot of money.

Summary

Cloud services have enormous value for deploying flexible and scalable applications and services. However, it is essential to consider how those applications will be tested, and the costs involved. Applications must be designed to be testable: if you don’t carefully design a testing approach that uses your cloud services economically, and an application architecture that makes it possible to test economically (e.g., locally), then cloud services can rapidly become a financial burden.

Cliff & Phil
