Reducing cloud waste by optimizing Kubernetes with machine learning
The cloud has become the de facto standard for application deployment, and Kubernetes the de facto standard for container orchestration. Optimally tuning applications deployed on Kubernetes is a moving target, which means applications may be underperforming or overspending. Could that issue be solved with automation?
That’s a reasonable question to ask, and one that others have asked as well. As Kubernetes evolves and grows more complex with each iteration, and the options for cloud deployment proliferate, fine-tuning application deployment and operation becomes ever more difficult. That’s the bad news.
The good news is, we have now reached a point where Kubernetes has been around for a while, and tons of applications have used it throughout its lifetime. That means there is a body of knowledge — and crucially, data — that has been accumulated. What this means, in turn, is that it should be possible to use machine learning to optimize application deployment on Kubernetes.
StormForge has been doing that since 2016. So far, they have been targeting pre-deployment environments. As of today, they are also targeting Kubernetes in production. We caught up with CEO and Founder Matt Provo to discuss the ins and outs of StormForge’s offering.
Optimizing Kubernetes with machine learning
When Provo founded StormForge in 2016 after a long stint as a product manager at Apple, the goal was to optimize how electricity is consumed in large HVAC and manufacturing equipment, using machine learning. The company was using Docker for its deployments, and at some point in late 2018 they lifted and shifted to Kubernetes. This is when they found the perfect use case for their core competency, as Provo put it.
One pivot, one acquisition, $68 million in funding and many clients later, StormForge is today announcing Optimize Live, the latest extension to its platform. The platform uses machine learning to intelligently and automatically improve application performance and cost-efficiency in cloud-native production environments.
The first thing to note is that StormForge’s platform had already been doing that for pre-production and non-production environments. The idea is that users specify the parameters that they want to optimize for, such as CPU or memory usage.
Then StormForge spins up different versions of the application and returns to the user the configuration options to deploy it. StormForge claims this typically results in 40% to 60% cost savings and a 30% to 50% increase in performance.
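To make the idea concrete, here is a minimal sketch of what declaring tunable parameters for such an experiment might look like. The schema, names, and ranges (cpu_millicores, memory_mib, replicas, "checkout-service") are all illustrative assumptions, not StormForge's actual API:

```python
# Hypothetical experiment definition: each tunable parameter gets a search
# range the optimizer may explore. Names and values are made up for
# illustration; this is not StormForge's real configuration format.

experiment = {
    "application": "checkout-service",
    "objectives": ["minimize-cost", "maximize-throughput"],
    "parameters": [
        {"name": "cpu_millicores", "min": 100, "max": 2000},
        {"name": "memory_mib", "min": 128, "max": 4096},
        {"name": "replicas", "min": 1, "max": 10},
    ],
}

# The optimizer would trial application versions drawn from these ranges.
for p in experiment["parameters"]:
    print(f'{p["name"]}: {p["min"]}..{p["max"]}')
```

The key point is that the user declares the search space and the objectives; the machine learning does the exploring.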
It’s important to also note, however, that this is a multi-objective optimization problem. What this means is that while StormForge’s machine learning models will try to find solutions that strike a balance between the different goals set, it typically won’t be possible to optimize them all simultaneously.
The more parameters to optimize, the harder the problem. Typically users provide up to 10 parameters. What StormForge sees, Provo said, is a cost-performance continuum.
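That cost-performance continuum can be illustrated with a toy Pareto-front computation: out of a set of candidate configurations, keep only those that no other candidate beats on both cost and performance. The candidates and numbers below are invented for illustration and have nothing to do with StormForge's actual models:

```python
# Toy sketch of multi-objective trade-offs: find the configurations on the
# cost-performance Pareto front. All names and numbers are illustrative.

def pareto_front(configs):
    """Keep configs not dominated by any other (lower-or-equal cost AND
    higher-or-equal performance from a different config)."""
    front = []
    for c in configs:
        dominated = any(
            o["cost"] <= c["cost"] and o["perf"] >= c["perf"] and o != c
            for o in configs
        )
        if not dominated:
            front.append(c)
    return front

candidates = [
    {"name": "small",  "cost": 40,  "perf": 55},
    {"name": "medium", "cost": 70,  "perf": 80},
    {"name": "large",  "cost": 120, "perf": 82},
    {"name": "waste",  "cost": 90,  "perf": 60},  # dominated by "medium"
]

for c in pareto_front(candidates):
    print(c["name"])  # prints: small, medium, large
```

No single point on the front is "best"; picking between them is exactly the cost-versus-performance judgment the user has to make.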
In production environments, the process is similar, but with some important differences. StormForge calls this the observation side of the platform. Telemetry and observability data are used, via integrations with APM (Application Performance Monitoring) solutions such as Prometheus and Datadog.
Optimize Live then provides near real-time recommendations, and users can choose either to apply them manually or to use what Provo called “set and forget.” That is, let the platform apply those recommendations automatically, as long as certain user-defined thresholds are met:
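The "set and forget" guard can be sketched as a simple bounds check: a recommendation is auto-applied only while every recommended value stays inside the user's declared limits. The threshold names and values here are assumptions for illustration, not the Optimize Live API:

```python
# Illustrative "set and forget" guard: auto-apply a resource recommendation
# only if it stays within user-defined bounds. Keys and numbers are
# hypothetical, not StormForge's actual interface.

def should_auto_apply(recommendation, thresholds):
    """Return True only if every recommended value is inside its range."""
    for key, value in recommendation.items():
        lo, hi = thresholds[key]
        if not (lo <= value <= hi):
            return False
    return True

thresholds = {"cpu_millicores": (100, 2000), "memory_mib": (128, 4096)}

safe = {"cpu_millicores": 750, "memory_mib": 512}
risky = {"cpu_millicores": 4000, "memory_mib": 512}  # exceeds CPU ceiling

print(should_auto_apply(safe, thresholds))   # True  -> apply automatically
print(should_auto_apply(risky, thresholds))  # False -> fall back to manual review
```

Anything outside the bounds falls back to the user for a manual decision, which is where the quote below picks up.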
“The goal is to provide enough flexibility and a user experience that allows the developer themselves to specify the things they care about. These are the objectives that I need to stay within. And here are my goals. And from that point forward, the machine learning kicks in and takes over. We’ll provide tens if not hundreds of configuration options that meet or exceed those objectives,” Provo said.
The fine line with Kubernetes in production
There’s a very fine line between learning and observing from production data, and live tuning in production, Provo went on to add. Cross that line and the level of risk becomes unmanageable and untenable; StormForge’s users were unequivocal that they did not want that. What users are presented with is the option to choose their risk tolerance and what they are comfortable with from an automation standpoint.
In pre-production, the different configuration options for applications are load-tested via software created for this purpose. Users can bring their own performance testing solution, which StormForge will integrate with, or use StormForge’s own performance testing solution, which was brought on board through an acquisition.
Historically, this has been StormForge’s biggest data input for its machine learning, Provo said. Kicking it off, however, was not easy. StormForge was rich in talent, but poor in data, as Provo put it.
In order to bootstrap its machine learning, StormForge gave its first big clients very good deals, in return for the right to use the data from their use cases. That worked well, and StormForge has now built its IP around machine learning for multi-objective optimization problems.
More specifically, around Kubernetes optimization. As Provo noted, the foundation is there, and fine-tuning to each specific use case and each new parameter takes only a few minutes, with no additional manual tweaking needed.
There’s a little bit of learning that takes place, but overall, StormForge sees this as a good thing: the more scenarios and situations the platform encounters, the better its performance becomes.
In the production scenario, StormForge is in a sense competing against Kubernetes itself. Kubernetes has auto-scaling capabilities, both vertical and horizontal, via the VPA (Vertical Pod Autoscaler) and the HPA (Horizontal Pod Autoscaler).
StormForge works with the VPA, and is planning to work with the HPA too, to allow what Provo called two-way intelligent scaling. StormForge measures the optimization and value provided against what the VPA and the HPA are recommending for the user within a Kubernetes environment.
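That comparison against the autoscalers' recommendations boils down to simple percentage arithmetic, sketched below. The dollar figures are invented for illustration; this is not StormForge's measurement code:

```python
# Hypothetical comparison of an ML-tuned resource setting against a
# VPA-recommended baseline, expressed as percentage cost savings.
# All numbers are made up for illustration.

def savings_vs_baseline(baseline_cost, tuned_cost):
    """Percentage saved relative to the baseline configuration's cost."""
    return round(100 * (baseline_cost - tuned_cost) / baseline_cost, 1)

vpa_monthly_cost = 1000.0   # cost of running at the VPA-recommended requests
tuned_monthly_cost = 750.0  # cost at the ML-recommended requests

print(savings_vs_baseline(vpa_monthly_cost, tuned_monthly_cost))  # 25.0
```

A figure in this range is consistent with the production savings Provo cites below.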
Even in the production scenario, Provo said, they are seeing cost savings. Not quite as high as in pre-production, but typically still 20% to 30% cost savings and a 20% improvement in performance.
Provo and StormForge go as far as to offer a cloud waste reduction guarantee. StormForge guarantees a minimum 30% reduction of Kubernetes cloud application resource costs. If savings fall short of the promised 30%, Provo will pay the difference toward your cloud bill for one month (up to $50,000 per customer) and donate the equivalent amount to a green charity of your choice.
When asked, Provo said he has not had to honor that commitment even once to date. As more people move to the cloud and more resources are consumed, there is a direct connection to cloud waste, which in turn relates to carbon footprint, he added. Provo sees StormForge as having a strong mission-oriented side.