Why Right Sizing Instances is Not Nonsense

I like Corey Quinn — his newsletter and blog make some good points, but his recent post, Right Sizing Your Instances Is Nonsense, is a little off base.  I encourage you to read it in its entirety.

“Right Sizing means upgrading to latest-gen”

Corey makes the argument that upgrading an m3.2xlarge to a m5.2xlarge for a savings of 28% is the correct course of action.  We have a user with > 30 m3.2xlarge instances whose CPU utilization is typically in the low digits, but which spikes to 60+% periodically.  Whatever, workloads rarely crash because of insufficient CPU — they do, however, frequently crash because of insufficient memory.  In this case, their memory utilization has never exceeded 50%.

Our optimizations, which account for this and other utilization requirements, indicate that the “best fit” for their workload is in fact an r5.large, which saves them ~75%.  In this case, for their region, the calculation is:

  1. m3.2xlarge * 0.532000/hour * 730 hours/month * 30 = $11,650.80/month
  2. r5.large * 0.126000/hour * 730 hours/month * 30 = $2759.40

The approximate monthly difference is $8891.40/month

Now, these assume on-demand instances, and reserved instances can save you a substantial amount (29% in this case at $0.380 per instance/hour), but you’re locked in for at least a year and you’re still overpaying by 320%.

“An ‘awful lot of workloads are legacy’ -> Legacy workloads can’t be migrated”

So, this one’s a little harder to tackle just because “an awful lot” doesn’t correspond to a proportion, but let’s assume it means “100%” just to show how wrong this is according to the points he adduces:

The Big Three: Comparing AWS, Azure and Google Cloud for Computing

If you’ve heard of cloud computing at all, you’ve heard of Amazon Web Services (AWS), Microsoft Azure and Google Cloud. Between the three of them, they’ll be raking in over $50 billion in 2019. If you’re on the cloud, chances are good you’re using at least one of them.

The latest RightScale State of the Cloud Report pegs AWS adoption at 61%, Azure at 52% and Google Cloud at 19% (see the purple above). What’s more, almost all respondents (as denoted in blue) were experimenting with or planned to use one of the top three clouds. Which, if you math that up, means that 84% of respondents are going to be using AWS at some point, 77% will be using Azure and 55% will be using Google Cloud.

AWS, Azure & GCP market share

Multi-cloud strategies are definitively A Thing, contrary to some folks’ opinions and the overwhelming one-cloud-to-rule-them-all desire of AWS. So it’s worth comparing them. On a broad level, AWS rocks and rolls with capabilities set to lock you into their cloud, while Azure’s great for enterprises and Google Cloud’s your go-to if you want to do AI. But, as with all things, there’s more to it than that, and it’s not just where you can get the best cloud credit deals.

Everything is a Data Problem

You wouldn’t think that the primary issue with optimizing cloud computing workloads would be getting good data. Figuring out math problems (hello, integer-constrained programming) worthy of a dissertation, sure. Writing a distributed virtual machine, maybe. Getting good data about a workload to run against good data about what the viable machines to put it on are? Not so much.

Well, you would be wrong. While the majority of the IP is in said math problems, the majority of the WORK is in the data — getting it and cleaning it up. And the data problem alone is enough to make you realize why everyone just picks an instance size and rolls with it until it doesn’t work anymore.

Last week we started the work to expand our platform from AWS-only to Azure. One of the first steps to that is what we call a “catalog”: a listing of all the possible virtual machine sizes across all possible regions with all of their pricing information (because, of course, pricing and availability vary). You would hope that this sort of catalog would be readily accessible from a cloud service provider (CSP). At the moment, the state-of-the-art is the work of many open-source contributors working together to scrape different CSP sets of documentation.

For AWS, we love ec2instances.info for this information, though we still had to get all of the region information in less savory ways. Different folks have attempted to do similar things for Azure, but Azure doesn’t make it easy. Pricing is different across Linux and Windows, because of course it is, but the information they give you when trying to look at pricing is missing some bits:

Screenshot comparing B-Series instances on Azure

Perfectly Provisioned: 22 Random Things That Fit Perfectly Into Each Other

When two random things fit together perfectly, it creates a special kind of magic — Like stumbling across a way to bring order to the chaos of everyday life. Maybe that’s what makes these 22 photos so satisfying?

1) Cat crammed into a box

2) A pill and a ruler

Before You Buy a Reserved Instance, Read This

Reserved Instances are an enormous investment.

At first glance, that statement might seem counter-intuitive. Reserved Instances (RIs) are widely advertised as the best way to save big on your Amazon Web Service (AWS) cloud compute bill. And in many cases, they are. With Reserved Instances, companies commit to long-term usage by agreeing to rent virtual machines for a set amount of time (typically 1 to 3 years) in exchange for a significantly lower rate than on-demand pricing. When viewed through this lens, they appear to be a vital part of an AWS cost management strategy.

Cost Savings

Take Amazon EC2 as an example. When compared to on-demand pricing, Amazon EC2 RIs offer customers potentially deep discounts — sometimes as much as 75%, per their marketing. While reserving cloud capacity in advance seems like the smart thing to do because it has the potential to deliver a significant amount of savings, the savings promised by RIs often have a dangerous downside — and any missteps can have substantial costs for your company.

The calculations involved in deciding which RI to purchase can be frustratingly complicated. One year or 3 year contract? What about tenancy? Instance size? Region and zone? New or from the marketplace? And don’t forget about the nuance of offering class — do you want your RI standard, convertible or scheduled?

These calculations are difficult, but absolutely vital when committing to a RI. Rather than signing a contract for exactly what you have now (in terms of size, region, and tenancy) and guessing at the length of time that will fit your instance, it’s essential to understand the exact shape of your usage needs. Without that kind of granular insight into your workload, it’s impossible to choose a RI that will be the right fit six months from now, let alone three years in the future.

In the end, many companies buy RI capacity that ends up exceeding their actual needs, because they’re already using capacity that exceeds their needs. Unfortunately, committing to more capacity than you actually need can be very costly over the length of a RI contract. When that happens, the long-term return on investment (ROI) ultimately evaporates.

How We Optimize Based on Resource Utilization Data

We frequently get asked what makes our AWS cost optimization so good. AWS cost management feels like it should be easy, and we talk to a lot of folks who think they’ve done a good job of it. The fact is, we’ve yet to see anyone who’s not wasting at least 40% of their EC2 bill. Let’s walk through it on our platform, and it’ll make sense why.

screenshot of a virtual machine report within the Sunshower platform

Fitting an Instance

It all starts with knowing what you’re actually using, resource-wise. Figuring this out as a human is surprisingly hard. For Sunshower, we look at the past month of a virtual machine’s life (if we have it — that’s our default) and sample every minute (by default, but it’s adjustable). After smoothing the data, that’s how we discover, in this case, 1 CPU (of the 8 they’re paying for) and 10G RAM (of the 30 they’re paying for) are actually being used.

In the screenshots below, you can see the resulting “shape” of the workload on the virtual machine. First, on the left: current vs utilized. The grey is what they’re currently paying for, and the purple is what they’re utilizing. Frankly, it LOOKS like a pretty good fit.

To compare, let’s look at the screenshot on the right: optimized vs utilized. There’s our purple triangle of utilization again. This time, you’ll see the optimized fit we found in blue. Even though the blue section looks a lot bigger, it actually reflects a substantial cost savings over the original, grey fit on the left.

resource utilization compared to purchased virtual machine

How is that possible? The thing you’re really paying for, in most machines, is CPU and Memory. So, the closer a fit you can get on those, the better. In the image on the left, you can see that the majority of the overprovisioning is taking place in the most expensive areas of cloud spend: CPU and memory. Tightening that fit up in CPU and memory, like you see represented in blue in the image on the right, might look like an incremental change from the image on the left, but in reality it adds up.

4 Strategies for Cloud Cost Optimization

We’ve talked about the most common causes of cloud waste, and how they can negatively impact your company’s bottom line.

Whether it’s choosing the wrong instance size, not fully understanding cloud pricing options, leaving unused resources running, or locking yourself into inflexible reserved instance contracts, there are lots of ways to end up with a cloud bill that wreaks havoc on your financials.

What can you do to keep cloud costs down and reduce your part of the $14.1 billion that will be wasted on cloud compute resources in 2019? You can avoid cloud waste by adopting smart cloud cost optimization strategies. Here are a few good places to start.

1. Don’t Over-Provision Your Cloud Infrastructure

Remember when there was that great deal on strawberries so you bought a bunch because you thought you’d surely eat them? And then you never did? Just like it’s hard to figure out what you’re really going to eat in a week, it’s hard to figure out what resources your software really needs in order to run. You can use a monitoring solution to determine what your resource utilization actually looks like in production, determining how much of critical resources like memory, CPU, disk, networking and more you’re using. Then, it’s a matter of aligning that with an instance size (which unfortunately sometimes feels more like blindly buying strawberries in bulk than reaching for a pre-packaged pint, considering the sheer number of options per cloud service provider).

2. Turn Off Idle Cloud Infrastructure

The main cause of idle capacity is leaving non-production machines up and running 24/7. Consider spinning down build, QA, demo and development environments during off hours. You can schedule them to turn off when your night owls leave and turn back on before the early birds come in. On the production side, use auto-scaling groups to help meet peak demand times. And of course, be vigilant that as people and products come and go you’re monitoring what systems are actually being used.

Cloud Waste is Costing You More Than You Think

Cloud over-provisioning is a lot like like buying strawberries in bulk.

I know that sounds weird, but hear me out:

I once bought a huge tub of strawberries at Costco. Sure, there were way more strawberries than I could probably eat, but they were such a great deal– just a little more expensive than a small box of berries at the grocery store. The lure of the deal was strong, so I caved.

What happened next? I ate a few berries out of the tub when I got home, promptly put them in the crisper drawer, and completely forgot about them. The next time I opened the drawer it was like a scene out of Avatar — a mysterious new greenish-blue ecosystem, composed of fuzzy, strawberry-shaped blobs.

Bulk strawberries sound like a great deal, but only if you’re going to, you know, actually eat all those strawberries.

We all pay for stuff we don’t end up using. Whether it’s those strawberries slowly creating a new ecosystem in the fridge or your company’s public cloud infrastructure, that waste can seriously add up. The moral of the story? It’s never a good deal if you don’t use what you’re paying for.

Cloud Waste

As cloud computing continues growing in popularity, more and more companies are turning to the cloud for their computing needs. But if you’re spending any money on a cloud service provider, it’s pretty darn likely that some of those funds are being lost on overprovisioned or underutilized cloud infrastructure. (Think: the proverbial untouched, moldy strawberries in the fridge.) It’s human nature to buy more than you need “just in case,” and this applies to cloud infrastructure too.

3 Reasons Your Cloud Bill is So High

At Sunshower.io, we talk to a lot of people about their cloud infrastructure usage. In our professional lives, we’ve dealt with the confusion caused by different cloud vendors, including confounding billing methods, lack of insight into the infrastructure you’ve built, and just throwing hardware and money at the current problem and hoping it’ll fix it. Understandably, the question we’re most frequently asked is the one that’s most mission-critical: How did my cloud bill get like this and how do I get it down?

1) You Forgot About Some Infrastructure

“Cloud sprawl” is extremely common, and happens when you’re running more cloud instances than necessary. It’s easy to see how this can happen—running workloads that you’ve forgotten about and unused and idle workloads are all key culprits. In a complex cloud ecosystem, it can be tough to keep watch over everything running in the cloud. Monitoring and controlling those workloads is key to making sure you’re not over-spending on the cloud. If your company isn’t using auto-scaling, you might be running instances 24/7 that aren’t always performing a necessary function. Running instances that you’re not using is essentially throwing money away—like going away for the weekend and leaving all of your lights on.

2) You Bought Too Much “Just In Case”

Overprovisioning refers to buying more cloud compute resources than you typically need. It’s important to tailor what you buy to actual usage, because it really adds up. The first step is figuring out what you’re actually using, which monitoring and cloud cost optimization tools can help with. If this process is overwhelming, there are vendors you can work with to help you sift through your options and make the best possible choices. Without good cloud monitoring tools, it’s impossible to see what you’re wasting. Only then should you start looking into what to buy instead.

3) You Drank The Vendor Kool-aid

The custom services provided by cloud service providers are tempting, but the cost can really add up. Even worse, it removes your ability to migrate to other cloud providers, so it’s hard to pivot to more cost-effective solutions over time. As you build your cloud strategy, try to avoid locking yourself into a relationship with a single cloud service provider. Don’t tie yourself to a single vendor because it’s convenient—make sure that you’re allowing yourself the flexibility to change providers and adapt new strategies when costs start to increase.

Memory Consumption in Spring and Guice

Spring vs guice

The heart of the Sunshower.io cloud management platform is a distributed virtual machine that we call Gyre. Developers can extend Gyre through Sunshower’s plugin system, and one of the nicer features that Gyre plugins provide is a dependency-injection context populated with your task’s dependencies, as well as extension-point fulfillments exported by your plugin. For example, take (part) of our DigitalOcean plugin:

The details are beyond the scope of this post, but basically Gyre will accept an intermediate representation (IR) from one of its front-ends and rewrite it into a set of typed tuples whose values are derived from fields with certain annotations. The task above will be rewritten into (ComputeTask, Vertex, TaskResolver, TaskName, Credential). This tuple-format is convenient for expressing type-judgements, especially on the composite and recursive types that constitute Gyre programs.