Why Right Sizing Instances is Not Nonsense

I like Corey Quinn — his newsletter and blog make some good points, but his recent post, Right Sizing Your Instances Is Nonsense, is a little off base.  I encourage you to read it in its entirety.

“Right Sizing means upgrading to latest-gen”

Corey makes the argument that upgrading an m3.2xlarge to an m5.2xlarge for a savings of 28% is the correct course of action.  We have a user with more than 30 m3.2xlarge instances whose CPU utilization is typically in the low single digits, but which spikes to 60+% periodically.  Workloads rarely crash because of insufficient CPU; they do, however, frequently crash because of insufficient memory.  In this case, their memory utilization has never exceeded 50%.

Our optimizations, which account for this and other utilization requirements, indicate that the “best fit” for their workload is in fact an r5.large, which saves them ~75%.  In this case, for their region, the calculation is:

  1. m3.2xlarge: $0.532/hour * 730 hours/month * 30 instances = $11,650.80/month
  2. r5.large: $0.126/hour * 730 hours/month * 30 instances = $2,759.40/month

The approximate difference is $8,891.40/month.
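The arithmetic above can be reproduced with a short sketch. The hourly rates, the 730 hours/month figure, and the fleet size of 30 are taken from the calculation in this post; the function and variable names are only illustrative.

```python
from decimal import Decimal

HOURS_PER_MONTH = Decimal("730")
INSTANCE_COUNT = 30

# On-demand hourly rates (USD) quoted in the calculation above.
HOURLY_RATE = {
    "m3.2xlarge": Decimal("0.532"),
    "r5.large": Decimal("0.126"),
}


def monthly_cost(instance_type: str, count: int = INSTANCE_COUNT) -> Decimal:
    """Monthly on-demand cost of `count` instances of one type."""
    return HOURLY_RATE[instance_type] * HOURS_PER_MONTH * count


current = monthly_cost("m3.2xlarge")
proposed = monthly_cost("r5.large")
savings = current - proposed
savings_pct = savings / current * 100

print(f"current:  ${current:,.2f}/month")
print(f"proposed: ${proposed:,.2f}/month")
print(f"savings:  ${savings:,.2f}/month ({savings_pct:.1f}%)")
```

Decimal is used instead of float so the dollar amounts come out exact rather than accumulating binary rounding error.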

Now, these figures assume on-demand instances.  Reserved instances can save you a substantial amount (29% in this case, at $0.380 per instance/hour), but you’re locked in for at least a year, and at that rate you’re still paying about three times what an r5.large costs on demand ($0.380 vs. $0.126 per hour, or roughly 200% more).
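For readers who want to check the reserved-instance comparison themselves, here is a minimal sketch using only the three hourly rates quoted in this post. The helper names are illustrative, and real RI pricing varies by region, term, and payment option.

```python
# Hourly rates (USD) quoted in this post.
ON_DEMAND_M3_2XLARGE = 0.532
RESERVED_M3_2XLARGE = 0.380   # 1-year reserved rate quoted above
ON_DEMAND_R5_LARGE = 0.126


def pct_discount(base: float, discounted: float) -> float:
    """Discount of `discounted` relative to `base`, in percent."""
    return (base - discounted) / base * 100


def cost_ratio(rate: float, reference: float) -> float:
    """How many times `reference` you pay when paying `rate`."""
    return rate / reference


ri_discount = pct_discount(ON_DEMAND_M3_2XLARGE, RESERVED_M3_2XLARGE)
ri_vs_r5 = cost_ratio(RESERVED_M3_2XLARGE, ON_DEMAND_R5_LARGE)
on_demand_vs_r5 = cost_ratio(ON_DEMAND_M3_2XLARGE, ON_DEMAND_R5_LARGE)

print(f"RI discount vs. on-demand m3.2xlarge: {ri_discount:.1f}%")
print(f"Reserved m3.2xlarge vs. r5.large:     {ri_vs_r5:.2f}x")
print(f"On-demand m3.2xlarge vs. r5.large:    {on_demand_vs_r5:.2f}x")
```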

“An ‘awful lot of workloads are legacy’ -> Legacy workloads can’t be migrated”

So, this one’s a little harder to tackle just because “an awful lot” doesn’t correspond to a proportion, but let’s assume it means “100%” just to show how wrong this is according to the points he adduces:

“Older versions of operating systems don’t support the newer hypervisor.”

This one is super baffling.  The hypervisor is a layer beneath the operating system, which means that, in a perfect world, an application running on a virtualized server should have no idea what hypervisor technology is actually being used.  It’s not like a certain version of Red Hat will only work on Xen and the moment you move to Nitro it jeffries up your operating system.  Indeed, you can verify this by launching any version of any distro of Linux onto either Xen or Nitro.

AWS itself refutes this point:

Will applications need to be modified?
Most of the time, no. Some applications have relied on
undocumented behavior to detect they are running within
EC2 and they may require adjustment.

There may be some incompatibility in the network drivers, but it’s relatively easy to circumvent: we rarely suggest moving from ENA-capable to ENA-disabled instance classes, and we can also install the drivers for you (you can disable this form of modification).

Which brings me to his second point on this matter, viz.

“Workloads are ‘certified’ by either external vendors or internal divisions to run on certain versions of various bundled libraries.”

Given that we’ve just established that, in most cases, upgrading the hypervisor does not require an OS change, this point is moot.

“Upgrading is hard”

Unless you’re on a reserved instance, you can change instance type easily as follows:

  1. Log in to console.aws.amazon.com
  2. Select the correct region
  3. Identify the correct instance type (cf. workload certification section)
  4. Set instance state to “stop” (CAUTION: SEE NOTE BELOW)
  5. Select “change instance type”
  6. Select the new instance type
  7. Set instance state to “start”

Or:

  1. Log into https://sun.sunshower.io
  2. Discover your system
  3. Click “Optimize” on the given instance type

Yes, whatever is running on that instance must be halted, and you may have to restart processes (use USERDATA!), but here’s the thing:

You’re going to have to move, anyway.  Whether it’s a hardware failure (uncommon, but not as uncommon as you might believe), or a nuts security vulnerability in your OS or hardware or whatever, no VM can run forever.  If you stop it even once, you might as well start it up with the correct instance type.

NOTE: Instance-local storage (e.g. if the SKU comes with something like 2x80GB NVMe SSDs) is not guaranteed to be available through a stop/start cycle.  Be careful and only use this hardware for ephemeral workloads that require the performance/data locality.
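The manual stop / change type / start sequence from the console steps above can also be scripted. The following is a minimal sketch, not Sunshower.io's implementation: it assumes an EBS-backed instance and a boto3 EC2 client passed in as `ec2`, and it uses the boto3 calls `stop_instances`, `modify_instance_attribute`, `start_instances`, and the `instance_stopped` / `instance_running` waiters. The instance ID and region in the usage comment are placeholders. The caveat about instance-local storage applies here exactly as it does in the console.

```python
def resize_instance(ec2, instance_id: str, new_type: str) -> None:
    """Stop an instance, change its type, and start it again.

    `ec2` is expected to be a boto3 EC2 client, for example
    boto3.client("ec2", region_name="us-east-1").
    """
    # The instance type can only be changed while the instance is stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])


# Usage (requires boto3 and AWS credentials; the ID below is a placeholder):
#
#   import boto3
#   client = boto3.client("ec2", region_name="us-east-1")
#   resize_instance(client, "i-0123456789abcdef0", "r5.large")
```

Taking the client as a parameter keeps the function easy to exercise against a stub before pointing it at a real account, which is worth doing alongside a backup and a rollback plan.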

“You’ll never find the proper instance type for your workload”

Corey’s last point is his strongest.  Between all SKUs across all regions, including reserved instances and RDS instances, there are over 250,000 SKUs altogether.  There are also dozens of metrics that you’ll need to consider when comparing these workloads against a given SKU.  At Sunshower.io, we acknowledge this and have removed this particular barrier.  You can, in seconds, discover the optimal instance/workload alignment according to your application, over any time period.
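To give a rough feel for what "best fit" selection involves, here is a deliberately simplified sketch. It is not Sunshower.io's algorithm: it considers memory only (following the argument earlier in this post that memory is the constraint that crashes workloads), uses a three-entry catalog instead of the full SKU space, and ignores CPU, network, and storage. The memory sizes are AWS's published figures for these types, the m3.2xlarge and r5.large prices come from this post, and the m5.2xlarge price is an assumption derived from the 28% savings figure. The `headroom` and `memory_floor_gib` parameters are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class InstanceType:
    name: str
    memory_gib: float
    hourly_usd: float


# Tiny illustrative catalog; a real one has thousands of entries per region.
CATALOG = [
    InstanceType("m3.2xlarge", 30.0, 0.532),
    InstanceType("m5.2xlarge", 32.0, 0.383),  # assumed: ~28% below m3.2xlarge
    InstanceType("r5.large", 16.0, 0.126),
]


def cheapest_fit(
    catalog: Sequence[InstanceType],
    peak_memory_gib: float,
    headroom: float = 1.05,
    memory_floor_gib: float = 0.0,
) -> Optional[InstanceType]:
    """Return the cheapest type whose memory covers peak usage plus headroom.

    `memory_floor_gib` lets the caller rule out small instances outright.
    Returns None when nothing in the catalog qualifies.
    """
    required = max(peak_memory_gib * headroom, memory_floor_gib)
    candidates = [t for t in catalog if t.memory_gib >= required]
    return min(candidates, key=lambda t: t.hourly_usd, default=None)


# Memory never exceeded 50% of the m3.2xlarge's 30 GiB in the example above.
observed_peak_gib = 0.50 * 30.0

best = cheapest_fit(CATALOG, observed_peak_gib)
floored = cheapest_fit(CATALOG, observed_peak_gib, memory_floor_gib=30.0)
print(best.name if best else "no fit")
print(floored.name if floored else "no fit")
```

With a 30 GiB floor the same workload lands on a larger type, which shows how much the answer depends on the constraints you choose to enforce.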

Screenshot of Sunshower.io optimization results

So here you go, Corey — some nice, cool lemonade to go with that hottest of takes.

10 comments

  1. The certified workload point is hardly moot – it may run once moved, but the typical response when your vendor’s support finds out is to blame any and all problems on the “wrong” instance type and close the case. Is that stupid? Absolutely. But that’s what happens. And for legacy workloads, support is essential, because that’s how you do license renewal, among other things.

    I will add that stop/change-type/start has edge-case failure modes, especially when the jump in types is large and you’re not using Amazon Linux, which is typical of legacy workloads. I’ve been burned – always have a backup and rollback plan.

  2. Hi Joe, I like the piece, and you make some good criticisms.

    That said I would still recommend caution…

    Firstly, you shouldn’t be manually changing instances through the console to start with. What, everything is code, all the way down? Hehe…

    And with code, there is testing. So change your Terraform code and then verify it works. Now just because you have a variable indicating the instance size, doesn’t mean changing it won’t break something.

    Changing an instance size and redeploying could break all manner of things. It’s possible you used a variable for the instance size in some places and hard coded it in others. Or made some weird reference in an autoscaling group. It may be the AMI you’ve built works on one type of instance but not another. Or that your AMI is deployed in one region but not another. Or that your old instance size is available in us-east-2, but your new instance size is not yet available there. Yes the console wouldn’t have offered it, but your Terraform code didn’t know.

    What’s more when you change instance sizes, your network bandwidth and memory change too. You may figure you have plenty of spare memory only to find one component of your application makes some weird check you didn’t know about. Or perhaps you do some Ansible tricks after boot which suddenly behave finicky on the new box.

    Or further, perhaps the boxes just haven’t had their instance types changed before, and simply break.

    I would argue that you’re both a bit right.

    -Sean

    1. Hi Sean,

      You’re absolutely right–instance attributes like size should be managed in your infrastructure-as-code solution. We do see a lot of instances, especially for legacy systems, that aren’t managed via IaC. One of the features we’re going to be rolling out shortly is exporting discovered infrastructure into Terraform, and we hope that will alleviate some of the pain of managing these legacy instances.

      And yeah, instance resizing continues to be an experimental feature for us and will certainly never be perfect, but for the preponderance of use-cases we can do a pretty good job ensuring that we don’t recommend a type that is misaligned with your workload, and you can certainly tune your results. For instance, you can set floors and ceilings for given metrics (e.g. don’t consider instances with < 30 GB memory) and our users seem to find that that works well for them.

  3. “Locked in for a year” – this is a bit misleading
    If you want to receive the “full benefit” of the RI, i.e. the full savings over the entire period, then yep correct.

    However, if you just want to make sure you don’t lose money on the RI, you’re “locked in” for 8.5714… months; after that point you’ll never lose a dollar.
    So if you purchase the RI, use it for 9 months, and then leave it sitting unused for the next 3 months, it will still have been the cheapest option for those 9 months.

    1. Hi Nate,

      I definitely don’t intend to be misleading–I mentioned that RI can provide some substantial savings, but possibly I should’ve been more clear. I was trying to convey more that, in our example, a better fit for their workload would’ve cut their cost by ~2/3 even over the 1 year RI.

  4. Interesting read and I agree with many of your points. However, a switch to Nitro instances is not necessarily straightforward.

    For example, due to the way EBS disks are exposed on Nitro instances (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html), you get non-deterministic ordering of EBS attachments with Linux kernels. In our case this caused a not insignificant amount of rework to update various CloudFormation templates and custom AMIs to support the newer families.
