Sergey Maskalik's blog

In the pursuit of mastery

When choosing a Solid State Drive (SSD) type for your low latency, transactional Amazon Relational Database Service (RDS) database, Amazon Elastic Block Store (EBS) provides two options: General Purpose SSD (GP2) and Provisioned IOPS SSD (IO1). The IO1 type is a much more expensive option, and you want to make sure that your database workload justifies the additional cost. We’ll look at the reasons why you may need IO1 instead of GP2.

Performance and performance consistency

EBS documentation describes IO1 as the “Highest-performance SSD volume for mission-critical low-latency or high-throughput workloads,” with use cases of “Critical business applications that require sustained IOPS performance, or more than 16,000 IOPS or 250 MiB/s of throughput per volume.” IOPS are a unit of measure representing input/output operations per second, with the size of each operation measured in kibibytes (KiB) and capped at 256 KiB for SSD volumes.

Mark Olson, a Senior Software Engineer on the EBS team, mentioned in one of the Amazon re:Invent talks that between the two volume types “… the performance is very similar, it’s the performance consistency that’s different between the two. In benchmark you won’t notice, but you’ll notice it over time.”

Below 16,000 IOPS or 250 MiB/s of throughput, both volume types can be configured with the same number of IOPS and, as Mark said, have very similar performance. With the GP2 volume type, IOPS are provisioned by volume size: 3 IOPS per GB of storage, with a minimum of 100 IOPS. With the IO1 volume type, provisioned IOPS are independent of the disk size, as long as they stay below the rate of 50 IOPS per GB.
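
To make the two provisioning rules concrete, here is a small sketch (illustrative only, not an official AWS calculator; the 16,000 IOPS figure comes from the limits quoted above):

    # Sketch of the provisioning rules described above (illustrative only).

    def gp2_baseline_iops(size_gb):
        # GP2: 3 IOPS per GB of storage, with a floor of 100 IOPS and a cap of 16,000.
        return min(max(3 * size_gb, 100), 16000)

    def io1_max_provisionable_iops(size_gb):
        # IO1: IOPS are provisioned independently of size, up to 50 IOPS per GB.
        return 50 * size_gb

    print(gp2_baseline_iops(334))             # ~1,000 IOPS baseline
    print(io1_max_provisionable_iops(100))    # up to 5,000 IOPS on a 100 GB volume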

The difference in performance consistency that Olson mentioned is quantified in the documentation: “GP2 is designed to deliver the provisioned performance 99% of the time,” while IO1 is “…designed to deliver the provisioned performance 99.9% of the time.” Therefore, below the maximum throughput, if your system can tolerate 99% storage performance consistency and does not need 99.9%, the GP2 volume type is a much more cost-effective option.
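
To get a feel for what that gap can mean in practice, here is a back-of-the-envelope illustration that applies the percentages to a 30-day month (purely an illustration; the documentation does not define the measurement window this way):

    # Purely illustrative: time a volume could fall below provisioned performance
    # if the consistency percentage were applied to a 30-day month.
    month_minutes = 30 * 24 * 60            # 43,200 minutes in a 30-day month
    gp2_shortfall = month_minutes * 0.01    # 432 minutes (~7.2 hours) at 99%
    io1_shortfall = month_minutes * 0.001   # ~43 minutes at 99.9%
    print(gp2_shortfall, io1_shortfall)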

Throughput

Throughput measures how much data a disk can read or write per unit of time. It depends on how many IOPS are configured and how much data is read or written per I/O operation (capped at 256 KiB). Maximum throughput for GP2 depends on the volume size and I/O size, with the maximum of 250 MiB/s reached at 1,000 IOPS x 256 KiB, which corresponds to a 334 GiB disk.

If the database workload is data intensive and requires more than 250 MiB/s of throughput, GP2 will not be the right volume type. Transactional systems don’t usually read or write large amounts of data at once, but it is still possible to hit the 250 MiB/s cap with an I/O size of 16 KiB or more at 16,000 IOPS (a 5,334 GB disk). Of course, your workload may be different, and you should always check the average I/O size for your database.
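
Both of those figures follow directly from throughput = IOPS x I/O size; a quick sketch confirms the arithmetic:

    def throughput_mib_per_s(iops, io_size_kib):
        # Throughput (MiB/s) = IOPS * I/O size (KiB) / 1024.
        return iops * io_size_kib / 1024

    print(throughput_mib_per_s(1000, 256))    # 250.0 MiB/s: 1,000 IOPS at the 256 KiB cap
    print(throughput_mib_per_s(16000, 16))    # 250.0 MiB/s: 16,000 IOPS at 16 KiB per I/O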

Cost comparison

Given the same number of provisioned IOPS for both volume types and a throughput of less than 250 MiB/s, the performance consistency of IO1 does not come cheap. And because the performance is otherwise very similar, you can save a lot of money if your application doesn’t need 99.9% performance consistency. Comparing the monthly cost of a GP2 volume at its baseline IOPS with an IO1 volume of the exact same size and provisioned IOPS makes the difference clear.
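
As a sketch of how that comparison works, assuming typical us-east-1 list prices at the time of writing (roughly $0.10 per GB-month for GP2, and $0.125 per GB-month plus $0.065 per provisioned IOPS-month for IO1; check the current EBS pricing page before relying on these numbers):

    # Assumed us-east-1 list prices (illustrative; verify against current EBS pricing).
    GP2_PER_GB_MONTH = 0.10
    IO1_PER_GB_MONTH = 0.125
    IO1_PER_IOPS_MONTH = 0.065

    def gp2_monthly_cost(size_gb):
        return size_gb * GP2_PER_GB_MONTH

    def io1_monthly_cost(size_gb, provisioned_iops):
        return size_gb * IO1_PER_GB_MONTH + provisioned_iops * IO1_PER_IOPS_MONTH

    # A 334 GB GP2 volume comes with a ~1,000 IOPS baseline; an IO1 volume of the
    # same size provisioned at 1,000 IOPS costs roughly three times as much.
    print(gp2_monthly_cost(334))          # ~$33/month
    print(io1_monthly_cost(334, 1000))    # ~$107/month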

Choosing the correct instance type

Another important factor that can limit the performance of your database is the underlying Amazon Elastic Compute Cloud (EC2) instance type. EBS bandwidth varies between EC2 instance types, and it’s possible that the instance’s bandwidth is lower than the maximum throughput supported by your EBS volume. I’ve personally run into this issue when my application’s database instance was an m4.xlarge with dedicated EBS bandwidth of 750 Mbps, which translates to about 93.75 MB/s, well below the 250 MiB/s expected throughput of the storage. EBS bandwidth for each EC2 instance type is listed in the EC2 documentation.
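
A quick unit conversion shows why the instance, not the volume, was the bottleneck in that case (ignoring protocol overhead):

    # m4.xlarge dedicated EBS bandwidth is quoted in megabits per second.
    instance_ebs_bandwidth_mbps = 750
    instance_throughput_mb_per_s = instance_ebs_bandwidth_mbps / 8    # 93.75 MB/s
    volume_max_throughput_mib_per_s = 250                             # GP2/IO1 cap discussed above

    print(instance_throughput_mb_per_s)   # 93.75 MB/s, well below the volume's 250 MiB/s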

Summary

Taking time to understand your database workload and differences between the two storage types, GP2 and IO1, can potentially reduce costs and improve the performance of your application.

One of the challenges for large multi-tenant systems that rely on many external services to complete a single request is ensuring that each dependency is configured correctly in production environments. In addition, if a system is geographically distributed, each instance may depend on different versions of external services.

Even with the best intentions, humans who have to configure and maintain these types of complex systems are known to be unreliable.

“One study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10-25% of outages.” (Martin Kleppmann, Designing Data-Intensive Applications)

Because of the many possible test scenarios, manual regression testing after each release is not practical, especially if your team follows a continuous delivery approach.

Monitoring of critical paths provides good feedback about already configured tenants and is an important tool when making production releases. However, some geographical regions may be outside of their peak hours and monitoring alone may not provide fast enough feedback about deployments to production. Also, when new tenants are brought online you still have to validate configuration and dependencies in production.

Traditionally, automated end-to-end tests have been reserved for pre-production environments; however, if your production environment relies on a multitude of external services to complete a single request, extending end-to-end tests to production can help ensure that all tenant dependencies are configured properly. The goal of these production tests is not to run a full regression suite, but to test only the critical paths.

Depending on how your end customers interact with your system, automated end-to-end tests can run against a public-facing web user interface (UI) using something like Selenium WebDriver, or directly against public API endpoints.
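
As a sketch of what a critical-path API check might look like (the endpoint, tenant header, and token below are hypothetical placeholders, not part of any real system), a test like this could be run with pytest against each tenant after a deploy:

    # Minimal critical-path check against a public API (hypothetical endpoint and tenant).
    import os

    import requests

    BASE_URL = "https://api.example.com"    # placeholder for the public API endpoint

    def test_tenant_can_fetch_account_summary():
        # A read-only call that exercises the tenant's external dependencies end to end.
        response = requests.get(
            f"{BASE_URL}/v1/accounts/summary",
            headers={
                "X-Tenant-Id": "tenant-eu-1",
                "Authorization": f"Bearer {os.environ['E2E_TEST_TOKEN']}",
            },
            timeout=10,
        )
        assert response.status_code == 200
        assert "balance" in response.json()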

Triggered from a Continuous Integration (CI) system after deployments to production, the test suite provides immediate detection of, and warning about, any underlying issues. By making it clear exactly what is failing, production end-to-end tests also save time for engineers who would otherwise receive bug reports through various channels and have to sift through logs to figure out what went wrong. Finally, because all critical paths are tested in production after deploys, the tests provide additional confidence about deployments that monitoring alone may not.

In addition to being triggered by CI, these tests can be run manually or automatically after configuration changes in production. The ability to run critical-path tests in production on demand also reduces the manual QA overhead needed to verify that everything is working as expected after a configuration change.

Unfortunately, running tests in production will create test data in production databases. I personally don’t like to delete data from production databases, even if it’s test data, because your database is a system of record, and it’s important for it to stay intact for troubleshooting. Therefore, you would need to keep track of test users/transactions and filter out test transactions from consumers of your data.
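
One simple convention is to tag test records at creation time and exclude them in every downstream consumer; the is_test_data flag and user prefix below are hypothetical examples of such a convention:

    # Hypothetical convention: production tests flag the records they create,
    # and every consumer of the data reads through a filter like this.
    TEST_USER_PREFIX = "e2e-test-"

    def is_test_transaction(transaction):
        return (
            bool(transaction.get("is_test_data"))
            or transaction.get("user_id", "").startswith(TEST_USER_PREFIX)
        )

    def real_transactions(transactions):
        # Used by reports, exports, and analytics so test traffic never skews them.
        return [t for t in transactions if not is_test_transaction(t)]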

Finally, test results will be fed into a monitoring system and displayed on the team’s dashboard. Alerts are also configured on this data.
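
For example, each test run could publish a pass/fail data point that the dashboard and alerts consume; here is a sketch using CloudWatch via boto3 (the namespace and dimension names are made up):

    # Publish end-to-end test results as a metric (hypothetical namespace and dimensions).
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def report_result(tenant, test_name, passed):
        cloudwatch.put_metric_data(
            Namespace="ProductionE2ETests",
            MetricData=[{
                "MetricName": "TestFailure",
                "Dimensions": [
                    {"Name": "Tenant", "Value": tenant},
                    {"Name": "Test", "Value": test_name},
                ],
                "Value": 0 if passed else 1,
                "Unit": "Count",
            }],
        )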

Putting it all together

When dealing with a lot of external dependencies and configuration permutations in a system, we need to think outside the box and engineer solutions that help us deal with the additional complexity. While it’s important to ship software bug-free, there are cases where it’s much more efficient to verify that the software is bug-free right after it’s deployed to production.

In the last few years, as my team grew in size, one of the problems that kept coming up during retrospective meetings was the poor turnaround on code reviews. With a smaller team there was no need for an additional process to find code review volunteers, since every engineer had to pitch in and review code daily. But with a larger team, not having a clear process or rules to follow was starting to affect the team’s performance. The issues were identified as follows:

  1. It was difficult to find a code review volunteer.
  2. After a volunteer was found, sometimes you would still need to follow up if the review was not getting attention.
  3. Some people would volunteer less than others.
  4. When comments or replies were posted on code reviews, they were not immediately visible, because GitHub’s email notifications often sat unread until you checked your inbox. To keep the review process moving along, you sometimes had to message the reviewer directly.

Initial attempts at creating additional process

After multiple discussions with the team, everyone agreed to introduce a simple rule: at the start of each day, every developer would spend 15 minutes on code reviews. This, in theory, would provide more than enough engineers to complete all outstanding reviews and keep review turnaround under 24 hours. A few months later, the results of the experiment were mixed. There were still long delays in getting code reviewed. Sometimes changes requested during code reviews had to be reviewed again, and if your reviewers were already done for the day, it would have to wait another day. Due to the slow feedback cycle, reviews could still take days to complete. Finally, some engineers were not contributing every day because they were busy with other work.

After another brainstorming session, the team identified that one of the issues behind the poor turnaround was that outstanding code reviews were not easily visible. Because we work with many GitHub repositories, it is not practical to go into each one to see which code reviews (pull requests) are outstanding. The proposed solution was to use the “pin” feature in Slack, our instant messenger, which adds a code review link to the team channel’s dashboard. When engineers finished reviewing, they would add a 👍 emoji to the pinned code review, flagging their approval. When two thumbs up appeared on a pinned request, the requestor would merge the pull request and unpin the item. This was not a complicated process to follow, but there was still confusion, and after another few months outstanding reviews started to linger on the team’s channel board.

From volunteering to assignment

In an attempt to uncover the underlying problem, one of the engineers extracted data on the number of reviews per person and noticed that reviews were not evenly distributed. Some people did a lot more than others. Asking for volunteers was not working very well and also created a fairness problem.

At this point, it was obvious that we needed to even out the distribution and prioritize assigning the engineers with the fewest reviews. We also thought about automating this process; it would not have been difficult to write a script that pulled review statistics and assigned the people with the fewest reviews. However, to save time we looked online to see if someone had already solved this problem, and we found Pull Reminders, a commercial application that did exactly what we needed, plus other useful features.
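
Such a script could have looked roughly like this sketch against the GitHub REST API (the organization, repositories, and team members are placeholders, and counting currently requested reviews is just one possible heuristic):

    # Sketch: count open review assignments per engineer and pick the two least loaded.
    import os
    from collections import Counter

    import requests

    GITHUB_API = "https://api.github.com"
    HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
    TEAM = ["alice", "bob", "carol", "dave"]              # placeholder logins
    REPOS = ["my-org/service-a", "my-org/service-b"]      # placeholder repositories

    def open_review_counts():
        counts = Counter({member: 0 for member in TEAM})
        for repo in REPOS:
            pulls = requests.get(f"{GITHUB_API}/repos/{repo}/pulls",
                                 params={"state": "open"}, headers=HEADERS).json()
            for pull in pulls:
                for reviewer in pull.get("requested_reviewers", []):
                    if reviewer["login"] in counts:
                        counts[reviewer["login"]] += 1
        return counts

    def pick_reviewers(author, n=2):
        counts = open_review_counts()
        counts.pop(author, None)   # never assign authors to their own pull request
        return [login for login, _ in sorted(counts.items(), key=lambda kv: kv[1])[:n]]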

Initially, when we decided to give Pull Reminders a try, we weren’t confident that it would solve the problem. However, after everyone was on board, we were surprised to find that the issue with code reviews did not come up again during retrospective meetings. We changed our process from volunteering to assignment, based on the leaderboard in the Pull Reminders app: when you need a review, rather than asking or posting in the channel, you assign the two people with the fewest reviews on the leaderboard. Pull Reminders takes care of notifying and reminding people about outstanding code reviews. The app also improved our communication, because it sends a personal Slack message when a comment or reply is posted on your code review. This tremendously improved response and turnaround times.

Summary

It’s been a year since we started using Pull Reminders, and I haven’t noticed any confusion or disconnect about code review responsibilities. The majority of reviews are completed within a day or two, and we can finally call the problem that caused so much discussion and inefficiency resolved. Most importantly, the new system removed the additional rules that everyone had to remember to follow. Now the system assigns reviews and notifies engineers when they need to review code, and it’s hard to ignore.

Coming up with new rules for everyone to follow is easy but sometimes ineffective. A much better approach is to create a system that makes new rules hard to ignore.