Sergey Maskalik's blog

In the pursuit of mastery

When choosing a Solid State Drive (SSD) type for your low latency, transactional Amazon Relational Database Service (RDS) database, Amazon Elastic Block Store (EBS) provides two options: General Purpose SSD (GP2) and Provisioned IOPS SSD (IO1). The IO1 type is a much more expensive option, and you want to make sure that your database workload justifies the additional cost. We’ll look at the reasons why you may need IO1 instead of GP2.

Performance and performance consistency

EBS documentation describes IO1 as the “Highest-performance SSD volume for mission-critical low-latency or high-throughput workloads,” with use cases of “Critical business applications that require sustained IOPS performance, or more than 16,000 IOPS or 250 MiB/s of throughput per volume.” IOPS are a unit of measure representing input/output operations per second, with the size of each operation measured in kibibytes (KiB) and capped at 256 KiB for SSD volumes.

Mark Olson, a Senior Software Engineer on the EBS team, mentioned in one of the Amazon re:Invent talks that between the two volume types “… the performance is very similar, it’s the performance consistency that’s different between the two. In benchmark you won’t notice, but you’ll notice it over time.”

Below 16,000 IOPS or 250 MiB/s of throughput, both volume types can be configured with the same number of IOPS and, as Mark said, have very similar performance. With the GP2 volume type, IOPS are provisioned by volume size: 3 IOPS per GB of storage, with a minimum of 100 IOPS. With the IO1 volume type, provisioned IOPS are independent of the disk size, as long as they stay below the rate of 50 IOPS per GB.
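
To make the two provisioning rules concrete, here is a small sketch (illustrative only, not an official AWS calculator; the 16,000 IOPS figure comes from the limits quoted above):

    # Sketch of the provisioning rules described above (illustrative only).

    def gp2_baseline_iops(size_gb):
        # GP2: 3 IOPS per GB of storage, with a floor of 100 IOPS and a cap of 16,000.
        return min(max(3 * size_gb, 100), 16000)

    def io1_max_provisionable_iops(size_gb):
        # IO1: IOPS are provisioned independently of size, up to 50 IOPS per GB.
        return 50 * size_gb

    print(gp2_baseline_iops(334))             # ~1,000 IOPS baseline
    print(io1_max_provisionable_iops(100))    # up to 5,000 IOPS on a 100 GB volume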

The difference in performance consistency that Olson mentioned is quantified in the documentation: “GP2 is designed to deliver the provisioned performance 99% of the time,” while IO1 is “…designed to deliver the provisioned performance 99.9% of the time.” Therefore, below the maximum throughput, if your system can tolerate 99% storage performance consistency and does not need 99.9%, the GP2 volume type is a much more cost-effective option.
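
To get a feel for what that gap can mean in practice, here is a back-of-the-envelope illustration that applies the percentages to a 30-day month (purely an illustration; the documentation does not define the measurement window this way):

    # Purely illustrative: time a volume could fall below provisioned performance
    # if the consistency percentage were applied to a 30-day month.
    month_minutes = 30 * 24 * 60            # 43,200 minutes in a 30-day month
    gp2_shortfall = month_minutes * 0.01    # 432 minutes (~7.2 hours) at 99%
    io1_shortfall = month_minutes * 0.001   # ~43 minutes at 99.9%
    print(gp2_shortfall, io1_shortfall)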

Throughput

Throughput measures how much data a disk can read or write per unit of time. It depends on how many IOPS are configured and how much data is read or written per I/O operation (capped at 256 KiB). Maximum throughput for GP2 depends on the volume size and I/O size, with the maximum of 250 MiB/s reached at 1,000 IOPS x 256 KiB, which corresponds to a 334 GiB disk.

If the database workload is data intensive and requires more than 250 MiB/s of throughput, GP2 will not be the right volume type. Transactional systems don’t usually read or write large amounts of data at once, but it is still possible to hit the 250 MiB/s cap with an I/O size of 16 KiB or more at 16,000 IOPS (a 5,334 GB disk). Of course, your workload may be different, and you should always check the average I/O size for your database.
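
Both of those figures follow directly from throughput = IOPS x I/O size; a quick sketch confirms the arithmetic:

    def throughput_mib_per_s(iops, io_size_kib):
        # Throughput (MiB/s) = IOPS * I/O size (KiB) / 1024.
        return iops * io_size_kib / 1024

    print(throughput_mib_per_s(1000, 256))    # 250.0 MiB/s: 1,000 IOPS at the 256 KiB cap
    print(throughput_mib_per_s(16000, 16))    # 250.0 MiB/s: 16,000 IOPS at 16 KiB per I/O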

Cost comparison

Given the same number of provisioned IOPS for both volume types and a throughput of less than 250 MiB/s, the performance consistency of IO1 does not come cheap. And because the performance is otherwise very similar, you can save a lot of money if your application doesn’t need 99.9% performance consistency. Comparing the monthly cost of a GP2 volume at its baseline IOPS with an IO1 volume of the exact same size and provisioned IOPS makes the difference clear.
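
As a sketch of how that comparison works, assuming typical us-east-1 list prices at the time of writing (roughly $0.10 per GB-month for GP2, and $0.125 per GB-month plus $0.065 per provisioned IOPS-month for IO1; check the current EBS pricing page before relying on these numbers):

    # Assumed us-east-1 list prices (illustrative; verify against current EBS pricing).
    GP2_PER_GB_MONTH = 0.10
    IO1_PER_GB_MONTH = 0.125
    IO1_PER_IOPS_MONTH = 0.065

    def gp2_monthly_cost(size_gb):
        return size_gb * GP2_PER_GB_MONTH

    def io1_monthly_cost(size_gb, provisioned_iops):
        return size_gb * IO1_PER_GB_MONTH + provisioned_iops * IO1_PER_IOPS_MONTH

    # A 334 GB GP2 volume comes with a ~1,000 IOPS baseline; an IO1 volume of the
    # same size provisioned at 1,000 IOPS costs roughly three times as much.
    print(gp2_monthly_cost(334))          # ~$33/month
    print(io1_monthly_cost(334, 1000))    # ~$107/month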

Choosing the correct instance type

Another important factor that can limit the performance of your database is the underlying Amazon Elastic Compute Cloud (EC2) instance type. EBS bandwidth varies between EC2 instance types, and it’s possible that the instance’s bandwidth is lower than the maximum throughput supported by your EBS volume. I’ve personally run into this issue when my application’s database instance was an m4.xlarge with dedicated EBS bandwidth of 750 Mbps, which translates to about 93.75 MB/s, well below the 250 MiB/s expected throughput of the storage. EBS bandwidth for each EC2 instance type is listed in the EC2 documentation.
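
A quick unit conversion shows why the instance, not the volume, was the bottleneck in that case (ignoring protocol overhead):

    # m4.xlarge dedicated EBS bandwidth is quoted in megabits per second.
    instance_ebs_bandwidth_mbps = 750
    instance_throughput_mb_per_s = instance_ebs_bandwidth_mbps / 8    # 93.75 MB/s
    volume_max_throughput_mib_per_s = 250                             # GP2/IO1 cap discussed above

    print(instance_throughput_mb_per_s)   # 93.75 MB/s, well below the volume's 250 MiB/s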

Summary

Taking time to understand your database workload and differences between the two storage types, GP2 and IO1, can potentially reduce costs and improve the performance of your application.

One of the challenges for large multi-tenant systems that rely on many external services to complete a single request is ensuring that each dependency is configured correctly in production environments. In addition, if a system is geographically distributed, each instance may depend on different versions of external services.

Even with the best intentions, humans who have to configure and maintain these types of complex systems are known to be unreliable.

“One study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10-25% of outages.” (Martin Kleppmann, Designing Data-Intensive Applications)

Because of the many possible test scenarios, manual regression testing after each release is not practical, especially if your team follows a continuous delivery approach.

Monitoring of critical paths provides good feedback about already configured tenants and is an important tool when making production releases. However, some geographical regions may be outside of their peak hours and monitoring alone may not provide fast enough feedback about deployments to production. Also, when new tenants are brought online you still have to validate configuration and dependencies in production.

Traditionally, automated end-to-end tests have been reserved for pre-production environments; however, if your production environment relies on a multitude of external services to complete a single request, extending end-to-end tests to production can help ensure that all tenant dependencies are configured properly. The goal of these production tests is not to run a full regression suite, but to test only the critical paths.

Depending on how your end customers interact with your system, automated end-to-end tests can run against a public-facing web user interface (UI) using something like Selenium WebDriver, or directly against public API endpoints.
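
As a sketch of what a critical-path API check might look like (the endpoint, tenant header, and token below are hypothetical placeholders, not part of any real system), a test like this could be run with pytest against each tenant after a deploy:

    # Minimal critical-path check against a public API (hypothetical endpoint and tenant).
    import os

    import requests

    BASE_URL = "https://api.example.com"    # placeholder for the public API endpoint

    def test_tenant_can_fetch_account_summary():
        # A read-only call that exercises the tenant's external dependencies end to end.
        response = requests.get(
            f"{BASE_URL}/v1/accounts/summary",
            headers={
                "X-Tenant-Id": "tenant-eu-1",
                "Authorization": f"Bearer {os.environ['E2E_TEST_TOKEN']}",
            },
            timeout=10,
        )
        assert response.status_code == 200
        assert "balance" in response.json()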

Triggered from a Continuous Integration (CI) system after deployments to production, the test suite provides immediate detection of, and warning about, any underlying issues. By making it clear exactly what is failing, production end-to-end tests also save time for engineers who would otherwise receive bug reports through various channels and have to sift through logs to figure out what went wrong. Finally, because all critical paths are tested in production after deploys, the tests provide additional confidence about deployments that monitoring alone may not.

In addition to being triggered by CI, these tests can be run manually or automatically after configuration changes in production. The ability to run critical-path tests in production on demand also reduces the manual QA overhead needed to verify that everything is working as expected after a configuration change.

Unfortunately, running tests in production will create test data in production databases. I personally don’t like to delete data from production databases, even if it’s test data, because your database is a system of record, and it’s important for it to stay intact for troubleshooting. Therefore, you would need to keep track of test users/transactions and filter out test transactions from consumers of your data.
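
One simple convention is to tag test records at creation time and exclude them in every downstream consumer; the is_test_data flag and user prefix below are hypothetical examples of such a convention:

    # Hypothetical convention: production tests flag the records they create,
    # and every consumer of the data reads through a filter like this.
    TEST_USER_PREFIX = "e2e-test-"

    def is_test_transaction(transaction):
        return (
            bool(transaction.get("is_test_data"))
            or transaction.get("user_id", "").startswith(TEST_USER_PREFIX)
        )

    def real_transactions(transactions):
        # Used by reports, exports, and analytics so test traffic never skews them.
        return [t for t in transactions if not is_test_transaction(t)]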

Finally, test results will be fed into a monitoring system and displayed on the team’s dashboard. Alerts are also configured on this data.
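
For example, each test run could publish a pass/fail data point that the dashboard and alerts consume; here is a sketch using CloudWatch via boto3 (the namespace and dimension names are made up):

    # Publish end-to-end test results as a metric (hypothetical namespace and dimensions).
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def report_result(tenant, test_name, passed):
        cloudwatch.put_metric_data(
            Namespace="ProductionE2ETests",
            MetricData=[{
                "MetricName": "TestFailure",
                "Dimensions": [
                    {"Name": "Tenant", "Value": tenant},
                    {"Name": "Test", "Value": test_name},
                ],
                "Value": 0 if passed else 1,
                "Unit": "Count",
            }],
        )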

Putting it all together

When dealing with a lot of external dependencies and configuration permutations in a system, we need to think outside the box and engineer solutions that help us deal with the additional complexity. While it’s important to ship software bug-free, there are cases where it’s much more efficient to verify that the software is bug-free right after it’s deployed to production.

In the last few years, as my team grew in size, one of the problems that kept coming up during retrospective meetings was the poor turnaround on code reviews. With a smaller team there was no need for an additional process to find code review volunteers, since every engineer had to pitch in and review code daily. But with a larger team, not having a clear process or rules to follow was starting to affect the team’s performance. The issues were identified as follows:

  1. It was difficult to find a code review volunteer.
  2. After a volunteer was found, sometimes you would still need to follow up if the review was not getting attention.
  3. Some people would volunteer less than others.
  4. When comments or replies were posted on code reviews, they were not immediately visible, because GitHub’s email notifications often sat unread until you checked your inbox. To keep the review process moving along, you sometimes had to message the reviewer directly.

Initial attempts at creating additional process

After multiple discussions with the team, everyone agreed to introduce a simple rule: at the start of each day, every developer would spend 15 minutes on code reviews. This, in theory, would provide more than enough engineers to complete all outstanding reviews and keep review turnaround under 24 hours. A few months later, the results of the experiment were mixed. There were still long delays in getting code reviewed. Sometimes changes requested during code reviews had to be reviewed again, and if your reviewers were already done for the day, it would have to wait another day. Due to the slow feedback cycle, reviews could still take days to complete. Finally, some engineers were not contributing every day because they were busy with other work.

After another brainstorming session, the team identified that one of the issues behind the poor turnaround was that outstanding code reviews were not easily visible. Because we work with many GitHub repositories, it is not practical to go into each one to see which code reviews (pull requests) are outstanding. The proposed solution was to use the “pin” feature in Slack, our instant messenger, which adds a code review link to the team channel’s dashboard. When engineers finished reviewing, they would add a 👍 emoji to the pinned code review, flagging their approval. When two thumbs up appeared on a pinned request, the requestor would merge the pull request and unpin the item. This was not a complicated process to follow, but there was still confusion, and after another few months outstanding reviews started to linger on the team’s channel board.

From volunteering to assignment

In an attempt to uncover the underlying problem, one of the engineers extracted data on the number of reviews per person and noticed that reviews were not evenly distributed. Some people did a lot more than others. Asking for volunteers was not working very well and also created a fairness problem.

At this point, it was obvious that we needed to even out the distribution and prioritize assigning the engineers with the fewest reviews. We also thought about automating this process; it would not have been difficult to write a script that pulled review statistics and assigned the people with the fewest reviews. However, to save time we looked online to see if someone had already solved this problem, and we found Pull Reminders, a commercial application that did exactly what we needed, plus other useful features.
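
Such a script could have looked roughly like this sketch against the GitHub REST API (the organization, repositories, and team members are placeholders, and counting currently requested reviews is just one possible heuristic):

    # Sketch: count open review assignments per engineer and pick the two least loaded.
    import os
    from collections import Counter

    import requests

    GITHUB_API = "https://api.github.com"
    HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
    TEAM = ["alice", "bob", "carol", "dave"]              # placeholder logins
    REPOS = ["my-org/service-a", "my-org/service-b"]      # placeholder repositories

    def open_review_counts():
        counts = Counter({member: 0 for member in TEAM})
        for repo in REPOS:
            pulls = requests.get(f"{GITHUB_API}/repos/{repo}/pulls",
                                 params={"state": "open"}, headers=HEADERS).json()
            for pull in pulls:
                for reviewer in pull.get("requested_reviewers", []):
                    if reviewer["login"] in counts:
                        counts[reviewer["login"]] += 1
        return counts

    def pick_reviewers(author, n=2):
        counts = open_review_counts()
        counts.pop(author, None)   # never assign authors to their own pull request
        return [login for login, _ in sorted(counts.items(), key=lambda kv: kv[1])[:n]]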

Initially, when we decided to give Pull Reminders a try, we weren’t confident that it would solve the problem. However, after everyone was on board, we were surprised to find that the issue with code reviews did not come up again during retrospective meetings. We changed our process from volunteering to assignment, based on the leaderboard in the Pull Reminders app: when you need a review, rather than asking or posting in the channel, you assign the two people with the fewest reviews on the leaderboard. Pull Reminders takes care of notifying and reminding people about outstanding code reviews. The app also improved our communication, because it sends a personal Slack message when a comment or reply is posted on your code review. This tremendously improved response and turnaround times.

Summary

It’s been a year since we started using Pull Reminders, and I haven’t noticed any confusion or disconnect about code review responsibilities. The majority of reviews are completed within a day or two, and we can finally call the problem that caused so much discussion and inefficiency resolved. Most importantly, the new system removed the additional rules that everyone had to remember to follow. Now the system assigns reviews and notifies engineers when they need to review code, and it’s hard to ignore.

Coming up with new rules for everyone to follow is easy but sometimes ineffective. A much better approach is to create a system that makes new rules hard to ignore.