
Sergey Maskalik's blog

Focusing on improving a tiny bit every day

One of the challenges for large multi-tenant systems that rely on many external services to complete a single request is ensuring that each dependency is configured correctly in production environments. In addition, if a system is geographically distributed, each instance may depend on different versions of external services.

Even with the best intentions, humans who have to configure and maintain these types of complex systems are known to be unreliable.

One study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages. — Designing Data-Intensive Applications, Martin Kleppmann

Due to many possible test scenarios, manual regression testing after each release is not practical, especially if your team is following a continuous delivery approach.

Monitoring of critical paths provides good feedback about already configured tenants and is an important tool when making production releases. However, some geographical regions may be outside of their peak hours and monitoring alone may not provide fast enough feedback about deployments to production. Also, when new tenants are brought online you still have to validate configuration and dependencies in production.

Traditionally, automated end-to-end tests were reserved for pre-production environments; however, if your production environments rely on a multitude of external services to complete a single request, extending end-to-end tests to production can help ensure that all tenant dependencies are configured properly. The goal of these production tests is not to run a full regression suite, but to test only the critical paths.

Depending on how your end customers interact with your system, automated end-to-end tests can run against a public-facing web user interface (UI) using something like Selenium WebDriver, or directly against public API endpoints.

Triggered from a Continuous Integration (CI) system after deployments to production, a test suite provides immediate detection of any underlying issues. By showing clearly what is failing, production end-to-end tests also save time for engineers who might otherwise receive bug reports through different channels and have to sift through logs to figure out what went wrong. Finally, because all critical paths are tested in production after each deploy, the tests provide additional confidence that monitoring alone cannot.

In addition to being run by CI, these tests can be triggered manually or automatically after configuration changes to production. The ability to run critical-path tests in production on demand also reduces the manual QA overhead needed to verify that everything is working as expected after a configuration change.
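To make this concrete, here is a minimal sketch of what such a critical-path suite could look like. The client interface, tenant name, and health endpoints are all hypothetical; a real suite would hit your actual public API or UI.

```python
# Sketch of a post-deploy critical-path suite. The endpoints and check
# names are hypothetical; CI would run this against the production base
# URL after each deploy and fail the build on any failed check.

def run_critical_path_checks(client, tenant):
    """Run each critical-path check for a tenant; return {name: passed}."""
    checks = {
        "login": lambda: client.get(f"/api/{tenant}/health/login") == 200,
        "checkout": lambda: client.get(f"/api/{tenant}/health/checkout") == 200,
    }
    results = {}
    for name, check in checks.items():
        try:
            results[name] = check()
        except Exception:
            # A dependency outage or misconfiguration also counts as a failure.
            results[name] = False
    return results
```

The resulting dictionary of pass/fail per check is what gets reported back to CI, so a single misconfigured tenant dependency fails the build with a clear name attached.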

Unfortunately, running tests in production creates test data in production databases. I personally don’t like to delete data from production databases, even if it’s test data, because the database is a system of record, and it’s important for it to stay intact for troubleshooting. Therefore, you need to keep track of test users and transactions, and filter test transactions out for consumers of your data.
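One simple way to do that is a shared naming convention for test accounts, so every consumer can recognize and skip them. The prefix below is an assumption for illustration, not an established standard:

```python
# Test accounts created by production end-to-end tests carry a reserved
# prefix (the exact convention here is hypothetical). Data is never
# deleted; downstream consumers filter it out instead.

TEST_USER_PREFIX = "e2e-test-"

def is_test_transaction(txn):
    """True if this transaction was created by a production end-to-end test."""
    return txn.get("user_id", "").startswith(TEST_USER_PREFIX)

def real_transactions(txns):
    """Filter applied by downstream consumers (reports, billing, analytics)."""
    return [t for t in txns if not is_test_transaction(t)]
```

The key point is that the filter lives with the consumers, while the system of record keeps every row, test or not, for troubleshooting.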

Finally, test results are fed into a monitoring system and displayed on the team’s dashboard, with alerts configured on this data.

Putting it all together

When dealing with many external dependencies and configuration permutations in a system, we need to think outside the box and engineer solutions that help us deal with the additional complexity. While it’s important to ship bug-free software, there are cases where it’s much more efficient to verify that software works correctly right after it’s deployed to production.

In the last few years, as my team grew in size, one of the problems that kept coming up during retrospective meetings was the poor turnaround on code reviews. With a smaller team, there was no need for an additional process to find code review volunteers, since every engineer had to pitch in and review code daily. But with a larger team, not having a clear process or rules to follow was starting to affect the team’s performance. The issues were identified as follows:

  1. It was difficult to find a code review volunteer.
  2. After a volunteer was found, sometimes you would still need to follow up if the review was not getting attention.
  3. Some people would volunteer less than others.
  4. When comments or replies were posted on code reviews, they were not immediately visible, because GitHub’s email notifications had to wait until you checked your email. To keep the review process moving, sometimes you had to message the reviewer directly.

Initial attempts at creating additional process

After multiple discussions with the team, everyone agreed to introduce a simple rule: at the start of the day, each developer would spend 15 minutes on code reviews. In theory, this would provide more than enough engineers to complete all outstanding reviews and keep review turnaround under 24 hours. A few months later, the results of the experiment were mixed. There was still a lot of delay in getting code reviewed. Sometimes changes requested during a review had to be reviewed again, and if your reviewers were already done for the day, that would have to wait another day. Due to the slow feedback cycle, reviews could still take days to complete. Finally, some engineers were not contributing every day because they were busy with other work.

After another brainstorming session, the team identified that one cause of the poor turnaround was that outstanding code reviews were not easily visible. Because we work with many GitHub repositories, it is not practical to go into each one to see which code reviews (pull requests) are outstanding. The proposed solution was to use the “pin” feature in Slack, our instant messenger, which adds a code review link to the team channel’s dashboard. When engineers finished reviewing, they would add a 👍 emoji to the pinned code review, flagging it as done. When two thumbs up appeared on a pinned request, the requestor would merge the code review and unpin the item. This was not a complicated process to follow, but there was still confusion, and after another few months, outstanding reviews started to linger on the team’s channel board.

From volunteering to assignment

In an attempt to uncover the underlying problem, one of the engineers pulled data on the number of reviews per person and noticed that reviews were not evenly distributed: some people did a lot more than others. Asking for volunteers was not working very well and also created a fairness problem.

At this point, it was obvious that we needed to even out the distribution and prioritize assigning the engineers with the fewest reviews. We also thought about automating this process; it would not have been difficult to write a script that pulled review statistics and assigned the people with the fewest reviews. However, to save time, we looked online to see if someone had already solved this problem, and we found Pull Reminders, a commercial application that did exactly what we needed, plus other useful features.
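The script we had in mind would have been something along these lines. This is only a sketch: in a real version, the review counts would be pulled from the GitHub API rather than hard-coded.

```python
# Assign reviews to the engineers with the fewest completed reviews.
# review_counts would come from the GitHub API in a real script; the
# names and numbers below are made up for illustration.

def pick_reviewers(review_counts, n=2):
    """Return the n engineers with the fewest completed reviews."""
    return sorted(review_counts, key=review_counts.get)[:n]

counts = {"alice": 12, "bob": 4, "carol": 7}
print(pick_reviewers(counts))  # the two least-loaded reviewers
```

Sorting by count and taking the first two is all the “leaderboard” logic there is; the hard part Pull Reminders added on top was the notifications and reminders.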

Initially, when we decided to give Pull Reminders a try, we weren’t confident that it would solve the problem. However, after everyone was on board, we were surprised to find that the issue with code reviews did not come up again during retrospective meetings. We changed our process from volunteering to assignment, based on the leaderboard in the Pull Reminders app. When you need a review, rather than asking or posting in the channel, you assign the two people with the fewest reviews on the leaderboard. Pull Reminders takes care of notifying and reminding people about outstanding code reviews. The app also improved our communication, because it sent personal Slack messages when a comment or reply was posted on your code review. This tremendously improved response and turnaround times.

Summary

It’s been a year since we started using Pull Reminders, and I haven’t noticed any confusion or disconnect about code review responsibilities. The majority of reviews are done within a day or two, and we can finally call the problem that caused so many discussions and inefficiencies resolved. Most importantly, the new system removed the additional rules that everyone had to remember to follow. Now the system notifies engineers when they need to review code, and it’s hard to ignore.

Coming up with new rules for everyone to follow is easy but sometimes ineffective. A much better approach is to create a system that makes new rules hard to ignore.

After integrating in-app payments on Apple’s platform and processing tens of thousands of purchase receipts through the back end, I’ve learned a few valuable lessons about issues that only surface in production. Some of these gotchas are either undocumented, hard to find in the documentation, or just strange behavior that only presents itself in a live environment. This is the guide I wish I’d had while integrating; it would have saved me a few gray hairs.

Empty in_app transactions array in successful purchase receipts

A small percentage of successful purchase receipts, when verified on the back end by calling Apple’s API, do not contain any in-app transactions: in the response received from the API, the in_app array field is empty (in_app: []). The latest advice I found on this issue came from an Apple technical support engineer in this forum post. He clarified that when a receipt does not contain any transactions, it’s not a valid receipt, and you should never call finishTransaction for these types of receipts. There is also an older solution to this problem that I believe is no longer valid: asking users to refresh the purchase receipt. As another highly rated user in that forum post pointed out, the purchase receipt is refreshed before the updatedTransactions callback, so the receipt should already contain updated information, and refreshing it is unnecessary.
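On the server side, the guard can be as simple as refusing to treat a receipt as valid unless it actually contains transactions. The field names below follow the shape of Apple’s verifyReceipt JSON response; the helper itself is my own sketch:

```python
# Guard against the empty in_app case in Apple's verifyReceipt response.
# Only receipts that actually contain transactions are treated as valid
# purchases; everything else is rejected.

def has_in_app_transactions(verify_response):
    """True only if the verifyReceipt response contains in-app transactions."""
    receipt = verify_response.get("receipt", {})
    return bool(receipt.get("in_app"))
```

An otherwise-successful response (status 0) with in_app: [] is then rejected, and the client never calls finishTransaction for it.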

Strange transaction id formats in purchase receipts

Normally, transaction IDs are 15 digits, like 400000123456789. But occasionally our server receives a receipt with the transaction ID in a UUID format: FAB60FFD-906D-48CB-8FED-092C4B2707D6

Receipts with these strange ID formats, when verified with Apple’s back end, also come back with an empty in_app array. There is no definitive answer on stackoverflow.com or Apple’s developer forums, but it looks like it could be a hack. The best course of action is to not call finishTransaction on a receipt that has no transactions and is not valid.
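If you want to flag these receipts before calling Apple at all, a format check on the transaction ID is easy to add. The two patterns below just encode the shapes we observed; they are a heuristic, not anything documented by Apple:

```python
import re

# Heuristic check for the two observed transaction ID shapes. The
# 15-digit form is what Apple normally issues; the UUID form has only
# shown up on receipts that also fail verification.

NORMAL_ID = re.compile(r"^\d{15}$")
UUID_ID = re.compile(
    r"^[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}$", re.I
)

def looks_like_normal_transaction_id(txn_id):
    """True for the usual 15-digit Apple transaction ID format."""
    return bool(NORMAL_ID.match(txn_id))
```

A UUID-shaped ID is worth logging either way, since it correlates with receipts that fail verification.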

Nil SKPayment.applicationUsername for some transactions in production environment

Passing a nil value where your code expects a real object is a guaranteed way to crash your app, and it will make your coworkers give you that look, like you don’t know what you’re doing.

The applicationUsername property of the SKPayment object is one of those fields that may have a value while you are developing and even during QA. However, for a small percentage of users, it will be nil in production. And if you didn’t pay careful attention to the documentation, you are guaranteed to create a really bad experience for those users. In the documentation, the description of this property contains an important section that calls out:

The applicationUsername property is not guaranteed to persist between when you add the payment transaction to the queue and when the queue updates the transaction. Do not attempt to use this property for purposes other than providing fraud detection.

What this is also trying to tell you is that sometimes this property will be nil, and if your code assumes it has a value, you can crash your user’s app. If you are not careful, after a user makes a payment, their app will crash, and it will continue crashing on every restart, because your application will try to reprocess the payment transaction on every startup. This behavior is practically impossible to catch in testing.

Being careful with nil fields on callbacks

For purchase callbacks:

The following fields are only available when transactionState is SKPaymentTransactionState.purchased or SKPaymentTransactionState.restored.

  • SKPaymentTransaction.transactionDate
  • SKPaymentTransaction.transactionIdentifier

For product callbacks:

If Apple rejects your app and in-app purchase products during the approval process, product data for the rejected items retrieved via SKProductsRequest will contain some invalid information. Unfortunately, I wasn’t able to pinpoint the problem because it was happening only in production, and we didn’t have debug symbols uploaded at the time. So, to be on the safe side, I opted to check every SKProduct attribute for nil before reading a value or calling a method on it. This is more of a brute-force solution, but at least your products callback won’t crash your app after Apple rejects your in-app products.

Keep a strong reference to SKProductsRequest

Be sure not to miss the “Note” section of the SKProductsRequest documentation about keeping a strong reference to the SKProductsRequest object.

If your app can issue multiple requests to retrieve products, I would add the SKProductsRequest references to an NSMutableSet and remove them after the requestDidFinish callback fires.

Here is an example of initiating a products info request.

- (void)getProducts:(NSSet*)productIDs {
  NSLog(@"Requesting %lu products", (unsigned long)[productIDs count]);

  SKProductsRequest* productsRequest = [[SKProductsRequest alloc] initWithProductIdentifiers:productIDs];
  productsRequest.delegate = self;

  @synchronized(_productsRequests) {
    [_productsRequests addObject: productsRequest];
  }
  [productsRequest start];
}

Example interface:

@interface IapAppleDelegate : NSObject <SKProductsRequestDelegate>
{
  // Initialize in init, e.g. _productsRequests = [NSMutableSet new];
  NSMutableSet *_productsRequests;
}
@end

Removing references after the request has completed:

- (void)requestDidFinish:(SKRequest *)request
{
  @synchronized(_productsRequests) {
    [_productsRequests removeObject:request];
  }
}

Summary

In-app purchase integration with Apple may not seem super complex, but the devil is in the details, and you want to be very thorough with the documentation. It pays dividends to read about every property you use data from and to follow the advice in the “Notes” and “Important” sections of the documentation. Don’t rely on examples found on the web or GitHub alone, as they might be incomplete and could get you in trouble.

Other things, like the empty in_app array or strange transaction ID formats, aren’t mentioned in the documentation and are left for developers to discover when the app goes live. Hopefully this post has been helpful in showing you how to deal with these undocumented scenarios.