Learnings from building High Availability (HA) Services

This blog post was cross-posted from DeltaX Engineering Blog – {recursion} where it was published first.

At DeltaX, we have been dabbling with Internet Scale and High Availability for our core tracking and ad-serving services. We have had our fair share of battles, wounds, victories and a host of untold stories. Today, I shall share some learnings and keep the stories for another day.

When designing architecture for mission-critical systems, the two most commonly discussed aspects are scalability and availability. More often than not, the two terms are used interchangeably. Scalability is about being able to handle increasing load, while availability is about keeping the system operational by decreasing downtime. Designing Highly Available systems means focusing on the qualitative measures that reduce downtime and eliminate single points of failure (SPOFs). Here are some learnings and thoughts on things to consider while architecting an HA system.

1. Accept Failure

This is contrary to what we set out to achieve, but as with all things that start in the head, you first have to get the monkey out of your head. Suppose someone comes up to you and informs you that you have to build a system which has zero downtime and runs at 99.999% uptime (also called five 9s, which is a gold standard). Our first reaction would be to ensure we code in such a way that the system will never fail: handle all exceptions, scale to handle increasing load, and hence never have downtime. Instead, pause for a second and first accept failure. Accepting failure doesn’t mean that you are building for failure; it means you accept that irrespective of what you do, the system can still fail, and so you have to consider, reconsider and plan your system around being able to fail and still keep running.
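To put a number on what five 9s leaves you to work with, here is a small worked calculation. It assumes a 365-day year; the function name is just for illustration.

```python
# Downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes of downtime per year that the target still allows."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    budget = downtime_minutes_per_year(target)
    print(f"{target}% uptime allows about {budget:.1f} minutes of downtime per year")
```

At 99.999% the budget works out to a little over five minutes for the whole year, which is why planning for failure matters more than hoping to avoid it.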

The next two learnings talk more about how to fail – like a gentleman.

2. Redundancy, Failover and Recovery (avoid SPOF)

Building redundancy is about ensuring that there are alternate paths in the system to keep functioning (albeit at lower capacity), while failover is switching to the alternate path. The switchover ideally has to be automatic so that no manual intervention is needed. Once we have a system which fails over, it’s very important to have a recovery plan to resurrect the failed path; otherwise there is a high chance that the failure will result in additional load on the remaining path and may cause congestion or subsequent failures (snowball effect). The recovery may be automatic or even manual.

Let’s take a classic example of a web server to understand redundancy and failover.

Schematic user -> web server setup

Now let’s add a load balancer in between and have two servers responding to requests. The load balancer ensures that only a ‘healthy’ server receives requests; as soon as it detects that one of them is ‘unhealthy’, it redirects the requests to the other one.

Schematic user -> web server setup

Although, this ensures that we have redundancy and also automatic failover – the load balancer in itself is now a SPOF. So, let’s try an alternate setup where we have two load balancers and two servers.

Schematic user -> web server setup

This is a simplistic schematic setup; production systems are more complex and have more moving parts. While we ensure automatic failovers, it’s really important to be able to recover from failure. A simple example: once the load balancer detects a web server to be ‘unhealthy’, it’s important to ensure that we are able to recover automatically by swapping out the web server with a healthy one.
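The failover and recovery behaviour described in this section can be sketched in a few lines. This is a toy model under my own assumptions (the class names, round-robin routing and the replacement callback are all hypothetical), not how ELB or any particular load balancer is implemented.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    healthy: bool = True


class LoadBalancer:
    def __init__(self, backends):
        self.backends = list(backends)
        self._next = 0

    def pick(self) -> Backend:
        """Round-robin over the backends, skipping unhealthy ones (failover)."""
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if backend.healthy:
                return backend
        raise RuntimeError("no healthy backend available")

    def recover(self, replace) -> None:
        """Swap every unhealthy backend for a fresh one (recovery)."""
        self.backends = [b if b.healthy else replace(b) for b in self.backends]


lb = LoadBalancer([Backend("web-1"), Backend("web-2")])
lb.backends[0].healthy = False      # simulate a failed health check
print(lb.pick().name)               # traffic fails over to web-2
lb.recover(lambda dead: Backend(dead.name + "-replacement"))
print([b.name for b in lb.backends])
```

Without the recover step the system keeps running on one server, which is exactly the reduced-capacity state that invites the snowball effect.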

3. Performance Monitoring & Alerts

You can’t improve what you can’t measure. Also, for any HA system monitoring and alerts can’t be an afterthought. Monitoring is ensuring you are measuring health and performance indicators while alerts are ensuring you get timely and actionable information about the system.
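As a rough illustration of the difference between measuring and alerting, the sketch below compares measured indicators against thresholds and only produces an alert when there is something to act on. The metric names, limits and suggested actions are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str
    limit: float
    action: str  # what the person receiving the alert should do


def evaluate(samples: dict, thresholds: list) -> list:
    """Return one alert message for every metric that is above its limit."""
    alerts = []
    for threshold in thresholds:
        value = samples.get(threshold.metric)
        if value is not None and value > threshold.limit:
            alerts.append(
                f"{threshold.metric}={value} exceeds {threshold.limit}: {threshold.action}"
            )
    return alerts


thresholds = [
    Threshold("p99_latency_ms", 250, "check upstream dependencies"),
    Threshold("error_rate_percent", 1, "roll back the last deploy"),
]
alerts = evaluate({"p99_latency_ms": 310, "error_rate_percent": 0.2}, thresholds)
print(alerts)
```

Each alert carries the measured value and a suggested action, which is what makes it timely and actionable rather than just noise.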

Bonus Tip: To see if your system can handle failure, fail over, recover and still send you alerts while chaos hits the roof, you can simply log into any of your servers and power it off! Think this is a joke? Netflix actually built a tool called Chaos Monkey to do exactly this. Chaos Monkey is a service which identifies groups of systems and randomly terminates one of the systems in a group.
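The core idea is small enough to sketch. This is not Netflix's code; the group names are invented, and skipping groups with a single member is my own safety assumption so that the example never takes out the only instance of something.

```python
import random


def pick_victims(groups: dict, rng: random.Random) -> dict:
    """Choose one instance to terminate from every group that has a spare."""
    return {
        name: rng.choice(members)
        for name, members in groups.items()
        if len(members) > 1  # assumption: never kill the only member of a group
    }


groups = {"web": ["web-1", "web-2", "web-3"], "db": ["db-1"]}
victims = pick_victims(groups, random.Random())
print(victims)
```

A group that gets skipped here is also a handy list of SPOFs you still have to deal with.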

Architecture for DeltaX HA services
DeltaX Architecture for HA services

We leverage the AWS Cloud to the fullest – right from Route53, ELB and EC2 auto-scaling to S3 for the persistent store. I must note here that adopting the cloud doesn’t by itself mean that you are set for HA, but it definitely makes your job easier with a suite of services and health monitoring systems.

For redundancy, we use multiple EC2 instances under an Elastic Load Balancer (ELB). In each of the instances, we have multiple workers running using the Node.js cluster module. Failover and recovery are handled at multiple levels. At the worker level, we use the cluster module to instantiate a new worker if one dies, and monit monitors the server process within each instance with a trigger to restart it if needed. At the instance level, ELB health checks route traffic between the instances and also ensure that auto-scaling requirements are met. Monitoring & alerts are handled through Amazon CloudWatch and Amazon SNS.
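Our production code uses the Node.js cluster module for this, but the worker-level recovery idea is language-neutral. Below is a minimal Python sketch of a supervisor that respawns workers when they exit. The function name, the deliberately crashing worker and the restart cap (there only so the example terminates) are assumptions for illustration.

```python
import subprocess
import sys
import time


def supervise(worker_cmd, worker_count, max_restarts, poll_interval=0.05):
    """Keep worker_count processes running, respawning any that exit.

    Returns the number of restarts performed. A real supervisor would loop
    forever; max_restarts bounds the loop so that this sketch finishes.
    """
    workers = [subprocess.Popen(worker_cmd) for _ in range(worker_count)]
    restarts = 0
    while restarts < max_restarts:
        for index, proc in enumerate(workers):
            # poll() returns None while the process is alive.
            if proc.poll() is not None and restarts < max_restarts:
                workers[index] = subprocess.Popen(worker_cmd)
                restarts += 1
        time.sleep(poll_interval)
    for proc in workers:
        proc.wait()  # reap the remaining workers before returning
    return restarts


# A worker that dies immediately, to exercise the respawn path.
crashing_worker = [sys.executable, "-c", "raise SystemExit(1)"]
print(supervise(crashing_worker, worker_count=2, max_restarts=3))
```

monit plays the same role one level up: it watches the supervisor process itself and restarts it if needed.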

Overall, we still have some areas in the architecture to improve upon and further eliminate SPOFs. As with any serious HA architecture, you can’t take anything for granted; if you do, the Chaos Monkey may strike.

Is the Future of Application Architecture – Serverless?

This blog post was cross-posted from DeltaX Engineering Blog – {recursion} where it was published first.

Advancements by cloud-based IAAS providers (Amazon Web Services, Google Cloud and Azure) have made on-demand scale and flexibility a reality. Today, as a startup you don’t need to worry about over-provisioning infrastructure, forecasting growth or signing long-term infrastructure contracts to meet your demands. Interestingly, a new suite of cloud services is questioning the very existence of a core aspect of common application architectures – the ‘server’ – and has been dubbed serverless.

What is the ‘server’ in serverless?

Let’s say you wanted to run a service on the cloud – for this, you would need to do the following:

  • Decide the type of computing resources you need. Instance type, cores, memory and storage space.
  • Choose an OS / Machine image to install on the instance
  • Set up / deploy your service

Steps 1 & 2 above constitute the ‘server’ in the serverless paradigm and in effect, these are the steps you wouldn’t have to worry about. All you need to do is to choose your execution environment and submit your code.
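To give a feel for how little is left once the server is out of the picture, here is roughly what the submitted code looks like for a Python function on AWS Lambda, where the platform calls a handler with an event and a context. The greeting logic and the "name" field are placeholders I made up.

```python
import json


def handler(event, context):
    """Entry point that the platform calls once per invocation."""
    name = event.get("name", "world")  # "name" is a hypothetical input field
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


# Local smoke test; on the platform the runtime supplies event and context.
print(handler({"name": "DeltaX"}, None))
```

There is no instance type, OS image or web server anywhere in that file, which is the whole point.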

Available Options

When it comes to the serverless paradigm, each of the major cloud IAAS providers has launched its own option. Here is a quick summary of the options available:

IAAS                | Serverless Paradigm | Supported Environments                  | Production-ready
Amazon Web Services | AWS Lambda          | Node.js, Java, Python, C# (.NET Core)   |
Microsoft Azure     | Azure Functions     | Node.js, C#, F#, Python, PHP, and shell |
Google Cloud        | Cloud Functions     | Node.js                                 |

Ref: Click here for a detailed comparison on Stack Overflow

There are slight differences in the extent of support and capabilities but the process to initiate works as follows:

  • Select a development environment
  • Choose the amount of memory, execution timeout etc.
  • Set up a trigger for launch

Proof of Concept

In part, to test drive the paradigm and at the same time build something useful, I worked on two POCs.

Azure Function: Cachewarmer Function

When it comes to our web application, we use Entity Framework as the ORM. Considering the multi-tenant nature of the application and the volume of tables – context initialization takes an unexpectedly long time. It’s for this exact reason we had to build a mechanism to warm the context cache to initialize it and keep it ready for external requests.

Trigger: CRON

Dev Environment: shell

Description: I cooked together a sequence of cURL requests to make pings to a special endpoint on the web application which initiates a context load. Considering we have over 500 tenants we had to batch a series of requests and to avoid hitting the max execution time I had to split this into two separate functions.
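The actual function was a shell script of cURL calls; the batching and splitting idea behind it looks roughly like the Python sketch below. The tenant names, the batch size of 50 and the ping callback (a stand-in for the HTTP request to the warm-up endpoint) are all assumptions for illustration.

```python
from typing import Callable, Iterator, List


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def warm(tenants: List[str], ping: Callable[[str], bool], batch_size: int = 50) -> List[str]:
    """Ping every tenant batch by batch and return the tenants whose ping failed."""
    failed = []
    for batch in batches(tenants, batch_size):
        failed.extend(tenant for tenant in batch if not ping(tenant))
    return failed


tenants = [f"tenant-{n}" for n in range(1, 501)]
# Split the work in two so each function stays under the max execution time.
first_half, second_half = tenants[:250], tenants[250:]
failed = warm(first_half, ping=lambda tenant: True)  # stand-in for the real HTTP call
print(len(list(batches(first_half, 50))), failed)
```

Each half would then be deployed as its own function on the same CRON trigger.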

Honestly, this was really a trivial function, but it is exactly why having a serverless architecture was justified. Not to forget, we were up and running within 20 mins.

AWS Lambda: Slackbot dxdb

This was, in retrospect, a solid use case. Let me take a deep dive into this one:

Purpose: As noted earlier, we have over 500 tenant databases. Querying them is pretty cumbersome: you have to connect to each one individually using SSMS and then run individual queries. For small queries to check data, it is far more useful to simply fire the query in the Slack channel and see the results. An unexpected benefit of using Slack is that one can also fire the query from the Slack mobile application and see the results on the go.

Features Supported:

  • Detect the DB to connect with intelligently from the schema
  • Support delayed response. Some queries can take longer to execute, while Slack allows a window of only 3 seconds for an immediate response.
  • Formatting output to the extent possible
  • Minimal error notifications

How it works? Slack command dxdb

  • Every invocation of the command makes a POST request to the AWS API Gateway with the command and the request text; in our case the query.
  • The AWS API Gateway invokes the AWS Lambda function dxdbExecuteSQL and passes the request params. Tip: The AWS API Gateway is probably the most underrated yet one of the most powerful and flexible services AWS has launched. Will explore this in the future.
  • The dxdbExecuteSQL function authenticates the request, does minimal checks on the kind of query (in our case, only read-only queries are allowed) and then does two things. First, it formats the intermediate response in the form of an MSSQL prompt to be sent back to Slack through the API Gateway. Next, it invokes the dxdbDelayedSlackResponse Lambda function.
  • The dxdbDelayedSlackResponse Lambda function parses the query, identifies the tenant, fires the query, reads the results, formats the response and makes a POST request back to Slack.
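The two-step flow above can be sketched as follows. This is not the project's code (that is Node.js and lives in the repository); it is a simplified Python model in which the database call, the Lambda invocation and the HTTP POST are passed in as plain functions. The prompt format and the naive SELECT-only check are my own assumptions, and a real read-only guard needs to be much stricter.

```python
import json

READ_ONLY_PREFIXES = ("select",)  # assumption: a deliberately naive read-only check


def execute_sql(payload, expected_token, invoke_delayed):
    """First function: validate, hand off the slow work, answer within 3 seconds."""
    if payload.get("token") != expected_token:
        return {"text": "Unauthorized request."}
    query = payload.get("text", "").strip()
    if not query.lower().startswith(READ_ONLY_PREFIXES):
        return {"text": "Only read-only queries are allowed."}
    invoke_delayed({"query": query, "response_url": payload["response_url"]})
    return {"text": f"1> {query}\n2> GO"}  # immediate, prompt-style acknowledgement


def delayed_response(job, run_query, post):
    """Second function: run the query and POST the formatted rows back to Slack."""
    rows = run_query(job["query"])
    body = "\n".join(" | ".join(str(cell) for cell in row) for row in rows)
    post(job["response_url"], json.dumps({"text": body}))


# Wire the two together with in-memory stand-ins for Lambda, the DB and HTTP.
queue, sent = [], []
ack = execute_sql(
    {"token": "secret", "text": "SELECT 1", "response_url": "https://example.invalid/hook"},
    expected_token="secret",
    invoke_delayed=queue.append,
)
delayed_response(queue[0], run_query=lambda q: [(1,)], post=lambda url, body: sent.append((url, body)))
print(ack["text"])
print(sent)
```

The split matters because the acknowledgement has to go out inside Slack's window, while the query itself is free to take as long as it needs.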

Although the setup is complex and layered, I only had to focus on the workflow and the business logic; the effort of picking an instance, setting it up and keeping it running was not something I had to worry about. Another interesting thing about this setup is that the function is not running all the time; it is only executed on invocation. The icing on the cake is that you are only billed for the time it executes, in increments of 100ms.
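As a quick worked example of what billing in 100ms increments means, the snippet below rounds an execution time up to the next increment. It only models the rounding described above; actual pricing depends on the provider and the memory you configure, so no rates are assumed here.

```python
import math


def billed_ms(duration_ms: float, increment_ms: int = 100) -> int:
    """Round an execution time up to the next billing increment."""
    return math.ceil(duration_ms / increment_ms) * increment_ms


print(billed_ms(230))  # a 230ms run is billed as 300ms
print(billed_ms(100))  # an exact multiple is billed as-is
```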

Code: Project is available on GitHub.

Follow-up Thoughts

Going serverless is an extension of adopting the cloud but demands a change in the thought process of layering your architecture. The recent trend around microservices-based architecture also fits well with the serverless paradigm.

Interestingly, each of the cloud services offers a minimal code editor. I can see how in the future you could probably have a full-fledged IDE available at your disposal. Looking at the pace of innovation, we are another step closer to not just programming for the cloud but literally in the cloud.

Video Transcoding on the AWS Cloud

This blog post was cross-posted from DeltaX Engineering Blog – {recursion} where it was published first.

Video ad-serving is a complex beast given the sheer expressiveness of the medium and the unpredictable client-side bandwidth that’s required. At DeltaX, our ad-server is now also YouTube certified for VAST in-stream ads. VAST (Video Ad Serving Template) is an XML-based standard defined by the IAB for video ad-serving. In the case of video ad-serving, the ad-server responds with multiple video assets in different formats, resolutions, and bitrates, while the VAST-compatible video player picks the most appropriate video asset based on the host platform, bandwidth, and other client considerations. For this to work as expected, it’s important to transcode the media file provided by an advertiser to different formats, resolutions, quality levels and specs beforehand.
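To make the player's side of this concrete, here is a rough sketch of choosing a rendition from the media files in a VAST response. The XML is a trimmed, made-up fragment (the URLs and bitrates are placeholders), and the selection rule, highest bitrate that fits the available bandwidth, is just one plausible policy; real players weigh more factors.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of a VAST response; bitrates are in kbps.
VAST_FRAGMENT = """
<MediaFiles>
  <MediaFile delivery="progressive" type="video/mp4" bitrate="400" width="640" height="360">https://cdn.example.invalid/ad_360p.mp4</MediaFile>
  <MediaFile delivery="progressive" type="video/mp4" bitrate="1000" width="854" height="480">https://cdn.example.invalid/ad_480p.mp4</MediaFile>
  <MediaFile delivery="progressive" type="video/mp4" bitrate="2500" width="1280" height="720">https://cdn.example.invalid/ad_720p.mp4</MediaFile>
  <MediaFile delivery="progressive" type="video/webm" bitrate="1000" width="854" height="480">https://cdn.example.invalid/ad_480p.webm</MediaFile>
</MediaFiles>
"""


def pick_media_file(xml_text: str, bandwidth_kbps: int, mime_type: str = "video/mp4") -> str:
    """Return the URL of the highest-bitrate rendition the bandwidth can carry."""
    candidates = [
        (int(media.get("bitrate")), media.text.strip())
        for media in ET.fromstring(xml_text.strip()).iter("MediaFile")
        if media.get("type") == mime_type
    ]
    if not candidates:
        raise ValueError(f"no media file of type {mime_type}")
    affordable = [c for c in candidates if c[0] <= bandwidth_kbps]
    # Fall back to the smallest rendition when even that exceeds the bandwidth.
    return max(affordable)[1] if affordable else min(candidates)[1]


print(pick_media_file(VAST_FRAGMENT, bandwidth_kbps=1500))
```

The player can only choose among the renditions the ad-server offers, which is why the transcoding has to happen beforehand.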

Setting up an Elastic Transcoder Pipeline on AWS

Transcoding is the process of converting a media file from one format, resolution, quality and set of specs to another. In the past, a transcoding pipeline would require a lot of heavy lifting on the software and hardware front. Today, using the cloud, you can set up a transcoding pipeline in a matter of minutes. Considering we use Amazon Web Services to host and scale our ad-server, the Amazon Elastic Transcoder was a great fit. As expected, it also plays well with Amazon S3 and Amazon CloudFront.

Here is how we set up the video transcoding pipeline for a VAST ad-server:

1. Create Custom Presets
Custom Presets

Here you can start with a pre-existing preset. Amazon Elastic Transcoder provides comprehensive options to specify the codec, bit rate, number of key frames, sizing policy and aspect ratio.

VAST Presets

At DeltaX, we have fine-tuned our presets to serve all platforms optimally.

2. Set up a Pipeline
Transcoding Pipeline

A pipeline acts as a queue for various transcoding jobs. It also lets you configure the Amazon S3 source and output buckets.

3. Set up a Job
Transcoding Job

Here is where you specify the input source file and choose one or many output presets (configured in step 1) to generate transcoded output files.
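We set the job up through the console, but the same job can be described in code. The sketch below only builds a request shaped like Elastic Transcoder's CreateJob call (pipeline, one input, one output per preset); the pipeline ID, preset IDs, file names and the output naming scheme are placeholders, not our real configuration.

```python
def build_job_request(pipeline_id, input_key, presets, output_prefix):
    """Build keyword arguments shaped like an Elastic Transcoder CreateJob request."""
    stem = input_key.rsplit(".", 1)[0]  # file name without its extension
    return {
        "PipelineId": pipeline_id,
        "Input": {"Key": input_key},
        "OutputKeyPrefix": output_prefix,
        "Outputs": [
            {"Key": f"{stem}_{label}.mp4", "PresetId": preset_id}
            for label, preset_id in presets.items()
        ],
    }


request = build_job_request(
    pipeline_id="PIPELINE_ID_PLACEHOLDER",
    input_key="creative.mov",
    presets={"720p": "PRESET_ID_720P", "360p": "PRESET_ID_360P"},
    output_prefix="transcoded/",
)
print(request)
# With boto3 installed and credentials configured, this could be submitted with:
# boto3.client("elastictranscoder").create_job(**request)
```

One job fans a single source file out to as many presets as you list, which maps directly onto the multiple media files a VAST response needs.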

4. Job Status and Completion
Transcoding Job Status

You can track the status of your job on their dashboard manually. Once the status is complete you can visit the bucket/prefix and see the transcoded files.

Transcoding Job Input
Transcoding Job Output

You can see how a 720p HD file was transcoded, along with thumbnails of output files of varying resolutions and bitrates. If you compare the original file size with the sizes of the transcoded files, you can see the bandwidth saved, which also means the user doesn’t have to wait very long for the video ad to load.

Closing Thoughts

This is a classic example of how with the emergence of the cloud ecosystem infinite scale and on-demand can go hand in hand. For startups, the cloud is an amazing leveler to help innovate and get to market faster.

Look forward to sharing more tidbits, optimizations and architecture considerations while building the ad-server in follow-up posts. Ending with a quote (modified to suit the blog post) from one of my favorite movies TROY – “If they ever tell our story let them say that we walked with giants. Startups rise and fall like the winter wheat, but these names will never die. Let them say we lived in the time of Azure, tamer of the Microsoft stack. Let them say we lived in the time of AWS.”

Match.AI using Personality Insights service from IBM Watson

How it all started – Xhacknight Hackathon

Amrith and I spontaneously decided to make it to the Xhacknight organized by XHackers and sponsored by IBM Bluemix. It turned out to be an amazing learning experience.

What we built – Match.AI

1. Reviews are everywhere
With the boom in user-generated content (UGC), reviews are everywhere, be it for the product you are buying, the restaurant you want to eat at or the place you want to stay at. Each of us scouts through reviews for opinions before making a decision.

2. Sadly – Online reviews are broken!
Each person’s likings and preferences are different. None of the UGC sites let users take into account how closely their own personality matches that of the reviewers.

3. PersonalityMatch Index (PMI)
Using Personality Insights service from IBM Watson we were able to capture the personality profile of a user. On top of this, we built a PMI algorithm to match two different personality profiles. Using IBM Bluemix we were able to deploy this service using Node.js and expose it through REST endpoints (/map and /match)
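For a feel of what matching two profiles can look like, here is a deliberately simple stand-in, not the PMI algorithm we built at the hackathon (that is in the linked repository). It assumes each profile is a dictionary of five trait scores between 0 and 1 and scores the match as 100 minus the average gap; the trait keys and the formula are assumptions for illustration.

```python
# Hypothetical trait keys, each scored between 0 and 1.
TRAITS = (
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "emotional_range",
)


def personality_match_index(a: dict, b: dict) -> float:
    """Score two trait profiles from 0 (opposite) to 100 (identical)."""
    mean_gap = sum(abs(a[t] - b[t]) for t in TRAITS) / len(TRAITS)
    return round((1 - mean_gap) * 100, 1)


reader = {"openness": 0.8, "conscientiousness": 0.6, "extraversion": 0.3,
          "agreeableness": 0.7, "emotional_range": 0.4}
reviewer = {"openness": 0.7, "conscientiousness": 0.5, "extraversion": 0.5,
            "agreeableness": 0.6, "emotional_range": 0.4}
print(personality_match_index(reader, reviewer))
```

A review from someone with a high score against your profile would then be weighted above one from someone who shares none of your preferences.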

4. Xamarin – Demo Reviews Mobile App
We also built a barebones Xamarin mobile app for iOS and Android which allows users to log in through Facebook Connect, builds a personality profile from the user’s Facebook data and finally uses the REST endpoints to show a PMI rating along with the reviews.

5. Use-cases through API
The Match.AI PersonalityMatch Index REST API we built was generic and could be used for quite a few use cases:
– User Review Personalization
– Dating App Recommendations
– Resume Classification

Reference Links

API service: https://github.com/whiteboardmonk/ibm-bluemix-pi-xhacknight
Xamarin app: https://github.com/amrithyerramilli/xhacknight-xamarin-pmi-app

Presentation after Hackathon

social|median design contest and invites

social|median harnesses social filtering to help you keep up-to-date with news that matters to you.

From their website:

socialmedian is a social news service that connects people with personalized news and information. socialmedian enables you to easily keep up-to-date on the news that matters to you and to people who share your interests.

Currently, they are in ALPHA (heard of web applications in BETA before, e.g. Gmail? Alpha is the milestone just before BETA 😉).

What I like about them the most is their approach. Jason, the founder, and his team at TrueSparrow are not churning out code all the time (as you might expect of a product in ALPHA) but are constantly looking for feedback from their ALPHA users. At the time of writing this post they already had 2468 alpha users signed up, most of them active.

Now, it is this approach of theirs which I highly appreciate and when Jason announced the social|median design contest I had to pitch in.

I thoroughly enjoyed the whole experience of participating in the design contest while the product is in its ALPHA and probably helping shape its vision/roadmap. My participation in the contest is a story in itself and it will probably unfold sometime soon on this blog (maybe partially).

Interested in seeing my entry for the contest? For you all I have a special annotated version of the files I sent to Jason unlike the ones uploaded in the contest.

(I would recommend you to see the first and the third at its original size by clicking on the preview images below. The notes would also help you figure out what’s the application about.)

User’s signed in homepage:


User’s signed in homepage after clicking on expand profile:


Web 2.0 network on social|median:


You can go through all the entries here. To vote for me you can follow this link. I would highly appreciate it if you took time out and made an informed decision by going through all the entries. Since all the designs are also effectively ALPHA 😉, do leave your comments here or on the individual design pages.

Thanks for taking time out and participating in the voting process. Looking forward to your feedback.

Jason has been gracious and has extended some ALPHA invites to social|median for you! Grab them soon before they are gone. Leave me a message along with your email and I shall send you the code for the ALPHA invite.

Results are out:
Results are out and you can read more about it here. I didn’t win, but I hope I added value to the whole process. All in all, I am extremely happy with the contest and with the feedback my entry received from fellow contestants, unknown friends and well-wishers. (From what I noticed, at least 3 comments went missing from that post after I read them, so if you see your feedback missing, do send it to me by email as I really value it.) Thanks once again for your support.

Look who’s talking about PHP :)

David, the creator of Ruby on Rails, recently wrote a post on his blog about PHP. He had some good things to say about PHP and, as he rightly pointed out, it deserves more respect from the community in general.

From his blog post:

I’ve been writing a little bit of PHP again today. That platform has really received an unfair reputation. For the small things I’ve been using it for lately, it’s absolutely perfect.

I love the fact that it’s all just self-contained. That the language includes so many helpful functions in the box. And that it managed to get distributed with just about every instance of Apache out there.

For the small chores, being quick and effective matters far more than long-term maintenance concerns. Or how pretty the code is. PHP scales down like no other package for the web and it deserves more credit for tackling that scope.

I have always respected and utilized them for tasks which leverage their inherent strengths 🙂

Recent Category Posts – K2 Sidebar Module

Do you use WordPress? Heard of K2? If not, then you really have to give it a good look.

From the K2 website:

K2 is an advanced template for the blogging engine WordPress developed by Michael Heilemann, Chris J Davis, Zeo, Steve Lam and Ben Sherratt.

It won’t make you coffee, sing songs of sweet regret or sit at your bedside when you’re ill, but it might make life just a tad bit easier for you.

Think of it this way: Where WordPress is everything that goes on behind the scenes, K2 is everything that reaches the readers of your blog.

WordPress itself takes care of authenticating users, fetching and sending data to and from the database and provides you with the backend administration interface.

K2 on the other hand is the frontend of WordPress. It’s main concern is displaying the data fetches through WordPress in the right way at the right time. Furthermore, where more basic themes like Kubrick have little situational awareness, K2 cares about you and is always trying to make sure you are presented with exactly the tools and data you need.

K2 supports styles, much like the main theme can be styled by using different CSS files.

K2 also has a stellar Sidebar Module (SBM), which is much like WordPress Widgets on steroids. WordPress Widgets do have a cleaner approach, as they hook onto the WordPress Plugin API. On the other hand, SBM has a better UI for configuration and gives users more granular control, coupled with a solid API for the programmer. You can even disable SBM from the K2 admin panel to support WordPress Widgets.

So, here is my first SBM for K2.

Recent Category Posts – K2 Sidebar Module

Someone had also requested this as a sidebar module on the K2 forums.

The programmers and the ABCDEFG problem

Again and again I am reminded of the ABCDEFG problem I read in “The Nudist on the Late Shift — and Other True Tales of Silicon Valley” by Po Bronson.

“The ABCDEFG Problem.” I call it that because all good programmers have tons of choices to work on, A through G. Some choices seem cooler and some seem dumber, some possible and some improbable, but as to the payday lurking behind the door, they all look alike. They’re just A through G, take your pick. Choice A may be 3DO, and choice G may be $2 million of Microsoft stock, and Choice C may be a quarterback with four Super Bowl rings, but you just don’t know. It’s sort of like choosing one million units of foreign currency by which country’s paper bills have the splashiest colors, or making a million-dollar bet on the NCAA basketball tournament by whichever team has the sexiest cheerleaders. The variables that programmers have to go on (A-G) are not the variables that determine the outcome (X, Y, and Z).

And rightly said, “The variables that programmers have to go on (A-G) are not the variables that determine the outcome (X, Y, and Z).” So, do what you love and have fun coding 🙂

You can read an excerpt from the book here.

P.S. Hanisha, I still have the book I borrowed from you. Thanks 🙂

ErlyWeb: ErlangOnRails

I have been playing around with Erlang for some weekends now and I find it to be really interesting. I like the functional programming paradigm too. The syntax reminds me a little of Prolog (from my engineering days).

Programming Erlang: Software for a Concurrent World is the Pickaxe book for Erlang, written by Joe Armstrong. Joe designed and implemented the first version of Erlang in 1986.

What’s all this fuss about Erlang? This and recent articles on reddit got me interested in Erlang.

ErlyWeb - The Erlang twist on Web Frameworks

Yariv Sadan has also developed a web framework on Erlang named ErlyWeb. ErlyWeb is in its 0.7 version and has a Rails-like feel to it, but it doesn’t look as clean as Rails because it’s on Erlang. Not many people get it, but it’s actually because of Ruby that Rails can do most of the magic it’s said to do. ErlyWeb is a commendable effort and I really want to give it a fair shot before I form an opinion about it.

I have a very interesting project in mind where Erlang fits the bill perfectly. More about it sometime later 😉

Enhance GMail – Use keyboard shortcuts

I have made a switch from Yahoo to Gmail for my primary mail address in the last few months and it’s been a pleasant experience.

I have been using Gmail’s keyboard shortcuts and I must say I am addicted to them. Highly recommended for anyone who loves the terminal or likes editors such as emacs and vi. I feel the shortcuts have more of a vi feel to them.

I would recommend Gmail if:

  • The volume of mail you have to respond to is huge. On Gmail alone I have had 226 mails in my sent folder in the last two months. Most of them were related to CodeCampMumbai.
  • You want to receive registration confirmation mails quickly. I have noticed that it’s faster on Gmail than on Yahoo.
  • Discover Archiving. Yeah, archiving is beautiful.
  • You are looking for a GTD app.(link)
  • Oops, I just figured out that I can go on and on…

Which reminds me I have to update my Contact page.