By Deepak Bobbarjung
At Passage AI, we offer a platform to build, manage, and deploy AI-powered chatbots. Our technology stack comprises a set of microservices that handle messages via AI and also perform bot configuration and management. In this blog post, I will describe our journey of migrating our microservices from AWS ASGs (Auto Scaling Groups) to Kubernetes.
During the early days of the company, we created a Jenkins CI/CD pipeline based on multi-branch pipelines and moved to a model where each microservice in our architecture was hosted as an AWS Auto Scaling Group that could be scaled up or down based on load. Each code deployment of a microservice involved updating the code on the current instances of the ASG and then updating the AMI/Launch Configuration backing that ASG. While this worked, the deployment process took a long time, especially during AMI updates, and it would fail intermittently during those updates. After such a failure, when the ASG tried to scale up, the newly launched instances would often run a stale version of the code, resulting in inconsistent behavior.
During those early months, we kept getting requirements from the business team to be cloud agnostic, for the following reasons:
1. In the field, we would encounter customers who would work with us only if we were hosted on a particular cloud, or in a particular region of a particular cloud.
2. We would encounter customers who were concerned about moving to the cloud and would ask whether we could provide an on-prem solution that they would host themselves.
3. As a startup we had credits with all major cloud providers that expired at different times. We wanted to be able to leverage the credits we had across all our cloud provider partners, and not just our AWS credits.
So while sticking to one cloud provider is a reasonable choice in the very early days of a startup, in our case, we concluded that it was in our long term interest to design our infrastructure to be cloud agnostic.
However, designing infrastructure to be cloud agnostic comes with a set of challenges. The way you configure and manage scalability, redundancy and load balancing of traffic (just to name a few aspects of microservice management) differs significantly from one cloud to the other. This has the following implications across the engineering organization:
1. The CI/CD tooling we write to deploy new code would differ based on the cloud we are deploying to.
2. Microservice owners in charge of reliability and scale for their microservice would have to learn the APIs or processes for configuring auto scaling and reliability on each of the clouds that we support.
3. Developers would need to understand the process of testing and debugging their code on all major cloud providers. Integration tests and stress tests would potentially need to be written against code running across multiple clouds.
For a small company like ours, the above concerns make it virtually impossible to run our infrastructure across multiple cloud providers. We want most of our engineering team to contribute to our core product, which is our natural language processing (NLP) and bot builder platform, rather than write devops automation to support three or more cloud providers. And finally, even if we somehow solved the above challenges, we would still not acquire the ability to host our services in our customers’ on-prem datacenters.
Given the above concerns, it made sense to explore a way to deploy and manage our fleet of microservices in a cloud agnostic way. After doing some due diligence, we chose to move all of our services from AWS ASGs to Kubernetes. Kubernetes acts as an abstraction layer that can run on any of the major clouds. It can also run in our customers’ on-prem datacenters if necessary. Once we brought up Kubernetes clusters for our different environments (integration, staging and production), we switched all our CI/CD tooling to deploy our microservices onto a Kubernetes cluster rather than to a specific cloud. Similarly, we configured scale and high availability using Kubernetes commands and scripts rather than scripts or tools that are specific to a cloud. Yes, microservice owners and developers are now required to learn the constructs needed to deploy, debug, and configure availability and autoscaling of their microservice in Kubernetes. On the upside, they do not have to care about the underlying cloud infrastructure that Kubernetes is running on. Given the rise of Kubernetes as the de facto standard for orchestrating and scheduling microservices, we anticipate that our development team will benefit from having a bit of Kubernetes knowledge under their belts.
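To make "autoscaling in Kubernetes" a little more concrete: the Kubernetes Horizontal Pod Autoscaler documents its core rule as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The sketch below only illustrates that arithmetic. It assumes a single metric, ignores the autoscaler's tolerance and stabilization behavior, and the function name and the min/max bounds are illustrative rather than anything from our own configuration.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """Apply the documented HPA scaling rule, then clamp to the allowed range.

    Simplified: one metric, no tolerance band, no stabilization window.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # Scale the replica count by how far the metric is from its target.
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Never go below the floor or above the ceiling set by the service owner.
    return max(min_replicas, min(max_replicas, raw))
```

For example, 4 replicas averaging 90% CPU against a 60% target would be scaled to 6 replicas, as long as 6 is within the configured bounds.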
I won’t claim the transition from AWS ASGs to Kubernetes was easy. The challenge was that we had bots in production with hundreds of thousands of users per day, and we had to make the transition while ensuring zero downtime for our existing customers.
To support our AWS ASG pipeline, we had written Jenkinsfile pipelines for each of our microservices that would deploy the latest code to AWS ASGs in staging and production environments and then also update the AMI for the staging and production ASGs.
The plan was to create a new integration (INT) environment in K8s only, then move our AWS staging environment to K8s, and finally move our AWS production environment to K8s. These were the steps we followed to transition to Kubernetes.
1. Bring up Kubernetes clusters for our 3 new deployment environments — integration, staging and production.
2. Create a common docker registry for all the environments and dockerize all of our existing microservices.
3. Create configmap, deployment and service files for each of our microservices and for each of our environments.
4. Deploy nginx reverse proxy on each of our environments with routes configured for each of our microservices.
5. Change our Jenkinsfile pipeline to create/apply a Kubernetes deployment from the latest docker image during the integration and staging deployments instead of updating the corresponding AWS ASG.
6. Switch the routes for our external microservices running on staging to point to the new routes exposed by nginx running in the new staging environment. Create new routes for our integration environment and point them to the K8s integration environment.
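The per-service, per-environment files from step 3 can be sketched as follows. This is a minimal illustration, not our actual tooling: it assumes one cluster per environment, builds the ConfigMap, Deployment and Service as plain Python dictionaries, and emits them as a single JSON `List` object, which `kubectl apply -f` accepts alongside YAML. The helper names, label keys, and the way settings are injected through `envFrom` are all assumptions made for the example.

```python
import json


def build_configmap(service, env, settings):
    """ConfigMap holding the environment-specific settings for one service."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": f"{service}-config",
                     "labels": {"app": service, "environment": env}},
        # ConfigMap values must be strings.
        "data": {key: str(value) for key, value in settings.items()},
    }


def build_deployment(service, env, image, replicas, port):
    """Deployment that runs the image and reads settings from the ConfigMap."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": service,
                     "labels": {"app": service, "environment": env}},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": {"app": service}},
            "template": {
                "metadata": {"labels": {"app": service}},
                "spec": {"containers": [{
                    "name": service,
                    "image": image,
                    "ports": [{"containerPort": port}],
                    # Expose every ConfigMap key as an environment variable.
                    "envFrom": [{"configMapRef": {"name": f"{service}-config"}}],
                }]},
            },
        },
    }


def build_service(service, env, port):
    """Service that gives the pods a stable in-cluster address."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": service,
                     "labels": {"app": service, "environment": env}},
        "spec": {"selector": {"app": service},
                 "ports": [{"port": port, "targetPort": port}]},
    }


def render_manifests(service, env, image, replicas, port, settings):
    """Bundle the three objects into one JSON document."""
    items = [
        build_configmap(service, env, settings),
        build_deployment(service, env, image, replicas, port),
        build_service(service, env, port),
    ]
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2)
```

A pipeline step could write the output of `render_manifests(...)` to a file and apply it to the cluster for the matching environment. Only the `settings` and `replicas` arguments would differ between integration, staging and production, while the image tag stays the same.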
The above steps effectively switched our integration and staging environments from AWS ASGs to Kubernetes. We then let things simmer for some time, allowing our engineering team to familiarize themselves with Kubernetes. During this time, we were still running our production environment on AWS ASGs. This was not an ideal scenario, but it was worth it because it gave us time to understand several considerations of running our code on Kubernetes, such as sizing, performance and latency. We were able to conduct several stress tests and monitor performance. These tests gave us the confidence that we were finally ready to switch our production environment over to Kubernetes.
Switching Production to Kubernetes
The major concern with moving our production environment to Kubernetes was to ensure zero downtime for our production users. We identified an order in which our production microservices could be switched over to Kubernetes. This order guaranteed that at any given step, it was ok for some of our microservices to have been switched over to K8s and for the rest to still be running on the original AWS ASG environment. We also set up a mongo mirror to continuously replicate our MongoDB database from the AWS East region to the Azure region where we were hosting our Kubernetes environment.
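One way to derive such a cutover order is a topological sort over "must move first" constraints. The sketch below is an assumption about how this could be done, not a description of the exact procedure we followed. The service names are hypothetical, and the rule that a service moves only after the services listed for it have moved is an illustrative choice. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter


def cutover_order(must_move_first):
    """Return an order in which every service follows its prerequisites.

    must_move_first maps a service name to the set of services that have to
    be switched over before it. Raises graphlib.CycleError if the
    constraints are circular.
    """
    return list(TopologicalSorter(must_move_first).static_order())


def is_safe_order(order, must_move_first):
    """Check that an order respects every 'must move first' constraint."""
    position = {service: index for index, service in enumerate(order)}
    return all(
        position[prerequisite] < position[service]
        for service, prerequisites in must_move_first.items()
        for prerequisite in prerequisites
    )


# Hypothetical constraints, for illustration only.
constraints = {
    "nlp-engine": set(),
    "bot-config": {"nlp-engine"},
    "message-handler": {"nlp-engine", "bot-config"},
}
order = cutover_order(constraints)
```

With these constraints the only valid order is `nlp-engine`, `bot-config`, `message-handler`. When several orders are valid, `is_safe_order` can be used to validate a hand-picked one before the migration starts.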
Following the above processes allowed us to perform the migration with truly zero disruption to our customers and their users. We now have all of our microservices running on Kubernetes on all our environments.
Reaping the benefits
Transitioning to K8s has allowed us to become cloud agnostic, giving us the ability to deploy our services on any cloud or in on-prem datacenters. We are also realizing the benefits of dockerizing our microservices. For example, our CI/CD pipeline takes much less time now that we create a docker image with every code deployment instead of creating a new AWS AMI. Scaling up in Kubernetes is fast and reliable, as new docker containers can be created in a matter of seconds, as opposed to spinning up instances from AMIs, which would take minutes. Using configmaps also allows us to use the exact same image for a microservice across all our environments, whereas previously we had to maintain different AMIs for staging and production.
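The reason one image can serve every environment is that nothing environment-specific is baked into it: the service reads its settings at startup from values that the configmap injects. Here is a minimal sketch of that pattern, assuming the configmap keys are exposed to the container as environment variables. The variable names (`APP_ENVIRONMENT`, `MONGO_URI`, `LOG_LEVEL`) and the `Settings` class are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    environment: str
    mongo_uri: str
    log_level: str


def load_settings(env=None):
    """Build Settings from environment variables (or any mapping, for tests).

    Required keys raise KeyError when missing, so a misconfigured
    deployment fails at startup rather than at first use.
    """
    env = os.environ if env is None else env
    return Settings(
        environment=env["APP_ENVIRONMENT"],
        mongo_uri=env["MONGO_URI"],
        # Optional setting with a default.
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

The same container then behaves as a staging or production instance depending only on which configmap the cluster attaches to it.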
We realize we have only scratched the surface in terms of leveraging all of Kubernetes’ capabilities. We have several exciting features on our devops roadmap, including blue/green deployments, developer sandboxes, scheduled jobs, collecting microservice metrics by setting up an API proxy via sidecar containers, and exploring the training of machine learning models via Kubeflow. We are excited about the possibilities and will continue to leverage Kubernetes to provide additional value to our developers, allowing them to focus the bulk of their efforts on building the best platform for AI-powered chatbots.
For more information about Passage AI, please look us up at: http://www.passage.ai