Transcript
Ma: My name is Sha Ma. I'm the VP of software engineering at GitHub, responsible for core platform and ecosystem products. Prior to GitHub, I was VP of software engineering at SendGrid, and was part of the leadership team that took the company public in 2017. I'm excited to talk to you about GitHub's recent journey towards a microservices architecture.
The Beginning
GitHub was founded in 2008 as a way of making it easier for developers to host and share their code. The founders of GitHub were open source contributors and influencers in the Ruby community. Because of that, GitHub's architecture is deeply rooted in Ruby on Rails. Over the course of the company's history, we have employed some of the world's best Ruby developers to help us scale and optimize our code base. Today, we have over 50 million developers on our platform, over 80 million pull requests merged per year, and over 100 million repositories across every continent of the world. As you can see, a monolithic architecture got us pretty far: a code base that's over 12 years old, coordinated deploy trains that handle multiple deployments per day, a highly scaled platform serving over a billion API calls daily, and a fairly performant user interface that focuses on getting the job done.
Rapid Internal Growth
Internally, GitHub went through a significant growth phase in the last 18 months. With over 2000 employees, we have more than doubled the number of engineers contributing to our code base. We've grown both organically and through acquisitions, such as Semmle, npm, Dependabot, and Pull Panda. Additionally, GitHub is a highly distributed team, with over 70% of our employees working outside of our San Francisco headquarters prior to the pandemic. GitHub employees and contractors collaborate across six continents, working in all time zones. With over 1000 internal developers bringing a diverse set of skills and operating in a wide range of technologies, it's become clear that we need to fundamentally rethink how we do software development at GitHub. Having everyone learn Ruby before they can be productive, and having everyone develop in the same monolithic code base, is no longer the most efficient and optimal way to scale GitHub. Conway's Law states that any organization that designs a system will produce a design whose structure is a copy of the organization's communication structure. This also applies in reverse: a monolithic environment leads to bigger stakeholder meetings and more complicated decision-making processes, because of interwoven logic and shared data that impact all the teams.
Monolith vs. Microservices
This got us thinking: is it finally time to start migrating out of the Ruby on Rails monolith towards a microservices architecture? If so, how should we go about doing it? Both monolithic and microservices architectures have their advantages. In a monolithic environment, it's easier to get up and running quickly, without having to worry about complex dependencies and pulling in all the right pieces. A new Hubber can get GitHub up and running on their local machine within hours. There is some code-level simplicity in a monolith as well. For example, you don't have to add extra logic to deal with timeouts or worry about failing gracefully due to network latency and outages. Additionally, because everyone is working in a shared tech stack and has familiarity with the same code base, it's easier to move people and teams around to work on different features within the monolith, and to push towards a more global prioritization of features. Because of the way GitHub has grown in the last 18 months, some of the advantages of a microservices environment are starting to look really appealing to us. For example, setting up feature teams with system-level ownership and having functional boundaries through clearly defined API contracts. Teams have a lot of freedom to choose the tech stack that makes the most sense for them, as long as the API contracts are followed. Smaller services also mean easier-to-read code, quicker ramp-up time, and easier troubleshooting within that code base. A developer no longer has to understand all the inner workings of a large monolithic code base in order to be productive. Most importantly, services can now be scaled separately based on their individual needs.
Be Pragmatic - It's About Enablement
Before we jumped into this transition at GitHub, we spent some time thinking about the why behind our decision, and our goals for making this change. It's a huge shift for us from a cultural perspective, and requires a lot of work. We need to be intentional and think about what problems and pain points we're actually trying to solve. At GitHub, we're doing this so we can enable the more than half of our developer base who joined us in the last 18 months to be productive outside of the monolith. The goal for us is enablement and not replacement. Because of that, we need to accept the fact that at GitHub, for the foreseeable future, we will be a hybrid monolith-microservices environment, which means it's still very important for us to maintain and improve the existing code base inside the monolith. A good example of this is our recent upgrade to Ruby 2.7. You can read more about what we did and how it made our overall systems better on the GitHub blog.
Good Architecture Starts With Modularity
Good architecture starts with modularity. The first step towards breaking up a monolith is to think about the separation of code and data based on feature functionalities. This can be done within the monolith before physically separating them into a microservices environment, and it is generally good architectural practice for making the code base more manageable. Start with the data and pay close attention to how it's being accessed. Make sure each service owns and controls access to its own data, and that data access only happens through clearly defined API contracts. I've seen a lot of cases where people start off by pulling out the code logic but still rely on calls into a shared database inside the monolith. This often leads to a distributed monolith scenario that ends up being the worst of both worlds: having to manage the complexities of microservices without any of the benefits, such as being able to quickly and independently deploy a subset of features into production.
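To make the idea concrete, here is a minimal sketch, with hypothetical names, of keeping a domain's data behind a clearly defined access class while still inside the monolith, so the eventual physical extraction doesn't change any callers:

```python
# Minimal sketch (hypothetical names): the repository domain owns its data
# and exposes it only through a clearly defined contract. Other modules
# depend on RepositoryAPI, never on the repositories table directly.

from dataclasses import dataclass
from typing import Optional


@dataclass
class Repository:
    id: int
    owner_id: int
    name: str


class RepositoryAPI:
    """The only sanctioned way to read repository data. When the repository
    service is physically extracted later, this class becomes a network
    client and no caller has to change."""

    def __init__(self, db):
        self._db = db  # database connection owned exclusively by this domain

    def get_repository(self, repo_id: int) -> Optional[Repository]:
        row = self._db.execute(
            "SELECT id, owner_id, name FROM repositories WHERE id = ?",
            (repo_id,),
        ).fetchone()
        return Repository(*row) if row else None


# The anti-pattern: another team writing
#   db.execute("SELECT ... FROM repositories JOIN issues ...")
# couples them to this domain's schema and invites a distributed monolith.
```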
Separating Data at GitHub
Getting data separation right is a cornerstone in migrating from a monolithic architecture to microservices. Let's take a closer look at how we approach this at GitHub. First, we identified the functional boundaries within existing database schemas, and grouped the actual database tables along these boundaries. For example, we grouped everything related to repositories together, everything related to users together, and everything related to projects together. These resulting functional groups are referred to as schema domains, and are captured in a YAML definitions file. This is now our source of truth, and it is expected to be updated whenever tables are added or removed from our database schemas. We use a linter test to help remind developers to keep this file updated as they make those changes. Next, we identified a partition key for each schema domain. This is a shared field that links all the information together for a functional group. For example, the repository schema domain, which holds all the data related to repos, such as issues, pull requests, or review comments, uses repo ID as the partition key. Creating functional groups of database schemas will eventually help us safely split the data onto the different servers and clusters needed for a microservices architecture. But first, we needed to fix existing queries that cross domain boundaries, so that we don't end up breaking the product when data separation happens.
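To illustrate, here is a hedged sketch of what a schema-domains definitions file and its linter test might look like; GitHub's actual file format, domain names, and tables aren't public, so everything below is assumed for illustration:

```python
# Hypothetical schema-domains definitions (inlined here; in practice this
# would live in its own YAML file) plus a linter-style check that reminds
# developers to keep it updated as tables are added or removed.

import yaml  # pip install pyyaml

SCHEMA_DOMAINS = yaml.safe_load("""
repositories:        # partition key: repo_id
  - repositories
  - issues
  - pull_requests
  - review_comments
users:               # partition key: user_id
  - users
  - emails
projects:            # partition key: project_id
  - projects
  - project_cards
""")


def lint_schema_domains(all_tables):
    """Fail if any table in the database is missing from the definitions."""
    covered = {t for tables in SCHEMA_DOMAINS.values() for t in tables}
    missing = set(all_tables) - covered
    assert not missing, f"tables missing a schema domain: {sorted(missing)}"


lint_schema_domains(["repositories", "issues", "users", "emails"])
```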
At GitHub, we implemented a query watcher in the monolith to help detect and alert us any time a query crosses functional domains. We would then break up and rewrite these queries into multiple queries that respect the domain boundaries, and perform any necessary joins at the application layer. Finally, after all the functional groups have been isolated, we can begin a similar process to further shard our data into tenant groups. With over 50 million users and 100 million repos, functional groups can grow pretty big at GitHub scale. This is where the partition keys come in handy. We can follow a similar process to identify ranges of partition keys to group together. For example, an easy way is to simply assign different users to different datastores based on numeric ranges, though there are probably more logical groupings based on the characteristics of each data set, such as region and size. Tenantizing is a great way to limit the blast radius of data storage failures to only a subset of your customers, versus impacting everyone all at once.
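As a sketch of that numeric-range approach, assuming made-up ID ranges and cluster names:

```python
# Route each partition key (here, user ID) to a datastore by numeric range.
# Ranges and cluster names are illustrative only.

import bisect

# (exclusive upper bound of the user_id range, datastore name), sorted
TENANT_RANGES = [
    (10_000_000, "users-cluster-a"),
    (25_000_000, "users-cluster-b"),
    (float("inf"), "users-cluster-c"),
]


def datastore_for_user(user_id: int) -> str:
    """Pick the datastore whose range contains this partition key."""
    bounds = [bound for bound, _ in TENANT_RANGES]
    return TENANT_RANGES[bisect.bisect_right(bounds, user_id)][1]


assert datastore_for_user(42) == "users-cluster-a"
assert datastore_for_user(12_345_678) == "users-cluster-b"
```

A failure in users-cluster-b then only affects the tenants whose IDs live there, rather than every user at once.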
Start With Core Services and Shared Resources
We have spent quite a bit of time talking about the importance of data separation. Let's switch gears and talk about how to lay the groundwork for extracting services out of the monolith. It's important to keep in mind that dependency direction should always go from inside of the monolith to outside of the monolith, and not the other way around, so we don't end up in that distributed monolith situation. This means when extracting services out of the monolith, start with the core services, and work your way out to the feature level. Next, look for gravitational pulls that keep developers working in the monolith. It's common for shared tooling to be built over time that makes development inside the monolith very convenient. For example, feature flags at GitHub give monolith developers peace of mind by providing control over who sees a new feature as it goes from staff-shipped to beta to production. Make these shared resources available to developers outside of the monolith and start shifting that gravitational pull. Finally, make sure to remove old code paths once new services are up and running. Use a tool to understand who's calling each service, and have a plan to move 100% of the traffic over to the new service, so you don't get stuck supporting two sets of code forever. At GitHub, we use an open source tool called Scientist to help us with this type of rollout, where we can run and compare both the old and the new code paths side by side.
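Scientist itself is a Ruby library; the sketch below shows only the underlying pattern in Python, not Scientist's actual API: keep returning the old code path's result while running the new path alongside it and recording any mismatches.

```python
# Side-by-side rollout pattern (a sketch, not the Scientist API): the old
# path's result is always returned, so a buggy new path can't affect users.

import logging
import time

logger = logging.getLogger("experiments")


def experiment(name, use_old, try_new):
    old_start = time.monotonic()
    old_result = use_old()
    old_ms = (time.monotonic() - old_start) * 1000

    try:
        new_start = time.monotonic()
        new_result = try_new()
        new_ms = (time.monotonic() - new_start) * 1000
        if new_result != old_result:
            logger.warning("%s: mismatch old=%r new=%r", name, old_result, new_result)
        else:
            logger.info("%s: match (old %.1f ms, new %.1f ms)", name, old_ms, new_ms)
    except Exception:
        logger.exception("%s: new code path raised", name)  # never propagate

    return old_result


# Usage (hypothetical helper functions):
# allowed = experiment("authz-extraction",
#                      use_old=lambda: monolith_check(user, repo),
#                      try_new=lambda: authz_service_check(user, repo))
```

Once the experiment reports a sustained 100% match rate, the old code path can be deleted with confidence.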
Extracting AuthN/AuthZ at GitHub
The core services that we decided to extract first at GitHub are authentication and authorization. Authentication is pretty complex, because everything needs it. There's a ton of shared logic between the website and Git operations. This means that if github.com is down, then access to Git systems is also down, and Git operations like pull and push will no longer work even through a command line interface. This is why it's so important for some of these fundamental pieces to be extracted to allow primary functions to still happen without having to be tied into the monolith. Authorization for us was much more straightforward, and has already been rewritten as a Go service outside of the monolith. The current Rails app, aka our monolith, communicates with it using Twirp, which is a gRPC-like service-to-service communications framework, thus keeping the inside-to-outside dependency direction.
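Twirp exposes each RPC as a plain HTTP POST to /twirp/&lt;package&gt;.&lt;Service&gt;/&lt;Method&gt; with a protobuf or JSON body, so a monolith-side call to an extracted authorization service might look like the hedged sketch below; the host, service, and method names are hypothetical, not GitHub's actual contract.

```python
# Hypothetical monolith-to-service call following Twirp's URL convention,
# using the JSON encoding for readability.

import requests

AUTHZ_BASE_URL = "http://authzd.internal:8080"  # made-up internal host


def can_access(actor_id: int, repo_id: int) -> bool:
    resp = requests.post(
        f"{AUTHZ_BASE_URL}/twirp/authz.v1.Authorizer/CheckAccess",
        json={"actor_id": actor_id, "repo_id": repo_id},
        timeout=0.5,  # never let an authorization call hang the monolith
    )
    resp.raise_for_status()
    return resp.json()["allowed"]
```

Note the dependency direction: the monolith calls out to the service, and the service knows nothing about the monolith.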
Make Operational Changes
Monitoring, CI/CD, and containerization are not new concepts, but making the necessary operational changes to support the transformation from monolith to microservices can yield significant time savings and help expedite the transition. Keep the main characteristics of microservices in mind when you make these workflow changes. Operationally supporting numerous, small, independently running services with diverse tech stacks is very different from running a single, highly customized pipeline for a large monolith. Update monitoring from function-call metrics to network metrics and contract interfaces. Push towards a more automated and reliable CI/CD pipeline that can be shared across services. Use containerization to support a variety of languages and tech stacks. Create workflow templates to enable reusability.
For example, at GitHub, we created a self-service runtime platform to deliver microservices in a box. The goal is to drastically reduce each team's operational overhead for creating microservices. It comes with Kubernetes-ready templates, free ingress setup for load balancing, automatic piping of logs into Splunk, and integration into our internal deployment process. This makes it easier for any team that wants to experiment with or set up a new microservice to get started.
Start Small and Think About Product/Business Value
So far, we've covered a lot of ground on the structural changes and shared foundations needed for a successful transition from a monolith to a microservices architecture. From this point on, any new feature should be created as a microservice outside of the monolith. Next, look for a few simple, minor features to move out of the monolith, for example, features that don't have a lot of complicated dependencies and shared logic. At GitHub, we started with webhook deliveries and syntax highlighting. Use this as an opportunity to look for common patterns and identify gaps before moving on to bigger and hairier functionalities in the monolith. Use product and business value to help determine the right size of microservices.
Look for code and data that are often changed and deployed together to determine features or functionalities that are more tightly coupled. Use these as your natural groupings for what can be iterated on and deployed independently from other areas. Focusing on product and business value also helps with the organizational alignment across engineering, product, and design. Keep in mind, breaking things up too small can often add unnecessary complexity and overhead. For example, maintaining separate deploy keys, more on-call responsibilities, and single points of failure due to the lack of shared knowledge.
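One hedged way to surface those natural groupings is to mine version history for files that frequently change in the same commit. The sketch below parses git log output; it illustrates the idea rather than any tooling GitHub actually uses.

```python
# Count how often pairs of files change together across recent commits;
# high co-change counts suggest candidates for the same service.

import subprocess
from collections import Counter
from itertools import combinations


def co_change_counts(repo_path=".", max_commits=5000):
    log = subprocess.run(
        ["git", "-C", repo_path, "log", f"-{max_commits}",
         "--name-only", "--pretty=format:@@@"],
        capture_output=True, text=True, check=True,
    ).stdout
    pairs = Counter()
    for commit in log.split("@@@"):  # one chunk of filenames per commit
        files = sorted({f for f in commit.splitlines() if f.strip()})
        for a, b in combinations(files, 2):
            pairs[(a, b)] += 1
    return pairs


for (a, b), n in co_change_counts().most_common(10):
    print(f"{n:5d}  {a}  <->  {b}")
```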
Move towards Asynchronicity and Code for Resiliency
Going from monolith to microservices is a major paradigm shift. Both the software development process and the actual code base will look significantly different going through this transition. To wrap up, we will quickly cover service-to-service communications and designing for failure, both of which are important concepts in microservices development.
There are two ways that services communicate with one another: synchronously and asynchronously. With synchronous communications, the client sends a request and waits for a response from the server. With asynchronous communications, the client sends a message without waiting for a response, and each message can be processed by multiple receivers. We use Twirp at GitHub to enable synchronous communications between the monolith and the core services outside of the monolith, like authorization. As more services move outside of the monolith, however, synchronous communication starts to become wildly inefficient, as the picture in the upper right demonstrates. It also creates tight coupling between all the services, which ends up defeating the purpose of moving to a microservices architecture. A better approach is to create a shared events pipeline that can broker messages across multiple producers and consumers. This is the architecture we used at SendGrid.
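As a toy illustration of the pipeline idea, the in-process sketch below fans each published message out to every subscriber; a production pipeline would use a durable, asynchronous broker such as Kafka, and the topic and handlers here are made up.

```python
# Toy event bus: producers publish once, without knowing or waiting on the
# consumers; the broker fans each message out to all subscribers.

from collections import defaultdict


class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers[topic]:
            handler(event)


bus = EventBus()
bus.subscribe("push", lambda e: print("deliver webhooks for repo", e["repo_id"]))
bus.subscribe("push", lambda e: print("update search index for repo", e["repo_id"]))
bus.publish("push", {"repo_id": 42, "ref": "refs/heads/main"})
```

Adding a new consumer is now just a subscription, not a change to every producer.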
Because services are no longer hosted on a single server, it's important to account for latency and failure scenarios when communicating over the network. Simple retry logic with a clearly defined retry frequency and a maximum retry count may be sufficient to handle most temporary network problems. Consider adding some intelligence to the retry logic using exponential backoff. Instead of retrying requests at a constant interval, exponential backoff increases the amount of wait time between retries, providing some relief to servers that are not responding because of overload. A circuit breaker can also be added as an intermediary between services, as a self-protection and healing mechanism. For example, after a number of failed attempts, the circuit breaker opens and does not allow additional requests to come through until the service has recovered. Set a timeout so your service doesn't end up waiting forever for an external service to respond. Try failing gracefully by presenting user-friendly messages or falling back to a last known good state in the cache. Be mindful of the user experience, and do what makes sense for the business.
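Here is a minimal sketch of both patterns, retries with exponential backoff plus jitter and a simple circuit breaker; the thresholds, delays, and exception types are illustrative assumptions.

```python
import random
import time


def call_with_backoff(fn, max_retries=3, base_delay=0.1):
    """Retry a flaky call, doubling the wait each time; a little jitter keeps
    synchronized clients from hammering a recovering server in lockstep."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.05))


class CircuitBreaker:
    """After `threshold` consecutive failures, fail fast until `cooldown`
    seconds pass, giving the downstream service room to recover."""

    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def call(self, fn):
        if self.opened_at and time.monotonic() - self.opened_at < self.cooldown:
            raise RuntimeError("circuit open; failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures, self.opened_at = 0, None  # success closes the circuit
        return result
```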
Key Takeaways
The first four sections focused on the foundational pieces that should be in place before you start down the journey of transitioning from monolith to microservices: focus on the why, think about modularity and data separation, start with core services and shared resources, and make the necessary operational changes. Getting these right will make the transition to microservices a much more enjoyable experience for your entire organization. Then, we talked about where to start and how to tie microservices back to product and business value. Finally, we covered two key concepts in microservices: service-to-service communications and building resilient systems.