Airbnb is going through tremendous growth internationally, evolving from a home-sharing company into a global travel community with many product offerings. Business growth, increased traffic, and aggressive hiring created new challenges for the Production Infrastructure team, which has grown from a small team of 10 into a production platform organization of 100 engineers building the foundational services that support homes, experiences, luxury, and China. We shifted our focus from putting out fires to building a platform that can grow with the company. In this session, we chronicle Airbnb’s architectural evolution alongside its organizational growth strategy, and review how we overcame different architectural challenges by leveraging AWS technologies.
- 2015: site stability
- 2016: forward-looking foundation
- 2017: service architecture
2015: humans as infrastructure
The website was down: the database sat at 100% CPU usage and served no user requests for 24 minutes of every hour, caused by an unknown cron job that rebuilt a cache.
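The talk doesn't show how the cron was fixed, but a standard mitigation for this failure mode is to spread the rebuild across the hour with deterministic per-key jitter, so the database never refreshes the whole working set at once. A minimal sketch (all key names are illustrative, not Airbnb's code):

```ruby
# Spread an hourly cache rebuild over the hour: each key gets a stable
# offset into the window, so rebuild load is evenly distributed instead
# of every key being rebuilt at the top of the hour.
require "zlib"

JITTER_WINDOW_SECS = 3600 # one hour

# Same key -> same offset, so each key rebuilds at a predictable moment
# and total database load stays flat across the window.
def rebuild_offset(cache_key)
  Zlib.crc32(cache_key) % JITTER_WINDOW_SECS
end

offsets = ["homes:1", "homes:2", "checkout:9"].map { |k| rebuild_offset(k) }
```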
CDN -> nginx -> request middleware -> Rails -> RDS -> web rendering -> Rails
P3 (homes listing page): 19 core tables, 71 tables total
P4 (checkout page): 150 core tables, 215 tables total
Traffic growth: they used New Relic to monitor MySQL.
- lack of database observability
- MySQL performance schema
- log client-side (Rails) SQL queries
- database load and headroom
- master -> slave RDS topology
- database caching to save simple queries
- database connection limitations
- adding more and more web servers started hitting the max database connection limit
- adopted the MaxScale open source project for connection pooling
- went from 100 hours of downtime in 2015 to 3 hours of downtime in 2016
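The pooling idea behind the MaxScale work can be sketched in a few lines: many callers multiplex over a fixed set of connections instead of each web server holding its own, which is what caps the total count the database sees. This only illustrates the pattern, with a fake connection type standing in for a real MySQL connection:

```ruby
# A tiny connection pool: at most `size` connections ever exist, and
# callers block until one is free, instead of opening their own.
class ConnectionPool
  def initialize(size, &factory)
    @pool = Queue.new
    size.times { @pool << factory.call }
  end

  # Borrow a connection, run the block, always return it to the pool.
  def with_connection
    conn = @pool.pop # blocks if every connection is checked out
    yield conn
  ensure
    @pool << conn if conn
  end
end

FakeConn = Struct.new(:id) # stand-in for a real database connection

pool = ConnectionPool.new(2) { FakeConn.new(rand(1000)) }

# Ten "queries" share just two connections.
results = 10.times.map { |i| pool.with_connection { |c| "q#{i}:#{c.class}" } }
```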
All engineers worked in a single repository; master was released multiple times a day, and committers had to be on standby when their code was in a release batch.
Master was locked 15 hours per week while deploys were worked through.
service-oriented architecture
- message bus to decouple services
- dynamic configuration distribution
- some engineers used Chef
- some used Redis
- some hardcoded values
- ZooKeeper was the solution
- data stores support rapid product iteration
- some schema changes could take weeks to complete due to the size of the table
- propagation of mutation events
- enable developing Java services
- Airbnb was a Ruby shop but started to adopt Java because of its rich ecosystem
- route queries from the database to services (ActiveRecord)
- custom ActiveRecord adapters route database requests to separate servers
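The dynamic configuration distribution above boils down to a watch pattern: a central store pushes changes to subscribers instead of each team reinventing distribution with Chef, Redis, or hardcoded values. A minimal in-memory sketch of that shape; in production the store would be a ZooKeeper znode and the callback a ZooKeeper watch, and all names here are illustrative:

```ruby
# In-memory stand-in for a dynamic config store with change notification.
class DynamicConfig
  def initialize
    @values = {}
    @watchers = Hash.new { |h, k| h[k] = [] }
  end

  # Register a callback to run whenever `key` changes.
  def watch(key, &callback)
    @watchers[key] << callback
  end

  # Writing a key notifies every watcher, like a ZooKeeper data watch firing.
  def set(key, value)
    @values[key] = value
    @watchers[key].each { |cb| cb.call(value) }
  end

  def get(key)
    @values[key]
  end
end

config = DynamicConfig.new
seen = []
config.watch("search/timeout_ms") { |v| seen << v }
config.set("search/timeout_ms", 250) # all subscribers learn the new value
```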
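The custom ActiveRecord adapter idea can likewise be sketched as a thin routing layer: models keep issuing normal queries, and the adapter decides, table by table, whether a request goes to the main database or to a carved-out service. The mapping and names below are illustrative, not Airbnb's actual code:

```ruby
# Routes database requests to a backend by table name; unrouted tables
# fall through to the default (the main database).
class QueryRouter
  def initialize(default_backend)
    @default = default_backend
    @routes = {} # table name => backend
  end

  def route(table, to:)
    @routes[table] = to
  end

  def backend_for(table)
    @routes.fetch(table, @default)
  end
end

router = QueryRouter.new(:main_mysql)
router.route("users", to: :users_service) # table carved out behind a service
```

The point of doing this inside the adapter is that product code keeps using its models unchanged while tables migrate out from under it.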
SOA services architecture
Product engineers were hesitant to build services; creating a service was costly.
OneTouch service config management
- keep everything in one repo alongside the source code
- you can deploy/update configuration and change source code in the same repo
Creating new services became easier.
Service IDL file -> generates a service. UGH!
- it can support Thrift over HTTP, not only JSON over HTTP
Debugging distributed services is hard. Life of a request outside Monorail:
- scraper detection
- user auth
- content moderation
- risk check
- permission control
- core models
- media services
- security services
- payment services
2018: double down on SOA
PDP: the product detail page service, launched on the API service framework
- product iteration
- enable the Airbnb platform
A service scoped to a single page has very good isolation.
new infrastructure challenges
SmartStack service discovery stack:
- ZooKeeper is at its heart
- operating ZooKeeper is painful
- haproxy: the service topology keeps changing, so configs must change and haproxy must reload its config
- 20K EC2 instances
- restarting nodes is painful when AWS sends reboot requests
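To make the haproxy pain concrete: a discovery sidecar (SmartStack's Synapse plays this role) renders backend stanzas like the fragment below, and every time instances come or go it rewrites the server lines and reloads haproxy. Addresses, ports, and the service name here are made up for illustration:

```
# Illustrative haproxy backend as a discovery tool might render it;
# each registered instance becomes one server line.
backend pricing_service
  balance roundrobin
  server i-0a12 10.0.1.17:9090 check
  server i-0b34 10.0.2.44:9090 check
```

At 20K instances, the constant rewrite-and-reload cycle is what makes this painful to operate.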
DO NOT UNDERESTIMATE INFRA INVESTMENT TO MOVE TO SOA