Inside Agoda’s Private Cloud: Part 1
The evolution of Agoda's data centers, the hardware the company runs on, and a tour of the DCs. Exclusive details.
👋 Hi, this is Gergely with a 🔒 subscriber-only issue 🔒 of the Pragmatic Engineer Newsletter. In every issue, I cover challenges at Big Tech and startups through the lens of engineering managers and senior engineers.
In a previous two-part series, we dived into Uber’s multi-year project to move onto the cloud, away from operating its own data centers. But there’s no “one size fits all” strategy when it comes to deciding the right balance between utilizing the cloud and operating your infrastructure on-premises.
To show the complexity of this choice and the ways tech businesses approach it, this article brings the inside story of one large tech company that’s decided against onboarding to the cloud – at least for now.
Agoda is a leading online travel booking platform in Asia. It’s owned by Booking Holdings Inc, which also owns the popular travel sites Kayak and Booking.com. Unlike Uber, Agoda does not make use of public cloud providers, having decided to build out its own private cloud instead.
To learn more, I reached out to Agoda’s CTO Idan Zalzberg. In this new mini-series he shares exclusive details about how Agoda built and operates its data centers. In two articles we do a deep dive into the machines the company uses, evaluate Agoda’s decisions and also consider how they stack up against what Uber did. By the end of this series – with the help of Idan’s valuable insights – we hope to provide alternatives for midsize companies deciding their own cloud strategy. Spoiler: as mentioned above, there’s no one-size-fits-all solution! We try to be as open as possible about what works for Agoda, what did not, and why.
In today’s issue, we cover:
Agoda in numbers. The number of developers, physical cores, data centers, and more.
Internet service provider (ISP) basics. What are Tier 1, 2 and 3 ISPs? Why does it matter? And why does Agoda connect to Tier 1 providers?
Data center tiering. How tiers 1-4 for data centers measure up, and which tiers do popular cloud providers certify as? Why does Agoda co-locate with Tier 3-or-above?
The evolution of data centers at Agoda. From blade servers and Windows machines in 2012, to an in-house, Kubernetes-based orchestrator development system today.
The hardware inside Agoda’s private cloud. 64-core compute nodes, around 20 servers per rack, top-of-rack (ToR) switches and plenty of redundancy.
Data centers (DCs) and availability zones. How does the company organize its two regions, and which services use active-active or active-passive DC setups?
In Part 2, we additionally cover:
The application stack inside the private cloud. Fleet, Buckbeak, Agoda’s detailed data stack and bespoke developer portal.
Agoda’s cloud strategy and usage. Is Agoda’s goal to operate off the cloud, or on it? What are the use cases where the company already utilizes public cloud?
To move or not to move to the cloud. How Agoda knows if it’s time to move away from on-prem servers. Are location-based expenses keeping the company off public clouds?
A surprise advantage for hiring software engineers. Owning their own stack end-to-end comes with unexpected hiring benefits, when attracting developers.
Agoda’s learnings from operating its own DCs. The importance of standardizing, and why to minimize your tech stack.
The cloud or your own data centers? Idan’s advice for any midsize company weighing up on-prem versus public cloud.
1. Agoda in numbers
Agoda lists 3.6M hotels and holiday properties worldwide, and its apps and website appear in 39 languages. The company sees 80K searches per second at peak traffic, and serving them all involves calculating 10M different “accommodation rates” per second. Interestingly, the majority of these searches are not by holidaymakers “browsing right now” during their free time – most searches are by partners, affiliates and search engines!
The company employs about 6,600 people in 31 markets, with its headquarters in Singapore. Around 1,600 people work in engineering, including software engineers, data science and business intelligence (BI) teams, and the DevOps team. The majority of the engineering team is in Bangkok, Thailand.
Among the 1,600 tech workers, the hardware team numbers in the low tens, and their job is to maintain the hardware and ensure it operates in DCs as expected. This group doesn’t include the people working on the software layer for infrastructure, which is a separate software team that builds the orchestration platform (Fleet) on top of Kubernetes.
Although the company employs about 400-500 people in its data organization, there’s been no single dedicated “data team” for around 2 years. Agoda moved away from this model and its data engineers are embedded into each team. Within the data org, the distinct roles of data scientist, data analyst and data engineer are defined. Within data engineering, there is currently no separation between data engineers and machine learning (ML) engineers; individuals take on both roles. We go deeper into ML in What is ML engineering?
The company works in Scrum teams, which typically contain a product manager, developers, data scientists/analysts/engineers and BI engineers. The goal of having all these people is to create a shared business purpose.
Infrastructure-wise, the company operates around 6,500 servers, with a total of approximately 600k virtual cores (vCores) and 300k physical cores. The company’s largest data cluster is 20-30PB (petabytes: 1PB is 1,000 terabytes or 1M gigabytes). Ten years ago, this data cluster was a 300GB Hadoop cluster; that’s roughly a 70,000 to 100,000-fold increase in data stored!
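To make the scale concrete, here is a quick back-of-the-envelope check of the figures above. It is only a sketch: the variable names are ours, and it assumes decimal units (1PB = 1M GB), as stated in the text.

```python
# Back-of-the-envelope check of the numbers quoted in the article.
# Assumption: decimal units, i.e. 1 PB = 1,000,000 GB (as stated in the text).
GB_PER_PB = 1_000_000

old_cluster_gb = 300                      # Hadoop cluster ten years ago
new_cluster_pb_low, new_cluster_pb_high = 20, 30  # largest cluster today

# Growth factor of stored data, for both ends of the 20-30PB range.
growth_low = new_cluster_pb_low * GB_PER_PB / old_cluster_gb
growth_high = new_cluster_pb_high * GB_PER_PB / old_cluster_gb

servers = 6_500
physical_cores = 300_000
virtual_cores = 600_000

# Average physical cores per server, and vCores per physical core.
cores_per_server = physical_cores / servers
vcores_per_physical_core = virtual_cores / physical_cores

print(f"Data growth: {growth_low:,.0f}x to {growth_high:,.0f}x")
print(f"Physical cores per server (average): {cores_per_server:.1f}")
print(f"vCores per physical core: {vcores_per_physical_core:.1f}")
```

The fleet averages out to roughly 46 physical cores per server, with two vCores per physical core. That ratio would be consistent with two hardware threads per core, although that is our reading and not something Agoda states here.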
The company runs 4 data centers: one in the US, one in Europe and two in Asia. Agoda co-locates in all data centers, leasing space for its racks; the largest data center consumes about 1 MW of power. It uses Spark for the data platform. For transactional databases, it’s mostly Microsoft SQL Server, but also other databases like PostgreSQL, ScyllaDB and Couchbase. At peak load, Agoda sees around 7.5M queries per second in total, spread across its managed database-as-a-service (DBaaS). The company uses HP servers, and VAST hardware for object storage. Agoda utilizes Akamai as its CDN vendor.
2. Internet service provider (ISP) basics
Agoda uses Tier 1 (the “best”) network providers. To understand what this means, let’s do a quick primer of ISP tiering. Here are some terms we’ll use:
The internet doesn’t require much explanation. It is, of course, not just one thing, but a massive network of networks. Different organizations own different parts, like the fiber optic cables, routers and switches. For data to get from A to B, those bytes need to pass through networks of cables, switches and routers owned by different organizations.
Peering refers to connecting networks between two providers to exchange traffic. Peering is settlement free, meaning neither party pays for sending traffic to the other network. Providers strike peering deals because they each make revenue from customers, who do pay to access a network. Peering happens at internet exchange points (IXP) where different providers connect their networks via routers and switches.
IP transit (internet transit) allows data through a network. IP transit deals are typically struck between a smaller internet provider and a larger one. The bigger party agrees to forward data to any part of their network, usually for a fee, and payment may be based on volume of data, or a flat-rate fee.
Now let’s look at tiers. Bear in mind the tiering system is informal and is ever-changing. As internet providers grow, they may change tiers. For example, there’s no agreement on whether Comcast is a Tier 1 or Tier 2 provider; Comcast still purchases some IP transit internationally, but is peered with the vast majority of Tier 1 providers.
An internet exchange point (IXP) is a physical location where internet infrastructure companies like internet service providers and CDNs connect. These points allow network providers to share transit outside of their own network. IXPs usually contain lots of network switches; here’s the inside of the London Internet Exchange (LINX) IXP:
Tier 1 internet providers / networks are the biggest and do not purchase transit services from any other network. They can reach every other network on the internet via settlement-free interconnection as they have peered with them all. These providers each maintain between 30,000 and 900,000 kilometers (19,000 to 560,000 miles) of fiber routes. It’s these which can be thought of as the “backbone of the internet.”
So Tier 1 networks exchange traffic with fellow Tier 1 networks for free. Tier 1 networks do this by owning and maintaining their internet infrastructure, and by having negotiated settlement-free deals with other Tier 1 providers.
There are just over a dozen Tier 1 providers. The list, based on Wikipedia:
US: AT&T, Comcast, Verizon, T-Mobile, Lumen Technologies (formerly: CenturyLink and Level 3), GTT Communications, Zayo Group
Europe: Deutsche Telekom (Germany), Orange (France), Telecom Italia (Italy), Telxius (Spain), Liberty Global (Netherlands), Arelion (Sweden)
Asia: Tata Communications (India), NTT Communications (Japan), PCCW Global (Hong Kong)
Although Tier 1 networks are the biggest, these networks are often the furthest away from consumers and businesses. The above list omits many major networks most people would consider as Tier 1, such as China Telecom (China), Singtel (Singapore), Vodafone (UK), or Telstra (Australia). All these networks are close to Tier 1 status because they are connected to many Tier 1 networks and can reach around 50% or more of the internet, settlement-free. But not all of it, crucially.
Tier 2 internet providers / networks have struck up peer agreements with other networks, and also purchase IP transit to reach some parts of the internet. Tier 2 providers are the most common class of internet providers. This is because it’s easier to purchase IP transit from Tier 1 providers than to build out your own routes and negotiate peering with them all. Some larger Tier 2 providers are:
Europe: British Telecom (UK), Vodafone (UK), Tele2 (Sweden), TDC (Denmark), Turk Telekom (Turkey)
Asia: Korea Telecom, Telkom Indonesia, Singtel (Singapore), HGC Global (Hong Kong)
Australia: Telstra, Optus (owned by Singtel)
Canada: Fibrenoire
Note that many Tier 2 internet providers are considered “Tier 1” within their home country as they often set up peering with all major local players. For example, this is the case with many Australian internet providers. However, they still need to pay IP transit fees to access some parts of the internet.
Tier 3 internet providers / networks purchase IP transit from other internet providers. These are usually local providers who connect end-users and businesses to the internet, and usually purchase IP transit from Tier 2 and/or Tier 1 networks. They tend to focus on the “last mile” of connectivity. Local telco companies and small broadband companies tend to be Tier 3 providers.
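The three tier definitions above boil down to two questions: does a network buy IP transit, and does it have peering agreements? Here is a minimal sketch of that rule. The `Network` class, the `classify_tier` function and the example network names are all hypothetical, made up for illustration. As noted earlier, real-world tiering is informal, so treat this as a simplification and not an official classification.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Network:
    """Hypothetical model of an internet provider, for illustration only."""
    name: str
    # Networks it exchanges traffic with, settlement-free.
    peers: set[str] = field(default_factory=set)
    # Networks it pays for IP transit.
    transit_providers: set[str] = field(default_factory=set)


def classify_tier(network: Network) -> int:
    """Return the informal ISP tier (1, 2 or 3) using the definitions above.

    Assumes the network is actually connected to the internet.
    """
    if not network.transit_providers:
        # Buys no transit: reaches every other network via peering alone.
        return 1
    if network.peers:
        # Peers with some networks, but pays for transit to reach the rest.
        return 2
    # No peering at all: relies entirely on purchased transit.
    return 3


# Made-up example networks, not real providers.
backbone = Network("BackboneCo", peers={"OtherBackbone"})
regional = Network("RegionalNet", peers={"NeighborNet"}, transit_providers={"BackboneCo"})
local = Network("LocalBroadband", transit_providers={"RegionalNet"})

for net in (backbone, regional, local):
    print(f"{net.name}: Tier {classify_tier(net)}")
```

The sketch deliberately ignores how much of the internet a network can reach through peering. That detail is why providers which reach a large share of the internet settlement-free, but not all of it, still count as Tier 2.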
Let’s map all these together:
The largest tech companies have always integrated directly with Tier 1 providers, striking deals to access their networks. Here’s a diagram from 2014, analyzing BGP (Border Gateway Protocol) relationships to show which companies integrated with which US Tier 1 providers:
What’s the upside of connecting with Tier 1 providers? The biggest is reduced network traffic costs. As an added bonus, being colocated with Tier 1 providers can also improve latency somewhat. However, to improve latency, the smartest choice is to use a content delivery network (CDN), which has hundreds or thousands of cache locations at Tier 3 internet providers, very close to end users. The most popular CDN providers are Cloudflare, Akamai and Fastly.
We cover more on CDNs – also known as Edge Network providers – in Uber’s move to the cloud. We also cover median compensation packages at Cloudflare, Akamai and Fastly – and who pays the most – in Compensation at publicly traded tech companies.
There are tech companies which go beyond CDN providers by building their own CDN. Netflix does this with Netflix OpenConnect, which is the company’s globally distributed CDN. By 2016, the company had already integrated with hundreds of internet service providers directly and was plugged into dozens of internet exchange points (IXPs):
Agoda – like other major tech companies – purchases circuits from Tier 1 providers NTT and GTT, leveraging the geographical coverage and intra-provider routing that’s possible by having the same ISPs everywhere. This means datacenter-to-datacenter traffic never has to leave the ISP’s private network.
3. Data center tiering
Agoda co-locates its DCs with major providers of Tier 3-or-higher facilities. Note that data center tiers are a different classification system from ISP tiers.
ISP tiering is unofficial and fluid, whereas data center tiering is formally defined by the Uptime Institute. There are 4 tiers, with 1 the lowest and 4 the highest – the opposite order of ISP tiering. Each tier is progressive, meaning Tier 3 incorporates all expectations of Tier 2: