Scaling Medium-Size Databases
The first symptoms that a web application needs to scale usually come from the database. Whether they’re connection errors from exceeding pool capacity or performance bottlenecks caused by slow queries and big tables, the database is typically the first piece you’ll need to troubleshoot in order to scale.
There are some very basic things you can do right away, such as improving indexes, adding more memory, or isolating the database on a dedicated server; in this article, however, I’ll assume we’ve already done so and focus on some medium-level techniques instead.
Master-Slave Replication
Especially on big tables, read queries such as SELECT will start slowing down pretty soon. Master-slave replication is about cloning the master database into one or more read-only slaves and dividing read queries between them. You can have as many slaves as you want.
Writes are performed on the master and replicated to its slaves in a synchronous or asynchronous fashion. Choosing between the two depends on your needs and even on what your DBMS supports, but in general the tradeoff is simple: synchronous is more accurate but slower, while asynchronous is faster but not as reliable.
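The read/write split described above can be sketched as a small query router. This is a minimal illustration, not a production implementation: the connection names and the `ReplicatedRouter` class are made up for this example, and a real router would hold actual connection pools rather than strings.

```python
import itertools

class ReplicatedRouter:
    """Send writes to the master, balance reads across slaves (sketch)."""

    def __init__(self, master, slaves):
        self.master = master
        # Round-robin iterator over the read-only slaves.
        self._slaves = itertools.cycle(slaves)

    def connection_for(self, sql):
        # Plain reads go to a slave; writes (and anything ambiguous)
        # stay on the master, since only it accepts writes.
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._slaves)
        return self.master

router = ReplicatedRouter("master-db", ["slave-1", "slave-2"])
```

With this setup, `router.connection_for("INSERT ...")` always returns the master, while consecutive SELECTs alternate between the two slaves.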
Vertical Partitioning
This technique is about moving specific tables to their own database, so that you end up with more than one database. It sounded crazy to me the first time I read about it, but it turned out to be a common approach, especially for applications with tables that have far more read/write demand than others.
Normally you’d identify those tables after an extensive analysis of queries and usage across your application before proceeding. There’s a great article on how Airbnb did this, in combination with master-slave replication, to stay on top of a period of rapid traffic growth.
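Once the high-demand tables are identified, the application needs a way to route each table to its database. A minimal sketch of that mapping, with hypothetical table and database names chosen purely for illustration, could look like:

```python
# Tables identified (through query analysis) as needing their own
# database; everything else stays in the default one.
TABLE_DATABASES = {
    "messages": "messages_db",  # hypothetical heavy read/write table
    "events": "events_db",
}

def database_for(table, default="main_db"):
    # Look up the table's dedicated database, falling back to the default.
    return TABLE_DATABASES.get(table, default)
```

Most ORMs and frameworks support this kind of per-table (or per-model) routing natively, so in practice the map usually lives in configuration rather than application code.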
Horizontal Partitioning (Sharding)
Although you’ll hear about it often, it’s actually advisable to delay this technique as long as you can, as it tends to add significant overhead to the development environment.
Sharding is a divide-and-conquer technique: you take a table with *illions of records and break it down into smaller, segmented tables so that record X only exists in, and is accessed directly at, table segment Y, thus balancing the entire load across segments.
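The “record X lives in segment Y” rule is typically a deterministic function of a shard key. Here is one possible sketch using a hash of the record id; the shard count, the `shard_for`/`table_for` helpers, and the `users` table name are all assumptions for the example, not a prescribed scheme.

```python
import zlib

NUM_SHARDS = 4  # hypothetical number of segments

def shard_for(record_id):
    # Deterministically map a record id to one segment, so the same
    # record is always written to and read from the same shard.
    key = zlib.crc32(str(record_id).encode("utf-8"))
    return key % NUM_SHARDS

def table_for(record_id, base="users"):
    # Segmented table names, e.g. "users_0" through "users_3".
    return f"{base}_{shard_for(record_id)}"
```

Note that changing `NUM_SHARDS` later remaps most records with a plain modulo scheme, which is one reason techniques like consistent hashing exist, and part of why sharding adds the overhead mentioned above.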
These techniques may have different implementations depending on the DBMS you’re using, and even the pros and cons can vary from one to another, yet they’re common practice across the industry.
If you’re experiencing scaling issues, I encourage you to dig deep into these topics and choose the one(s) most suitable for your needs, as they tend to work together fairly well.