MySQL 5.7 has sharding built-in through MySQL Fabric (Lab Release). And, MySQL has scaled the marketing team to handle 500.000 customer requests/s – or, was it MySQL 5.7 query throughput? Whatever. MySQL Fabric frees you from implementing sharding yourself. MySQL Fabric simplifies the management of MySQL Replication clusters of any size.
Is it any good? Yes, but MySQL must not stop here. This is not enough. MySQL is still a little too much behind trend setters.
Technically, MySQL Fabric is little more than a new management tool. At best, one could speak of a middleware based replication design to manage groups of primary copy clusters.
|Shard Group||Shard Group||…|
|Secondary (Slave, Copy)||…||Secondary (Slave, Copy)|
Catching up: simplified node and cluster deployment
Many NoSQL systems born after MySQL put much emphasis on distributed databases. Clustering and replicating databases became a must in 2010+. Setup and administration is easier than with MySQL Replication. MySQL Replication was introduced in 2000. Back then, replication was no mainstream. Very few deployed tens, hundrets or even thousands of MySQL servers for one web site. If setting up only two servers, how cares how comfortable it is as long as the database scales… In the years thereafter users created a rich set deployment tools.
Then came the new generation of databases: cluster setup made easy. This was a hard way of reminding MySQL of its own goals: it should take no longer than ten minutes from download to the running system. If your running MySQL system shall include a MySQL cluster, this is a hard to meet time limit. For a first-time user something either works, or it is out – the next minute. Agile, fast incremental updates, … MySQL Fabric has the potential to fix this. It is a safe bet to predict competition from within the MySQL ecosystem will strike back. Choice is always welcome!
|MongoDB||MySQL 5.7 with Fabric|
|1)||Install mongods||Install mysqlds|
|2)||On any node, create replica set||Using mysqlfabric, create master group|
|3)||On primary, add secondaries||Using mysqlfabric, add nodes, choose primary|
|4)||Built-in (proper) failure detector automatically activated||Choose failure detector, activate. E.g., use built-in (imperfect) failure detector.|
Partitioning (sharding) is the only way to scale reads and writes in a distributed database. Please see, the presentation “DIY – Distributed Database Cluster, or: MySQL Cluster” for an introduction into the theory behind distributed databases and why MySQL Cluster is an auto-partitioning/auto-sharding solution.
|Partitioning rule, for example: RANGE|
|MySQL server/shard 1||MySQL server/shard 2||MySQL server/shard 3|
shard_key = 1 .. 100
shard_key = 101 .. 200
shard_key = 201 .. 300
There are many approaches implement sharding on top of MySQL. Many big MySQL users have created their own application based sharding solutions years ago. Application based sharding was the use case for the International PHP Conference 2008 presentation on asynchronous queries supported by the PHP mysqlnd library (slides). The idea behind the approach is to have the application route certain queries to certain shards often based on some static rule.
MySQL Fabric follows the same basic pattern: clients must route requests to appropriate shards, application developers must provide a shard key. Fabric knows which shard must be used for any given shard key and applications or drivers talk to Fabric to learn about it.
This is not the most transparent solution (as opposed to MySQL Cluster) but a perfectly valid design decision. Application developers get maximum control over the data distribution. This is, for example, important when optimizing for latency and data locality. In general there are other ways to achieve the same but at the end of the day, as long as you can do what is needed…
Transactions and data model
Distributed databases have an enemy: transactions. As soon as lazy primary copy (classic MySQL Replication) is not sufficient to scale a given workload, and data is partitioned, distributed transactions may be required by queries spawning multiple partitions on multiple nodes. With MySQL Fabric, application developers might start using XA transactions. An XA transaction coordinates local transactions from multiple servers (shards). If, for example, you would want to fetch a consistent view of rows from multiple shards, you might start a distributed transaction covering those shards.
Distributed transactions are slow compared to local transactions. Some may argue it can be done (when implemented properly) sufficently fast (I’m among those), but the ultimate proof that local transactions are always faster is quite simple. Your network cable measures one meter, your HDD cable measures ten centimeters. Given constant speed of light any communication on the HDD cable will always be 10x faster. And, sooner or later we talk flash/RAM instead of spinning disks – even if we did not, a distributed transaction would have to persist results for D(urability) in ACID. On an aside: MySQL Cluster is "cheating" here, it will flush periodically to disk and avoid delaying transactions by waiting for the disk.
Hence, distributed databases must support a data model that avoids distributed transactions, if you want maximum performance. If anyhow possible, co-location should be used. This can be achieved, for example, using a tree schema (H-Store, ElasTraS), entity groups (Google Megastore) or table groups (Cloud SQL Server – Microsoft SQL Azure backend). Or, you denormalize (key/value respectively key/document).
How about Fabric? Well, there are global tables… I guess we have a lot to discover and explain to application developers . Guess what the topic of the workshop is that I am allowed to give at the PHP Summit (Berlin) in December: data models for distributed databases.