MySQL 5.7 has sharding built-in through MySQL Fabric (Lab Release). And, MySQL has scaled the marketing team to handle 500.000 customer requests/s – or, was it MySQL 5.7 query throughput? Whatever. MySQL Fabric frees you from implementing sharding yourself. MySQL Fabric simplifies the management of MySQL Replication clusters of any size.
Is it any good? Yes, but MySQL must not stop here. This is not enough. MySQL is still a little too much behind trend setters.
Technically, MySQL Fabric is little more than a new management tool. At best, one could speak of a middleware based replication design to manage groups of primary copy clusters.
MySQL Fabric | ||||
Shard Group | Shard Group | … | ||
Primary (Master) | … | … | ||
Secondary (Slave, Copy) | … | Secondary (Slave, Copy) |
Catching up: simplified node and cluster deployment
Many NoSQL systems born after MySQL put much emphasis on distributed databases. Clustering and replicating databases became a must in 2010+. Setup and administration is easier than with MySQL Replication. MySQL Replication was introduced in 2000. Back then, replication was no mainstream. Very few deployed tens, hundrets or even thousands of MySQL servers for one web site. If setting up only two servers, how cares how comfortable it is as long as the database scales… In the years thereafter users created a rich set deployment tools.
Then came the new generation of databases: cluster setup made easy. This was a hard way of reminding MySQL of its own goals: it should take no longer than ten minutes from download to the running system. If your running MySQL system shall include a MySQL cluster, this is a hard to meet time limit. For a first-time user something either works, or it is out – the next minute. Agile, fast incremental updates, … MySQL Fabric has the potential to fix this. It is a safe bet to predict competition from within the MySQL ecosystem will strike back. Choice is always welcome!
MongoDB | MySQL 5.7 with Fabric | |
---|---|---|
1) | Install mongods | Install mysqlds |
2) | On any node, create replica set | Using mysqlfabric, create master group |
3) | On primary, add secondaries | Using mysqlfabric, add nodes, choose primary |
4) | Built-in (proper) failure detector automatically activated | Choose failure detector, activate. E.g., use built-in (imperfect) failure detector. |
Sharding
Partitioning (sharding) is the only way to scale reads and writes in a distributed database. Please see, the presentation “DIY – Distributed Database Cluster, or: MySQL Cluster” for an introduction into the theory behind distributed databases and why MySQL Cluster is an auto-partitioning/auto-sharding solution.
table: users | ||||||||
---|---|---|---|---|---|---|---|---|
shard_key | first_name | other_columns[…] | ||||||
| | ||||||||
Partitioning rule, for example: RANGE | ||||||||
| | ||||||||
MySQL server/shard 1 | MySQL server/shard 2 | MySQL server/shard 3 | ||||||
table: users, shard_key = 1 .. 100 |
table: users, shard_key = 101 .. 200 |
table: users, shard_key = 201 .. 300 |
||||||
shard_key | first_name | other_columns[…] | … | … | ||||
1 | John | … | ||||||
… | Elena | … | ||||||
100 | Zacharias | … |
There are many approaches implement sharding on top of MySQL. Many big MySQL users have created their own application based sharding solutions years ago. Application based sharding was the use case for the International PHP Conference 2008 presentation on asynchronous queries supported by the PHP mysqlnd library (slides). The idea behind the approach is to have the application route certain queries to certain shards often based on some static rule.
MySQL Fabric follows the same basic pattern: clients must route requests to appropriate shards, application developers must provide a shard key. Fabric knows which shard must be used for any given shard key and applications or drivers talk to Fabric to learn about it.
This is not the most transparent solution (as opposed to MySQL Cluster) but a perfectly valid design decision. Application developers get maximum control over the data distribution. This is, for example, important when optimizing for latency and data locality. In general there are other ways to achieve the same but at the end of the day, as long as you can do what is needed…
Transactions and data model
Distributed databases have an enemy: transactions. As soon as lazy primary copy (classic MySQL Replication) is not sufficient to scale a given workload, and data is partitioned, distributed transactions may be required by queries spawning multiple partitions on multiple nodes. With MySQL Fabric, application developers might start using XA transactions. An XA transaction coordinates local transactions from multiple servers (shards). If, for example, you would want to fetch a consistent view of rows from multiple shards, you might start a distributed transaction covering those shards.
Distributed transactions are slow compared to local transactions. Some may argue it can be done (when implemented properly) sufficently fast (I’m among those), but the ultimate proof that local transactions are always faster is quite simple. Your network cable measures one meter, your HDD cable measures ten centimeters. Given constant speed of light any communication on the HDD cable will always be 10x faster. And, sooner or later we talk flash/RAM instead of spinning disks – even if we did not, a distributed transaction would have to persist results for D(urability) in ACID. On an aside: MySQL Cluster is "cheating" here, it will flush periodically to disk and avoid delaying transactions by waiting for the disk.
Hence, distributed databases must support a data model that avoids distributed transactions, if you want maximum performance. If anyhow possible, co-location should be used. This can be achieved, for example, using a tree schema (H-Store, ElasTraS), entity groups (Google Megastore) or table groups (Cloud SQL Server – Microsoft SQL Azure backend). Or, you denormalize (key/value respectively key/document).
How about Fabric? Well, there are global tables… I guess we have a lot to discover and explain to application developers :-). Guess what the topic of the workshop is that I am allowed to give at the PHP Summit (Berlin) in December: data models for distributed databases.
Happy hacking!
Pingback: PHP Unconference Hamburg 2013: hungry for more | Ulf WendelUlf Wendel
2013/09/23 at 19:09
I prefer the term “sharded tables” and “unsharded tables” . Unsharded tables are duplicated on all shards, while sharded tables are distributed between shards based on the shard key.
Also, how does fabric prevent the user from joining between tables that are sharded by different keys? Shard-Query prevents this by only allowing one shard key per virtual schema.
2013/09/23 at 19:25
It is a good idea to speak of sharded and unsharded tables. I should follow your wording.
Currently, Fabric does not prevent anything. Shards even accept writes for keys for which they are not responsible. This is something future versions should not allow, as far as I know. But, you are raising another issues that – to my knowledge – is not covered yet. Nobody thought about it.
The few existing fabric aware drivers don’t even cover all basics yet :-/. I guess, we can learn alot from http://code.google.com/p/shard-query/
Pingback: OL 4 MySQL: Extending my VM’s root f/s online | MySQL-Med
2013/09/27 at 07:17
Hi Justin,
Regarding the use of “sharded tables” and “unsharded tables” it seems to be more common to call the unsharded tables “global”. Personally, I have no strong opinion, so either one is fine and will probably vary depending on the context where one describes it.
In Fabric, is architected to support multiple mappings/indexes (I assume that is what you call a virtual schema) in the same manner: by restricting it to use a single key per mapping. It is possible to shard tables using multiple keys, but that requires executing a scatter-gather query whenever a shard is accessed, which affects performance and is also more complicated to implement. For the first version, we have just assumed there is a single mapping because the handling of global updates were not entirely sorted out.
MySQL Fabric uses a global server/group to execute global updates and then use replication to replicate the updates to the shards. If you note how the computation of the shard is done, MySQL Fabric accept a list of tables for the transaction and use that to automatically compute which shard mapping it belongs to and can thereafter compute the shard from the shard key. The case that were not fully sorted out is what to do when all tables where unsharded/global. Currently, a global server/group was assigned per shard mapping, but going forward we intend to have a global server/group for each Fabric instance instead.
Pingback: A new kid in the MySQL sharding world « Serge Frezefond 's blog
Pingback: MySQL Fabric with MariaDB Galera Cluster ? « Serge Frezefond 's blog
2015/10/01 at 19:48
Thanks Liran :)I believe I did touch the seucbjt of 2nd-gen sharding technologies when I referred to shard-able drivers and “shardware” (which stands for sharding middleware). My point, however, is that regardless the use of these technologies and no matter how transparent these are, at the end of the day there may be limitations (e.g., cross-shard TX or DB-embedded logic) that stem from the fact that sharding is external to the DBMS.As for MySQL becoming the bottleneck, this is certainly the case but ONLY when considering the standard, SMP-based setup. A distributed system design (such as ours) allows you not only to have the resources independently assigned to each storage partition but also to scale out and parallelize the database engine layer.