Ulf Wendel

PECL/mysqlnd_ms: Distributed Transaction/XA support coming, e.g. for MySQL Fabric?

2014/06/13 - by admin - 4 Comments

The development version of PECL/mysqlnd_ms now has a very early version of distributed (2PC/XA) transaction support tailored for use with clusters of MySQL servers. XA transactions can span multiple servers in a cluster to ensure transaction guarantees among them. In real life there is always, at some point, a sequence of work on your data for which atomicity, consistency, isolation, and durability must be given. The early generation of NoSQL sharding solutions has a weak spot here: how do you, as a developer, ensure that a logical operation affecting two shards is applied either to both or to none? Either you don’t at all, or you hack and pray, or you start reading lecture books on transactions, or you grow your documents to avoid cross-shard operations. MySQL Fabric, our own sharding solution, has a gap here too. However, three new API calls do the troublesome part for you: mysqlnd_ms_xa_begin(), mysqlnd_ms_xa_commit() and mysqlnd_ms_xa_rollback().

Any sharding cluster: consistent work across shards?
Any task involving more than one shard, for example: copy customer information into an order and update the order counter stored with the customer in one step. Or: update the email address of a user in all places – forum users, orders, customers.

  Shard 1 – Primary               Shard 2 – Primary
  Customers: US residents only    Customers: rest of the world
  Forum users: all                Orders: all
        |        |                      |        |
       Copy     Copy                   Copy     Copy

Massive news but where are we on distributed transactions?

The current XA support of the PHP Replication and Load Balancing plugin for mysqlnd is hardly any good – yet! Still, we sort of have a tradition of explaining pain points and feature gaps early. MySQL 5.6/5.7 is on track learning NoSQL lessons (protocol and query language flexibility, nested data, replication experience), and MySQL Fabric is now GA, bringing us sharding and automated high availability outside MySQL Cluster. Fabric abstracts managing shards, shuffling data around during resharding and telling a client where to find data. Also, the core server improves, and some massive pain points from when we started (properly detecting transaction boundaries, reliably finding up-to-date replicas) are being addressed.

But: distributed transactions… Let’s keep the focus narrow and talk XA only; let’s forget about cross-shard joins here. Even if the PHP MySQL APIs had proper XA support (and, of course, it will be so in no more than two hours ;-)), we would hit MySQL limitations. Hence, time to start barking.

How it shall look in code… one day

That warning given, let’s talk code. For the “copy customer information into order and update order counter stored with customer in one step” task from the introduction, your code looks like this, if you want transaction guarantees:

/* Connect to cluster: mysqli, PDO_MySQL, ... whatever */
$link = mysqli_connect('my_sharding_cluster', ...);

/* Begin distributed transaction */
mysqlnd_ms_xa_begin($link, $xa_id);

  /* Do something on your shards */
  mysqlnd_ms_fabric_select_shard($link, "shop.customers", "us_123");
  $link->query("SELECT * FROM customer
                 WHERE key = 'us_123'");
  ...
  $link->query("UPDATE customer SET
                  order_counter = order_counter + 1
                 WHERE key = 'us_123'");
  mysqlnd_ms_fabric_select_global($link, "shop.orders");
  $link->query("INSERT INTO order(...) VALUES (...)");

/* End distributed transaction */
mysqlnd_ms_xa_commit($link, $xa_id);

Remove mysqlnd_ms_xa_begin() and mysqlnd_ms_xa_commit() and you go without distributed transactions. Your choice! Should there be a series of failures at the right time, you get logically inconsistent data without the distributed transaction. With MySQL, individual actions will still be consistent as we default to InnoDB these days. InnoDB is nowadays faster than the non-transactional MyISAM from the ancient days of MySQL, and very much so under load.

A word on XA/2PC to illustrate the convenience PECL/mysqlnd_ms brings

The XA specification has 94 pages. The (My)SQL world folds this into six XA related SQL commands. XA follows the two-phase commit protocol. The protocol operates in two rounds and has two distinct players. The players are a coordinator and the participants. In the first phase the coordinator asks all participants whether they are ready to commit some previous work carried out on them.

Two-phase commit (2PC) in XA: first phase

  Coordinator –> Vote request –> Participant 1, Participant 2, Participant 3
  Coordinator <–  Pre commit  <– Participant 1, Participant 2, Participant 3

A participant that replies to the coordinator’s vote request with a pre commit makes the firm promise that the work done will not be lost even in the case of a temporary failure of itself, e.g. when it crashes. But the pre-committed work does not yet become visible to others. Should all participants reply positively to the vote request, the coordinator sends a global commit, which makes the work visible to everybody.

Two-phase commit (2PC) in XA: second phase

  Coordinator –> Global commit –> Participant 1, Participant 2, Participant 3

Failures of participants are no problem for the protocol. The coordinator can use timeouts to detect unresponsive participants and either retry after the participant is recovered or send a global rollback to the rest. But should the coordinator crash in the course of informing participants of a global rollback or global commit decision, the protocol becomes a blocking one. The participants wait for instructions.

Two-phase commit (2PC) in XA: blocking protocol

  Coordinator –> Global commit –> Participant 1 (committed)
  Coordinator crashes
  Participant 2, Participant 3: uncommitted, waiting for global commit or rollback

If you call mysqlnd_ms_xa_begin(), PECL/mysqlnd_ms acts as a coordinator. It must handle all the state transitions of the participants, the possible failures of participants and its own failure as a coordinator.
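The coordinator role sketched above can be illustrated with a small, self-contained simulation in plain PHP. The Coordinator and Participant classes are made up for illustration – they are not part of PECL/mysqlnd_ms – but they implement the same two-phase decision logic: collect votes, then issue a global commit or global rollback.

```php
<?php
/* Minimal two-phase commit simulation. Participant and Coordinator
 * are illustrative stand-ins, not part of PECL/mysqlnd_ms. */
class Participant {
    private $work;
    private $prepared = false;
    public $committed = false;
    public function __construct($work) { $this->work = $work; }
    /* Phase 1: vote request -> pre commit (true) or abort (false) */
    public function prepare() {
        try {
            call_user_func($this->work);
            return $this->prepared = true;
        } catch (Exception $e) {
            return false;
        }
    }
    /* Phase 2: apply the coordinator's global decision */
    public function commit()   { $this->committed = $this->prepared; }
    public function rollback() { $this->prepared = false; }
}

class Coordinator {
    /* Returns true on global commit, false on global rollback */
    public function run(array $participants) {
        /* Phase 1: send vote requests, collect the votes */
        foreach ($participants as $p) {
            if (!$p->prepare()) {
                /* One negative vote: global rollback for everybody */
                foreach ($participants as $q) { $q->rollback(); }
                return false;
            }
        }
        /* Phase 2: all voted yes, global commit */
        foreach ($participants as $p) { $p->commit(); }
        return true;
    }
}

$coordinator = new Coordinator();
$ok  = $coordinator->run(array(new Participant(function () {}),
                               new Participant(function () {})));
$bad = $coordinator->run(array(new Participant(function () {}),
                               new Participant(function () { throw new Exception("disk full"); })));
var_dump($ok, $bad); // bool(true), then bool(false)
```

A real coordinator additionally has to persist its decisions, which is exactly what the garbage collection described below is for.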

Should you have become curious about 2PC or the challenges of distributed transactions… MySQL Cluster is using two-phase commit internally (2PC), see also the presentation DIY: A distributed database cluster, or: MySQL Cluster or the really in-depth presentation Data massage: How databases have been scaled from one to one million nodes, which takes you from early distributed NoSQL to the very latest developments.

What PECL/mysqlnd_ms does for you, what it hides, where it’s limited

As said, there are six SQL commands related to XA/2PC:

  • Phase 1 related
    • XA BEGIN – mark the beginning of a unit of work on a participant, telling the participant what to vote about
    • XA END – mark the end of the unit of work on a participant
    • XA PREPARE – the pre commit
  • Phase 2 related
    • XA COMMIT – global commit
    • XA ROLLBACK – global rollback
    • XA RECOVER – error handling: list pre commits

PECL/mysqlnd_ms further compresses this to three API calls. The calls hide issuing the SQL commands on the participants and all the steps a coordinator has to perform. Some features are lost on the way, but convenience is won.

The loss of features is not critical: as a general rule we strive to provide you with a way to overrule any automatic action, go down to the lower layers and handle anything on your own. The features you lose are the choice of how to react when participant failures happen, and support for XA participants other than MySQL servers. Upon any participant error the automatic action is rollback. Future versions may lift this limitation – I am blogging about a pre-alpha development version. At this point, I also have not bothered about any XA participants other than MySQL servers. The XA specification describes distributed transactions for any system implementing a certain API. That could be a database server, a web service or a moon rocket. Other RDBMS also feature SQL commands for XA, so you could have Microsoft SQL Server, IBM’s DB2, Oracle, Postgres and MySQL jointly working on a distributed transaction. If you need that, use the SQL commands directly.
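Going down to the SQL layer yourself looks roughly like the sketch below: the six XA commands issued over plain mysqli connections to two servers. Host names, credentials, table and column names are placeholders; error handling is reduced to the bare minimum.

```php
<?php
/* Manual XA across two MySQL servers using the SQL commands directly.
 * All identifiers and credentials are placeholders. */
$gtrid  = "'trx-42'"; /* global transaction identifier, quoted for SQL */
$shard1 = new mysqli('shard1.example.com', 'user', 'pass', 'shop');
$shard2 = new mysqli('shard2.example.com', 'user', 'pass', 'shop');
$participants = array($shard1, $shard2);

/* Open a transaction branch on every participant, then do the work */
foreach ($participants as $p) { $p->query("XA START $gtrid"); }
$shard1->query("UPDATE customer SET order_counter = order_counter + 1 WHERE id = 'us_123'");
$shard2->query("INSERT INTO orders (customer_id) VALUES ('us_123')");

/* Phase 1: end the branches and collect the pre commits */
$prepared = true;
foreach ($participants as $p) {
    if (!$p->query("XA END $gtrid") || !$p->query("XA PREPARE $gtrid")) {
        $prepared = false;
        break;
    }
}

/* Phase 2: global commit only if every pre commit succeeded */
foreach ($participants as $p) {
    $p->query($prepared ? "XA COMMIT $gtrid" : "XA ROLLBACK $gtrid");
}
```

This is exactly the bookkeeping that mysqlnd_ms_xa_begin()/mysqlnd_ms_xa_commit() hide – plus participant tracking and crash recovery, which the sketch leaves out entirely.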

I have had MySQL Fabric in mind, the sharding scenario. Fabric manages farms or clusters of MySQL servers, nothing else. And MySQL should not lose one of its strengths when combined with Fabric: transactions, should you need them.

PECL/mysqlnd_ms as a transaction coordinator

Should you not fear the risk of blocked servers when the coordinator crashes, you can use the new functions straight away.

On mysqlnd_ms_xa_begin(), PECL/mysqlnd_ms will first try to find out whether you are in the middle of a local transaction. Local transactions and global transactions (XA) are mutually exclusive. To detect local transaction boundaries, PECL/mysqlnd_ms monitors all API calls related to them, e.g. mysqli_begin_transaction(), mysqli_autocommit(). It does not monitor SQL commands, such as BEGIN, though. Details are described in the manual. Good news is: mid-term the MySQL server will announce transaction boundaries and we can make things bulletproof. Should you not be in the middle of a transaction, the plugin remembers the so-called gtrid (global transaction identifier) given in the second parameter: bool mysqlnd_ms_xa_begin(mixed connection, int gtrid). At the time of writing, gtrid is massively limited – I haven’t finished my SQL-related C code. That’s for sure a temporary limitation.
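The mutual exclusion can be sketched like this. This is a hedged sketch only: it assumes PECL/mysqlnd_ms with XA support installed and configured, and a cluster entry named my_sharding_cluster; the exact failure behaviour of the pre-alpha may differ.

```php
<?php
/* Local and global (XA) transactions are mutually exclusive.
 * Requires PECL/mysqlnd_ms with XA support; names are placeholders. */
$link = mysqli_connect('my_sharding_cluster', 'user', 'pass');

$link->begin_transaction();          /* monitored API call: local trx begins */
if (false === mysqlnd_ms_xa_begin($link, 1)) {
    /* refused: we are still inside a local transaction */
    $link->commit();                 /* end the local transaction first... */
    mysqlnd_ms_xa_begin($link, 1);   /* ...then the XA transaction can start */
}
```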

Then, as you continue, PECL/mysqlnd_ms transparently injects XA BEGIN gtrid on every shard/node that you run a query on. When doing so it also calculates a bqual (branch qualifier) but does not use it yet – for the same reason as above: lazy me, early blogging…

Built-in garbage collection to cover coordinator crashes

I assume most users talking transactions want to see a coordinator crash covered too. This is done by optionally recording all state changing actions in a persistent store that survives a crash. Should the coordinator – your PHP process/script – crash, another process/script can do a garbage collection run to get the participants into a defined state: rollback, clean up.

Garbage collection is built in. It is done the traditional way: wait for the next PHP process/script to begin its work and run the garbage collection based on a probability value – very much as PHP sessions or PECL/mysqlnd_qc (the query cache plugin) do it. An additional timeout can be configured to decide whether a recorded global transaction and its coordinator are still active or should be considered crashed. Had I invested more brain power already, maybe the timeout would not be needed. Key is: it’s automatic.
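The two ingredients – a probabilistic trigger and a timeout-based crash heuristic – can be sketched in a few lines of plain PHP. Function names and the record layout are made up for illustration; they are not the plugin’s internals.

```php
<?php
/* Probabilistic GC trigger, as PHP sessions do it: run GC on roughly
 * $gc_probability out of $gc_divisor requests. Names are illustrative. */
function maybe_run_gc($gc_probability, $gc_divisor, $gc) {
    if (mt_rand(1, $gc_divisor) <= $gc_probability) {
        call_user_func($gc);
        return true;
    }
    return false;
}

/* A GC run considers coordinators of recorded transactions older than
 * the timeout as crashed and schedules a global rollback for them. */
function collect_garbage(array $records, $timeout_sec, $now) {
    $to_rollback = array();
    foreach ($records as $gtrid => $started_at) {
        if ($now - $started_at > $timeout_sec) {
            $to_rollback[] = $gtrid; /* coordinator presumed dead */
        }
    }
    return $to_rollback;
}

$records = array('trx-1' => 100, 'trx-2' => 990);
var_dump(collect_garbage($records, 60, 1000)); /* only trx-1 exceeded the timeout */
```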

Note that for the garbage collection to work, one needs to record information on how to connect to the participants. Recording of user name and password has been made optional, and the special meaning of localhost has been thought of. Should the GC be run on a different host than the one that wrote a record with host=localhost, we need to know which IP localhost refers to. Some things are covered and tested – but this is pre-alpha…

The plugin supports using any data store as a backing store, but only one storage backend is implemented: MySQL. MySQL is transactional, survives crashes and after all we are talking about managing MySQL farms here.
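A plugin configuration using the MySQL backing store could look roughly like the fragment below. This is a sketch from memory of the development documentation: the exact key names in the pre-alpha may differ, so treat every key as an assumption and check the manual.

```json
{
    "myapp": {
        "master": { "master_0": { "host": "localhost" } },
        "slave":  { "slave_0":  { "host": "192.168.2.27" } },
        "xa": {
            "state_store": {
                "participant_localhost_ip": "192.168.2.1",
                "mysql": {
                    "host": "192.168.2.1",
                    "user": "xa_gc",
                    "password": "secret",
                    "db": "state_store"
                }
            },
            "garbage_collection": {
                "probability": 5,
                "max_retries": 3
            }
        }
    }
}
```

Note the participant_localhost_ip entry: it is exactly the “which IP does localhost refer to” problem from above, made explicit in the configuration.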

mysqlnd_ms_xa_commit(), mysqlnd_ms_xa_rollback() – note the limitation

Upon mysqlnd_ms_xa_commit(), the plugin becomes chatty. It tells all participants that the work is done (XA END) and gathers pre commit replies (XA PREPARE) to decide on global commit or global rollback. Should all participants give a positive reply, it sends a global commit (XA COMMIT) as requested and mysqlnd_ms_xa_commit() indicates success.

The emergency brake is pulled in case of any error with the participants: roll back as much as we can, leave any possible rest to the garbage collection. There is currently no way for the user to react to low-level errors and, for example, do a reconnect to a participant. To handle such cases, the user would need to start handling the connections to participants themselves. Any API for doing this would be more complex than just doing everything yourself using SQL and plain vanilla connect()/query() calls! (These lines are being written in early June 2014 – check the manual for updates should you read this later. Some tweaks are likely to happen.)

The function mysqlnd_ms_xa_rollback() also ends the XA transaction on the participants but always sends out a global rollback.

The sad truth: not yet with Fabric, plugin refactoring needed…

Before I can close and hand the black jack to the overloaded server folks, well, yeah: it doesn’t work together with Fabric yet. Our fault.

The reason is an implementation detail of PECL/mysqlnd_ms, which is not visible to you. The plugin does not have any dedicated connection management module. We keep a list of all servers of a cluster. When we choose a server for you, say the read-write splitter tells us to use a master, we open a connection and associate it with that server list. When you use Fabric and call mysqlnd_ms_fabric_select_shard()/mysqlnd_ms_fabric_select_global(), as in the example from the beginning, the whole list is replaced. With it, the connections are gone. If there was a participant connection in the list… *ouch*. This is also the reason why I think the state alignment feature is broken when using Fabric.

Fixing this, using a simple option, takes days. However, Andrey, Johannes and I need to find a time to discuss our options. Documenting XA, which I have not done yet, also takes days.

The sad truth: our server has limitations too…

Any server change takes way more time for various reasons: the server is more complex, server code plays in a different league, there is little chance of saying “we just did it that way”, server developers are a rare species, and the server has a different release cycle. The server manual reads:

If an XA transaction has reached the PREPARED state and the MySQL server is killed (for example, with kill -9 on Unix) or shuts down abnormally, the transaction can be continued after the server restarts. However, if the client reconnects and commits the transaction, the transaction will be absent from the binary log even though it has been committed. This means the data and the binary log have gone out of synchrony. An implication is that XA cannot be used safely together with replication.

In other words: if you don’t mess around, if you prefer the emergency brake – rollback – approach, as PECL/mysqlnd_ms does, over any attempt to recover, there is a chance it works despite not being bulletproof. I can only hope any server plugin toying with XA keeps an eye on this. The manual then goes on:

It is possible that the server will roll back a pending XA transaction, even one that has reached the PREPARED state. This happens if a client connection terminates and the server continues to run, or if clients are connected and the server shuts down gracefully. (In the latter case, the server marks each connection to be terminated, and then rolls back the PREPARED XA transaction associated with it.) It should be possible to commit or roll back a PREPARED XA transaction, but this cannot be done without changes to the binary logging mechanism.

Again, make the emergency brake policy your default, and it may become acceptable – at least for coordinators and implementations tailored for MySQL Fabric. If those are all the restrictions, and no major bug is lurking around, then it’s time for: how to use distributed transactions with MySQL Fabric sharding ;-).

For proper XA support, as it may be needed by JDBC, these restrictions could be a deal breaker.

Happy PHP hacking!

@Ulf_Wendel Follow me on Twitter

Background: The PHP replication and load balancing plugin

Happy birthday! Three years ago we pimped all the PHP MySQL APIs to make using any sort of MySQL cluster easier. We began developing a plugin which does all the annoying jobs that clusters may burden developers with: transaction aware read/write splitting if needed, load balancing with the option to assign weight to a replica, connection state alignment when switching servers, automatic client failover, user defined filters/pipes for replica selection, an abstraction for requesting eventual consistency, session consistency (defined as read-your-writes) and strong consistency with a single API call. That is 90% of the infrastructure one needs to get all out of MySQL Fabric. But, MySQL Fabric support is still in the works.

All these things are done on the client side to scale by client. Unlike with basic proxy solutions there is no single point of failure, as a single client’s failure does not affect the rest.

Yet, if used for nothing but read-write splitting, the ease of use is not compromised. It boils down to installing the plugin and replacing the host name in your API’s connect call mysqli_connect("my_mysql_server", ...) with the name of an entry from the plugin’s configuration file: new PDO("mysql:section_from_ms_config", ...). And, as you dig deeper into massive database clustering, you will soon realize how powerful it is to have clients supporting a specific cluster design, for which you just don’t want an all-automatic solution.
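For the simple read-write splitting case, a minimal configuration file could look like this. Host names and the section name myapp are placeholders; consult the plugin manual for the full set of options.

```json
{
    "myapp": {
        "master": {
            "master_0": { "host": "localhost", "socket": "/tmp/mysql.sock" }
        },
        "slave": {
            "slave_0": { "host": "192.168.2.27", "port": "3306" },
            "slave_1": { "host": "192.168.2.28", "port": "3306" }
        },
        "pick": [ "roundrobin" ]
    }
}
```

With mysqlnd_ms.enable=1 and mysqlnd_ms.config_file pointing at this file in php.ini, mysqli_connect("myapp", ...) sends writes to the master and load balances reads round-robin over the two replicas.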

PHP Unconference Europe, a legend. But lacking PHP…

2014/05/21 - by admin - 0 comments

A legend is born: PHP Unconference Europe, Majorca. Nearly one hundred witnesses exist. The #phpuceu combines the proven concept of an unconference with the beauty of the island of Majorca. Within a day after the end of the event, one third had bought a ticket for next year’s issue (May 9-10, 2015)! We all knew for years that unconferences work. We finally have one at a place that is easy to get to and is just about perfect for any kind of visitor. Goal accomplished. Just one detail, PHP (partner happiness program), was missing…

The power of the people

Unconferences give power to the people. Participants gather in the morning to propose sessions and vote on an agenda for the day. Sponsors and participants never know what road will be taken. However, with a solid number of PHP experts and frequent speakers traditionally being around, the worst that can happen is that they are asked to repeat one of their usual talks… I have visited some ten PHP Unconferences in the past years. Usually, even heavy-weights sometimes have issues getting enough votes for a slot because the other proposals are so hot! So were some of this year’s sessions.

Sessions

What kind of sessions happen is again up to the audience! Johannes Schlueters got it right and gathered with followers in the lobby area when no room was available. Kristian Koehntopp tortured listeners with a spontaneous drawing on a tissue of how to get benchmarking right… Long, long breaks for discussions and the need to become active are part of the game.

The soft factor: Majorca – it’s not expensive

I must say, I am surprised how well the visitors behaved. The no-show rate on the first day was as little as three percent. This is absolutely amazing. Recall that a flight to Palma can cost you no more than a train trip through your country. Round trip flights to many destinations are around 150-200 Euros. A train ticket from Kiel in the north of Germany, where I live, to Munich in the very south and back is 210-280 Euros. And, #phpuceu sponsor trivago reports 210 euros for a single room over the weekend in Munich, which is an exact match for Palma.

At the same cost as the national PHP Unconference Hamburg, I get a weekend vacation in the sun! Still, people showed up early on Sunday morning with a smile on their faces. They knew yummy catering (tapas) was on the agenda for lunch.

If everybody is relaxed, discussions go smoother. And exchanging ideas, getting connected is all what it is about: whether you wanted to bother us MySQL folks, ask Zend about the green jumbo elePHPant, hack PHP or squeeze out long-term project management consultants… all up to you. Or, how about table soccer to break the ice?

The smile factor: food, sun and partying

Compared with a national unconference, say one in Munich, Palma has more party locations and better drinks to offer. The 1 litre beer monster, which goes stale before it reaches you in a huge umpfda-umpfada hall, is available. If you really want that… see the photo.

But then, there is also the other world of smaller restaurants, with the usual Twitter or Facebook comment going like: “Relaxing after the #PHPucEU”, or “Echoes of the #phpuceu over great food and wine”, or this one http://click-to-read-mo.re/p/7dr8/5321aee8… The fine dining or the simple yet tasty local food, the vacation feeling, the beaches, the old town…

… hiking in the hills, cycling or riding a scooter. Quite a few decided it would be a good idea to stay a day longer for some nice time on the island in late spring. This has a very positive effect on the unconference itself: there is less of a rush on Sunday. People stay for an additional night and the last session of the day is almost as popular as the first one was.

The trained elePHPant

On the first day the organizers (Judith Andresen, Fabian Blechschmidt, Jonathan Maron, Karl Spies) announced a photo competition. Zend had given them one of the big green elePHPants for the winner. The task was to take a photo of the elePHPant going on safari during the event. The winning photo is…

elePHPant

… and the winner got a hug from one of the organizers, after the latter had stopped biting the elePHPant! Something that obviously irritated the new shepherd of the elePHPant.

My own elePHPants, well prepared and trained for the trip, became friends with him. However, I think the big one is not quite the athlete the smaller ones are.

PHP – Partner happiness program

This was a wonderful event: work and fun combined! Despite the fun, everybody needs measurable results to justify their investments. Search Twitter, Facebook, Google+ or blog posts for opinions, if you have not participated and want to know if all was green – not just the elePHPant.

Of course, sponsors do the same. For sponsors it is not all about getting a new hire or immediately winning a big customer. These events have a lot to do with learning what really matters to the crowd. It is about reaching out to key people who can act as multipliers. And it is about fostering the community. Obviously, many sponsors have been very pleased with this year’s results. AFAIK, some announced their strong interest to come back in 2015, or even gave firm promises. Let’s shout THANK YOU at:

PHP is missing! Spouses and family welcome !?

For next year, it would be perfect if there was a partner happiness program – short: PHP. Ideally, it would be run as an unconference. All spouses gather in the morning to propose something and vote on their plans for the day. I would be fine with that as long as they don’t start stealing sponsors, or do photo contests about the best-looking new potential partner or the like…

Again, 200 reasons to go there…

Happy hacking!


Train your elePHPants for the PHP Unconf EU – Palma de Majorca!

2014/05/01 - by admin - 1 Comment

Majorca! May 17-18, 2014! Could there be a better place and time on earth for the PHP EU Unconference: sun, fun, plenty of cheap flights, affordable accommodation? The PHP EU Unconference (tickets) is the international version of the national unconference. The national one frequently sells out quickly (>350 tickets), not only because it matches other conferences despite a ticket price of only ~40 Euro. Traditionally, the international one is more cosy, with about 100 elePHPant lovers coming in the years before.

Palma de Majorca

A schedule that really matters – with a solid fallback…

The key to success is the very nature of an unconference. There is no call for papers and no program committee that makes more or less educated guesses about what the audience might want to see. There is no risk of a schedule dominated by sponsor talks. It is the very audience that gathers every morning to decide on the schedule for the rest of the day. This works amazingly well: only topics that really matter to the audience have a chance to pass the voting. And, with many heavy-weights and frequent speakers being around, the worst that can happen is that they give a talk. The "2716th What’s new in PHP", the 1872th "MySQL – my love" and the "999th PHPUnit talk" are available at your convenience, but only if there is nothing hotter than that ;-). That said, here’s a random list of well-known nerds that are said to be coming in two weeks: Sebastian Bergmann, Johannes Schlueter, Stefan Priebsch, Kore Nordmann, Lars Jankowsky, Arne Blankerts, Boris Erdmann, …



Train your #elePHPant for Majorca! #phpuceu Take photos of your #elePHPant and tweet them.
(@phpuceu)

Sun and fun

A day at either the national or the PHP EU Unconference follows the same pattern:

  1. eat and drink: you have breakfast together – your ticket includes catering during the day
  2. contribute: you propose a talk and hope for votes or just vote
  3. contribute: you start talking either in front of others or with others
  4. eat and drink: you have lunch
  5. contribute: continue talking
  6. eat, drink and contribute: party – sleep, goto 1)

All this hard work finally takes place where it should to ensure a relaxed atmosphere. If you plan an international meeting without knowing from where people come, the travel costs are a constant. Pick any place having a major airport, pick one that is beautiful in May! Majorca is a perfect choice.

Flights are available, still

Being a popular party location, many low-cost flights are available. Round-trip flights from Paris/London/Berlin to PMI are still available starting at ~120 Euros. During JS Unconf, I’ve heard of people sharing holiday homes (including Wifi :-)) for very low rates. If you have a deeper pocket you may prefer the unconference hotel. Check for a good offer. My wife and I booked two days ago at a very reasonable rate.

If you consider staying in Palma for some extra days, it may happen that you run into some other Unconf visitors. Those I spoke to who said they consider spending some extra days in Majorca tend not to arrive before May 16 but to stay longer than May 19.

Train your elePHPants!

When I told my elePHPant that we had ultimately booked the trip to Majorca, he went crazy. He insisted on being trained for life on sunny beaches. Hope he will have the opportunity to get to know many new friends. Go share photos of how you prepare your elePHPant for sun, fun and PHP EU Unconf with @phpuceu.

And, as you are at it, think about session proposals. At an unconf you are not limited to giving a talk. There is a time slot waiting to be filled with whatever gets enough votes. You may present, discuss, entertain, train elePHPants, run an elePHPant meetup…

Happy hacking!


JS Unconf Hamburg 2014

2014/04/28 - by admin - 2 Comments

What happens when you take a well-known recipe for unconferences, a group of young, motivated and talented chefs, and some 300 attendees is in the photos below. Some call it the JavaScript Unconference Hamburg (JS Unconf). Some say they had seasons in the sun. All I know is: I’ve been there and will come again!

The menu delivered by the chefs is as tasty and sophisticated as that of a commercial conference, but it does not cost you a fortune. For some 40 Euro you get two days packed with talks, intensive discussions and nice fellows. It is the people that drive an unconference, and that becomes obvious the very moment you get involved. People know their jobs, they know your problems as they have experienced the same. People have something to share and they want to share it with others!

How it works

Like the PHP Unconf Hamburg, the JS Unconf takes place in a building of Hamburg university in the heart of Hamburg. On Saturday morning everybody gathers in or around the “bath tub” in the lobby area to propose sessions. Then voting starts to find the lucky ones that run their discussions, talks or whatever has been proposed. Six lecture rooms that can fit 50 to 250 people each are available, and there are up to four rounds of sessions per day. Despite the large number of slots it is not easy to get elected – the level is high. And there is no program committee that could prefer a big name or a sponsor over a just brilliant session. Only what’s hot to the very audience on the very day gets elected: AngularJS, Hoodie, React, …

Being unfamiliar with the JavaScript celebrities, I cannot give a list of big names spotted at the event. I only managed to identify one of the uber-names in Jan Lehnardt (CouchDB, Hoodie), a solid group of well-known PHP experts that also do JavaScript work, and the all-mighty Kristian Koehntopp (working with a group of ’40 sufficiently crazy nerds’). However, my first, random click on the program and slides shows that the speaker has once written a book. Need I say more about how these ‘small’ events compare to the ‘big’ ones? Small means some 300 tickets sold – 300 arguments to join next year… get your ticket in time, it was sold out.

New: lightning talks

The fresh group of JS Unconf chefs has spiced up the event with a change: lightning talks. It is a clever change inspired by two observations. First, some topics and thoughts worth sharing and presenting do not fill 45-minute slots. Also, giving a 45-minute show takes some serious time to prepare, depending on the topic. Second, many people have to travel back early on Sunday afternoon, which means they are forced to leave in the middle of a talk.

The talks took place in the “bath tub” with everybody enjoying the show and leaving whenever need be. They absolutely delivered what to expect at an Unconf: ranging from requests for help, to presenting a thought, to showing how to monitor your power consumption using a Raspberry Pi – toys and stuff! Just watching how people use a 12” notebook to share slides when no beamer is available was worth it.

It just works…

I have a hard time saying much about the event, as it just worked! The organizers Robert Kowalski, Robert Katzki and Robin Drexler had help from Hinrich Sager and Ekkehard Dörre. The latter two have helped to run numerous PHP Unconferences at the same place before. Consequently, everything ran as smoothly as usual, yet with nice tweaks from the new chefs. They just do it right. They just don’t run out of beer, their party is where all the parties take place, it is easy to get to the event, beamers and WIFI work, the food is tasty and healthy, they have weird soft drinks such as rhubarb, they serve first-class fair trade coffee, people are relaxed and friendly – they learned their lessons in the years before… More of this, please!

Credits to sponsors!

No such event, no such low ticket price without the sponsors. Thank you, …

Happy hacking!


29/07/2014 – Added official figures: 342 tickets sold, 294 appeared on Saturday.

PHP mysqlnd memory optimizations: from 49MB to 2MB

2014/04/10 - by admin - 3 Comments

Inspired by Antony, Andrey has implemented a memory optimization for the PHP mysqlnd library. Depending on your usage pattern and the actual query, memory used for result sets is lower and freed earlier to be reused by the PHP engine. In other cases, the optimization will consume about the same or even more memory. The additional choice is currently available with mysqli only.

From the network line into your script

Many wheels start spinning when mysqli_query() is called. All the PHP MySQL APIs/extensions (mysqli, PDO_MySQL, mysql) use a client library that handles the networking details and provides a C API to the C extensions. Any recent PHP will default to the mysqlnd library. The library speaks the MySQL Client Server protocol and handles the communication with the MySQL server. The actions behind a user’s mysqli_query() are sketched below.

The memory story begins in the C world when mysqlnd fetches query results from MySQL. It ends with passing those results to the PHP script.

PHP script mysqli extension/API mysqlnd library MySQL
*.php *.c
mysqli_query()  
  PHP_FUNCTION(mysqli_query)  
  MYSQLND_METHOD(query)  
  simple_command(COM_QUERY)  
  COM_QUERY
  store_result()  
  return result set  

The new memory optimization is for buffered result sets, as you get them from mysqli_query() or a sequence of mysqli_real_query(), mysqli_store_result(). With a buffered result set, the client fetches all query results into a local buffer as soon as they become available from MySQL. In most cases, this is the desired behaviour: the network line to MySQL becomes ready for a new command quickly, and the hard-to-scale server is relieved of the duty to keep all results in memory until a potentially slow client has fetched and released them.

The result buffering happens first at the C level inside the mysqlnd library. The buffer holds zvals. A zval is the internal representation structure for a plain PHP variable. Hence, think of the mysqlnd result buffer as a list of anonymous PHP variables.

PHP script mysqli extension/API mysqlnd library MySQL
*.php *.c
  store_result()  
  Buffer with zvals (MYSQLND_PACKET_ROW),
think: PHP variables
 

The default: reference and copy-on-write

When results are to be fetched from the mysqlnd internal buffers to a PHP script, the default behaviour of mysqlnd is to reference the internal buffer from the PHP script. When code like $rows = mysqli_fetch_all($res) is executed, first $rows gets created. Then, mysqlnd makes $rows reference the mysqlnd internal result buffers. MySQL results are not copied initially. Result set data is kept only once in memory.

PHP script mysqli extension/API mysqlnd library MySQL
*.php *.c
  store_result()  
  Buffer with zvals (MYSQLND_PACKET_ROW),
think: PHP variables
 
$rows = mysqli_fetch_all($res)  
  rows[n] &= result_buffer_zval[n]  

The reference approach works fine if you have code that follows the general pattern: query(), fetch() followed by implicit or explicit unset(), free().

A call to query() fills the internal buffer. Then, all rows are fetched into PHP variables. You may fetch all of them at once or read them row by row into the same array using a pattern such as while ($row = $res->fetch_assoc()) { ... }. As long as you do not modify $row and do not free $row explicitly (freeing may happen implicitly by overwriting $row in a loop), no data will be copied. Then, as the last step, you call free() to dispose of the buffered result set.


$res = mysqli_query(...);               /* create internal buffer */
$rows = $res->fetch_all(MYSQLI_ASSOC);  /* rows &= internal buffer */
unset($rows);                           /* remove references */
$res->free();                          /* free internal buffer */   

The memory saving from using references is gone if mysqlnd is forced to perform a copy-on-write. If you free the result set before freeing $rows, mysqlnd must copy the data into $rows before it can free the internal buffer; otherwise, $rows would point to freed memory.

$res = mysqli_query(...);               /* buffered results */
$rows = $res->fetch_all(MYSQLI_ASSOC);  /* rows &= internal buffer */
$res->free();                           /* free internal buffer: copy to rows */
unset($rows);                           /* free copied results */


The copy_on_write_saved and copy_on_write_performed statistics tell you what your code does.

When I say that memory is freed I mean that it is given back to the PHP engine. The PHP engine may or may not release it to the system immediately. If you are just after monitoring how mysqlnd behaves, check the statistics, or go for huge result sets to ensure the garbage collection kicks in to make effects visible.
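As an aside, PHP arrays show the same copy-on-write idea at the userland level. This is an analogy only – mysqlnd implements its tracking in C – but it illustrates why sharing is free until the first write:

```php
<?php
/* Userland analogy of copy-on-write: assigning an array does not
   copy it; the engine shares the storage until one side writes. */
$buffer = ['row1', 'row2', 'row3'];  // think: mysqlnd internal buffer
$rows   = $buffer;                   // no copy yet, storage is shared
$rows[0] = 'changed';                // the first write triggers the real copy
echo $buffer[0], "\n";               // the "buffer" is untouched: row1
```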

The price of copy-on-write: management overhead

Copy-on-write does not come for free: mysqlnd must track the variables that reference the internal buffer. This adds some memory overhead (zval**), and the internal tracking list must not point to nowhere. As long as the tracking list exists, the referenced user variables – here: the elements of $rows – must not be released. There is a kind of circular reference. The resulting overhead is twofold: there is an additional tracking list, and the references from elements of $rows may occupy memory until free() is called. But you save holding query results twice in memory. So the question is: what takes more memory, and when?

PHP script mysqli extension/API mysqlnd library MySQL
*.php *.c
  store_result()  
  Buffer with zvals (MYSQLND_PACKET_ROW)  
$rows = mysqli_fetch_all($res)  
  assign rows[n] &= result_buffer_zval[n]  
  remember rows[n] belongs to result_buffer_zval[n]  
  increase rows[n] reference counter  

The answer depends on your code and the size of the result sets. If you fetch a few thousand rows at most on your one PHP server, it does not matter much how mysqlnd manages the result sets. The impact of this implementation detail is small, and it takes time-consuming fine tuning to find the optimum. But for Antony it mattered: likely, he handles PHP at scale…

New: classical copy as an alternative

At times it is more efficient to bite the bullet and copy data from the internal buffer into user variables. It may sound counter-intuitive, but this can help to ‘save’ memory. ‘Save’ must not be measured only in absolute terms at a given point in time – that is merely what the figures below show, because it is easy to grasp. When done reading, ask yourself what it means to release memory early during the execution of one or many scripts: think of memory usage over time.

PHP script mysqli extension/API mysqlnd library MySQL
*.php *.c
  store_result()  
  Buffer with zvals (MYSQLND_PACKET_ROW)  
$rows = mysqli_fetch_all($res)  
  assign/copy rows[n] = result_buffer_zval[n]  

Every optimization is a trade: the COPY don’t do

The new always-copy policy can be enabled using mysqli_store_result(MYSQLI_STORE_RESULT_COPY_DATA). Let’s consider a case that seems to prove one cannot ‘save’ memory by duplicating data. The code runs a single SELECT that returns a result set of 200,000 rows; anything big enough to make the basic principles visible through memory_get_usage(true) works. Then, all rows are fetched using fetch_all() and the result set is released.

$link = new mysqli("127.0.0.1", "root", "", "test", 3307);
$link->real_query("SELECT * FROM test LIMIT 200000");
$res = $link->store_result(MYSQLI_STORE_RESULT_COPY_DATA);
$rows = $res->fetch_all(MYSQLI_ASSOC);
$res->free();

The diagram shows the memory usage reported by memory_get_usage(true) immediately after the function calls. The default behaviour is to make $rows reference (red, NO COPY) the internal buffer. The blue line shows the effect when $rows gets populated with copies. The initial memory usage of the reference approach is lower until free() is called. Now copy-on-write must be done and either approach uses the same amount of memory.

BTW, don’t get nervous about the huge figures. This is what happens if you create large PHP arrays…

fetch_all() + STORE_COPY: memory

Function runtimes measured with microtime(true) are as expected. The reference approach is a tad faster when $rows is to be populated but has some hidden costs at free() when the internal buffer reference lists are to be checked.

For this very script, with this very result set, version of MySQL and PHP and notebook, there was no significant difference in total runtime. Only a minor tendency of MYSQLI_STORE_RESULT_COPY_DATA being ever so slightly slower became visible. This meets expectations due to the additional memory copies and allocations.

fetch_all() + STORE_COPY: runtime

Counter example done, time to show the potential win

Copying result sets brings no significant win when you have many rows and use fetch_all(), when you have few rows at all (point SELECTs, paging with LIMIT, a few hundred rows overall), or when you have a few large rows with, for example, BLOB columns. In those cases, there is no need to bother about the flavour of the mysqlnd result set handling.

The code to try the copy optimization with should:

  • fetch many, rather small rows
  • and, should not use fetch_all()
  • or… is sloppy with free() (bad style, use mysqlnd statistics to identify such scripts)

Replace the fetch_all() with while ($row = $res->fetch_assoc()) in the 200k-rows result set example and you immediately see the potential of Antony’s proposal:

$link->real_query("SELECT * FROM test LIMIT 200000");
$res = $link->store_result(MYSQLI_STORE_RESULT_COPY_DATA);
while ($row = $res->fetch_assoc()) {
  ;
}
$res->free();

Peak memory reported for this code is 1808 KB (~2 MB) versus 50368 KB (~49 MB) for the referencing logic! But the runtime is about 10% slower. As with all such benchmarks: run your own! I am showing trivial microbenchmark observations only to highlight the principles behind them. I did not even bother to align the negative values reported by memory_get_usage().

fetch_all() + STORE_COPY: proper usage, memory

The lower peak memory usage is because the memory used for $row can now be released as the result set is being iterated.

Pseudo code to illustrate copy logic during fetch loop
PHP call PHP internals memory objects
store_result() Create internal_row_buffer[]
  • internal_row_buffer: n rows
$row = fetch_assoc() create $row, copy internal_row_buffer[0] into $row
  • internal_row_buffer: n rows
  • + copy of internal_row_buffer[0]
  • + zval $row
$row = fetch_assoc() free $row, copy internal_row_buffer[1] into $row
  • internal_row_buffer: n rows
  • – copy of internal_row_buffer[0] (previous contents of $row)
  • + copy of internal_row_buffer[1]

Obviously, with the copy technique, there is no reference from the mysqlnd library to $row because mysqlnd does not have to bother with copy-on-write and the like. $row contains a copy of the actual data from MySQL and no longer points into the mysqlnd internal result buffer. The table below illustrates the situation when data copies are avoided and references are used:

Pseudo code to illustrate reference logic during fetch loop
PHP call PHP internals memory objects
store_result() Create internal_row_buffer[], internal_reference_list[]
  • internal_row_buffer: n rows
  • internal_reference_list: 0 entries
$row = fetch_assoc() create $row, $row = &internal_row_buffer[0]; internal_reference_list[0] = $row;
  • internal_row_buffer: n rows
  • internal_reference_list: + 1 entry (first $row)
  • + zval $row
$row = fetch_assoc() $row = &internal_row_buffer[1]; internal_reference_list[1] = $row;
  • internal_row_buffer: n rows
  • internal_reference_list: 1 + 1 entries (first, second $row)
  • zval $row (first) + zval $row (second)
n-th $row = fetch_assoc()
  • internal_row_buffer: n rows
  • internal_reference_list: n entries (all $row)
  • n * zval $row (all)
unset($row); free_result($res) free(internal_row_buffer); free(internal_reference_list) empty

Don’t get too excited. The copy approach is not an ultimate solution. There are many factors to consider: the actual code, the actual statements and the size of their result sets, your server’s demands. And don’t even try to count bytes: some hundred or even thousand bytes here and there may not matter much. Write proper code (explicitly calling free()) and check your server’s reaction.

The key takeaway: try MYSQLI_STORE_RESULT_COPY_DATA

The key takeaway is: try MYSQLI_STORE_RESULT_COPY_DATA if you have result sets with many rows. The copy approach will release memory faster to PHP for reuse, and thus peak memory usage may be significantly lower. Again, be warned that I have chosen my examples carefully to show a hefty difference when doing point-in-time memory measurements. There may be a broad grey area for which you need to consider memory usage over time, and possibly over many scripts, to decide whether copy or reference fits your bill.

Ready for use?

Yes, the code is ready for use! The patch was running in production for a while and I failed to break it with tests too.

A new PHP configuration setting mysqlnd.fetch_data_copy has been introduced. Setting mysqlnd.fetch_data_copy=1 makes mysqlnd use and enforce the copy approach for all buffered result sets.
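For reference, the corresponding php.ini fragment might look like this (a config sketch; mysqlnd.collect_statistics is optional but lets you watch the copy_on_write_* counters mentioned earlier):

```ini
; php.ini
mysqlnd.collect_statistics=1  ; optional: expose mysqlnd statistics
mysqlnd.fetch_data_copy=1     ; enforce the copy approach for buffered results
```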

Not for PDO yet

The copy approach is not compatible with PDO_MySQL, though. PDO_MySQL relies on the zval reference logic and must be updated first.

Credits to A + A = A+ ?!

The manual has got a few words on it too, but not substantially more than the above. The changes should bubble through build, staging and mirror servers within a few days.

Happy hacking!

@Ulf_Wendel Follow me on Twitter

The performance penalty of the early MySQL Fabric support for PHP

2014/03/13 - by admin - 1 Comment

PECL/mysqlnd_ms 1.6 is currently being modified to support sharding and fully automatic server and client failover when using MySQL Fabric (slides) to manage a farm of MySQL servers. PECL/mysqlnd_ms is a mostly transparent load balancer that works with all PHP MySQL APIs (PDO_MySQL, mysqli, …). The idea is that if, for example, a MySQL server fails, the plugin talks to MySQL Fabric to learn about alternative servers that Fabric has provisioned automatically. This “talks to” implies a performance penalty for applications. One worth looking at, to understand it. One worth looking at, to have a success story once the early implementation is gone and replaced with a proper one ;-).

Behind the scenes…

How exactly a “Fabric aware” driver or application talks to Fabric is implementation dependent. Figures given for PHP must not be used to extrapolate behaviour of Connector/J or Connector/Python. Only remarks about Fabric itself apply to all.

Let’s assume you want to use MySQL Fabric and PHP for sharding. Fabric takes care of all the server side stuff: splitting, merging, monitoring shards and so forth. The PHP application ensures that queries end up on the appropriate shards by hinting the driver which shard to use. In the PHP case, the “driver” is the PECL/mysqlnd_ms plugin for mysqlnd. The plugin exports a function mysqlnd_ms_select_shard() for hinting it.

$mysqli = new mysqli("myapp", "user", "password", "database");
mysqlnd_ms_select_shard($mysqli, "mydb.mytable", "key");
$mysqli->query("INSERT INTO mytable(col1) VALUES ('abc')");

This tiny snippet triggers a huge machinery: from your application it goes to the plugin. The plugin then calls Fabric via XML RPC over HTTP and waits for a reply. Once Fabric has replied, the plugin makes the connection handle point to the shard.

Client   Fabric
mysqlnd_ms_select_shard(link, table, key)  
PECL/mysqlnd_ms *.c XML RPC over HTTP: sharding.lookup_server(table, key, LOCAL) ->  
  HTTP worker thread
  Executor
  <- XML reply HTTP worker thread
PECL/mysqlnd_ms: make $link use shard
mysqli_query($link, …)

The hotspots

Switching a connection from one server to another takes some effort. The current implementation simply replaces a plugin-internal list of servers. This is a very fast operation. No new MySQL connection is opened yet: by default, lazy connections are used and the connect to the shard is delayed until the application issues a query. Let’s consider this a cheap step that can be marked green below.

Client   Fabric
mysqlnd_ms_select_shard(link, table, key)  
PECL/mysqlnd_ms *.c XML RPC over HTTP: sharding.lookup_server(table, key, LOCAL) ->  
  HTTP worker thread

Asking Fabric for the set of one master and any number of additional slaves that make up a shard is expensive. Upon every call to mysqlnd_ms_select_shard(), PECL/mysqlnd_ms opens an HTTP connection to Fabric to make an XML remote procedure call. Future versions of the plugin will make fewer calls, but that’s not the point. The expensive operation is the HTTP connection established using PHP streams. Say you do one INSERT on a shard; then the INSERT carries significant overhead. Recall, I am discussing implementation details that must and will be tuned…

  • mysqlnd_ms_select_shard()
    • HTTP connect and round-trip to Fabric for XML RPC
    • switch user connection handle to shard (cheap)
  • mysqli_query()
    • connect MySQL shard
    • if needed and possible align connection state

The overhead can be measured using the plugin’s performance statistics. Set the PHP configuration directive mysqlnd_ms.collect_statistics=1 to enable the collection of statistics. Here’s the outline of a script that helps you detect what the plugin does and where it spends time in a MySQL Fabric sharding scenario. Please see the manual for how to set up the plugin to use Fabric.

$stats = mysqlnd_ms_get_stats();
$start = microtime(true);

$link = mst_mysqli_connect("myapp", $user, $passwd, $db, $port, $socket);
mysqlnd_ms_fabric_select_shard($link, "fabric_sharding.test", 1);
$now = mysqlnd_ms_get_stats();
foreach ($stats as $k => $v) {
  if ($now[$k] != $v) {
    printf("%s: %s -> %s\n", $k, $v, $now[$k]);
  }
}
$stats = $now;

var_dump($link->query("UPDATE test SET id = 1 WHERE id = 1"));
$now = mysqlnd_ms_get_stats();
foreach ($stats as $k => $v) {
  if ($now[$k] != $v) {
    printf("%s: %s -> %s\n", $k, $v, $now[$k]);
  }
}

printf("Runtime %.fs\n", microtime(true) - $start);

When run in a VM on my notebook it prints something like this. The UPDATE on a shard took some 0.038 seconds in total. Asking Fabric which shard to use took 31535 / 1000000 = 0.0315 seconds. 450 bytes were transferred from Fabric to the plugin to learn about the shard.

bool(true)
fabric_sharding_lookup_servers_success: 0 -> 1
fabric_sharding_lookup_servers_time_total: 0 -> 31535
fabric_sharding_lookup_servers_bytes_total: 0 -> 450
bool(false)
use_master: 0 -> 1
use_master_guess: 0 -> 1
lazy_connections_master_success: 0 -> 1
Runtime 0.038586s
done!
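A quick back-of-the-envelope check of the figures above (plain arithmetic; the lookup statistic reports time in microseconds):

```php
<?php
// Figures taken from the sample run above.
$lookup_us = 31535;     // fabric_sharding_lookup_servers_time_total
$total_s   = 0.038586;  // total runtime printed by the script

$lookup_s = $lookup_us / 1000000;        // 0.031535s spent on the XML RPC
$share    = 100 * $lookup_s / $total_s;  // share of the total runtime

printf("Fabric lookup: %.4fs, %.0f%% of the runtime\n", $lookup_s, $share);
```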

Any slowdown you like can be provoked: 83x slower!

On my VM, a loop that executes mysqlnd_ms_select_shard() followed by an UPDATE 2000 times runs 83x slower than the UPDATE alone when connecting to the shard without Fabric (but with the plugin loaded). The Fabric code takes 23.4 seconds in total and wastes 22.9 seconds on XML RPC. A plain UPDATE on a non-Fabric connection to the shard takes only 0.28 seconds! The difference between 23.4 – 22.9 = 0.5 and 0.28 seconds is down to the 2000 connects done by the plugin as part of the server/shard switch.
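The 83x figure follows directly from those numbers; a sketch of the arithmetic:

```php
<?php
// Loop of 2000x mysqlnd_ms_select_shard() + UPDATE, figures from above.
$fabric_total_s = 23.4;   // total runtime via Fabric
$fabric_rpc_s   = 22.9;   // of which: XML RPC over HTTP
$plain_s        = 0.28;   // 2000x UPDATE without Fabric

// 23.4 / 0.28 is a bit more than 83, hence "83x slower".
printf("Slowdown: ~%dx\n", (int)floor($fabric_total_s / $plain_s));
// What remains after subtracting the RPC time: the 2000 shard connects.
printf("Non-RPC remainder: %.1fs vs. %.2fs plain\n",
       $fabric_total_s - $fabric_rpc_s, $plain_s);
```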

The Fabric hotspot

It certainly does not take 22.9 seconds to send 2000 HTTP replies of 450 bytes each. Let’s ask Fabric to show what it does by setting the logging level to DEBUG in the Fabric configuration file.

[logging]
level = DEBUG

This is the result. For every call to mysqlnd_ms_select_shard(), PHP performs one XML RPC, and each XML RPC triggers more than 10 SQL queries within Fabric!

[DEBUG] 1394661746.143667 - XML-RPC-Server - Enqueuing request (<socket._socketobject object at 0x1e718a0>) from (('127.0.0.1', 53951)) through thread (<SessionThread(XML-RPC-Session-4, started daemon 140595192383232)>).
[DEBUG] 1394661746.143837 - XML-RPC-Session-4 - Processing request (<socket._socketobject object at 0x1e718a0>) from (('127.0.0.1', 53951)) through thread (<SessionThread(XML-RPC-Session-4, started daemon 140595192383232)>).
[DEBUG] 1394661746.144319 - XML-RPC-Session-4 - Started command (LookupShardServers).
[DEBUG] 1394661746.144816 - XML-RPC-Session-4 - Statement (SELECT sm.shard_mapping_id, table_name, column_name, type_name, global_group FROM shard_tables as sm, shard_maps as smd WHERE sm.shard_mapping_id = smd.shard_mapping_id AND table_name = %s), Params(('fabric_sharding.test',)).
[DEBUG] 1394661746.146071 - XML-RPC-Session-4 - Statement (SELECT sr.shard_mapping_id, sr.lower_bound, s.shard_id FROM shard_ranges AS sr, shards AS s WHERE %s >= CAST(lower_bound AS SIGNED) AND sr.shard_mapping_id = %s AND s.shard_id = sr.shard_id ORDER BY CAST(lower_bound AS SIGNED) DESC LIMIT 1), Params(('1', 1)).
[DEBUG] 1394661746.147233 - XML-RPC-Session-4 - Statement (SELECT shard_id, group_id, state FROM shards WHERE shard_id = %s), Params(('1',)).
[DEBUG] 1394661746.148331 - XML-RPC-Session-4 - Statement (SELECT group_id, description, master_uuid, status FROM groups WHERE group_id = %s), Params(('sharding1_shard1',)).
[DEBUG] 1394661746.149338 - XML-RPC-Session-4 - Statement (SELECT server_uuid, server_address, user, passwd, mode, status, weight FROM servers WHERE group_id = %s), Params(('sharding1_shard1',)).
[DEBUG] 1394661746.150462 - XML-RPC-Session-4 - Statement (SELECT @@GLOBAL.SERVER_UUID as SERVER_UUID), Params(()).
[DEBUG] 1394661746.151100 - XML-RPC-Session-4 - Statement (SELECT @@GLOBAL.SERVER_ID as SERVER_ID), Params(()).
[DEBUG] 1394661746.151648 - XML-RPC-Session-4 - Statement (SELECT @@GLOBAL.VERSION as VERSION), Params(()).
[DEBUG] 1394661746.152203 - XML-RPC-Session-4 - Statement (SELECT @@GLOBAL.GTID_MODE as GTID_MODE), Params(()).
[DEBUG] 1394661746.152757 - XML-RPC-Session-4 - Statement (SELECT @@GLOBAL.LOG_BIN as LOG_BIN), Params(()).
[DEBUG] 1394661746.153374 - XML-RPC-Session-4 - Statement (SELECT @@GLOBAL.READ_ONLY as READ_ONLY), Params(()).
[DEBUG] 1394661746.153820 - XML-RPC-Session-4 - Connected to server with uuid (80716d72-9302-11e3-817c-000c299b2a06), server_id (3307), version (5.6.16-log), gtid (True), binlog (True), read_only (False).
[DEBUG] 1394661746.153932 - XML-RPC-Session-4 - Disconnecting from server with uuid (80716d72-9302-11e3-817c-000c299b2a06), server_id (3307), version (5.6.16-log), gtid (True), binlog (True), read_only (False).
[DEBUG] 1394661746.154039 - XML-RPC-Session-4 - Finished command (LookupShardServers).
[DEBUG] 1394661746.154313 - XML-RPC-Session-4 - Finishing request (<socket._socketobject object at 0x1e718a0>) from (('127.0.0.1', 53951)) through thread (<SessionThread(XML-RPC-Session-4, started daemon 140595192383232)>).

Better not to compare that with the tiny PHP snippet used to select a shard…

Client   Fabric
mysqlnd_ms_select_shard(link, table, key)  
PECL/mysqlnd_ms *.c XML RPC over HTTP: 22.9 seconds  
  > 10 SQL queries against Fabric backing store
Connect to shard: 0.2s
UPDATE: 0.3s

Ghosts from the past

Of course, we will fix that for PHP even before releasing an alpha! Of course, that will make a nice story with a catchy “80x faster” title!

It is not necessary for a client to contact Fabric every time a shard server needs to be identified. There are other XML remote procedure calls that can be used. The reason why PHP initially went this way is simple: the other XML RPCs were not ready when the code was first written. Future versions (like the other drivers) will ask Fabric only once for a complete listing of all shards. The RPC overhead will then vanish in the overall runtime.

Client   Fabric
mysqli_connect(“fabric”, …)  
PECL/mysqlnd_ms *.c XML RPC over HTTP: get all shards ->  
  HTTP worker thread
  Executor
  <- XML reply HTTP worker thread
mysqlnd_ms_select_shard(link, table, key)  
PECL/mysqlnd_ms: make $link use shard
mysqli_query($link, …)
mysqlnd_ms_select_shard(link, table, key)  
PECL/mysqlnd_ms: make $link use shard
mysqli_query($link, …)

How big the performance impact of such a driver is cannot be answered in general. It depends mainly on how often the once-fetched shard information can be reused and how expensive the one RPC is in comparison to the normal queries issued. Tips for benchmarking PHP have been given before – please, run your own tests.

Happy hacking!

@Ulf_Wendel Follow me on Twitter

PS: We recently moved some Connector manuals out of the main MySQL reference manual to their own places. For PHP we mirror the MySQL bits into the PHP manual, as ever. The mirrored PHP MySQL manual can be downloaded as a PDF from dev.mysql.com. It already shows some PECL/mysqlnd_ms Fabric notes, whereas the PHP mirrors have not yet caught up to the PHP documentation staging server – http://docs.php.net/manual/en/book.mysqlnd-ms.php.

The early MySQL Fabric sharding support for PHP

2014/03/08 - by admin - 0 comments

The MySQL Fabric framework brings two major features: automatic client- and server-side failover and sharding. The manual hints that you need a “Fabric aware driver” for this, but it does not list one for PHP. First, you don’t necessarily need new drivers! Second, the "driver" for PHP is the current development version of the PECL/mysqlnd_ms replication and load balancing plugin. The plugin covers the sharding but not the failover/HA quickstart example of Fabric. Below: how the plugin works and why you should not use the plugin – yet.

Partial replication: the unlimited mode

At some point, divide-and-conquer is the only known strategy to scale database replication. A huge data set must be divided into smaller sets that are distributed over many servers. There are middleware, in-core and hybrid architectures for such a system. The Fabric approach could be described as a middleware approach: a middleware manages servers and data distribution, and clients talk to the middleware to learn about servers. Such an architecture minimizes the dependencies on the database. Power users can easily adapt the middleware to their needs.

PECL/mysqlnd_ms now also handles the "learn about servers/shards" part in the background. It talks to Fabric through XML RPC over HTTP.

MySQL Fabric PHP application
mysqli/PDO_MySQL
<-> learn about servers/shards PECL/mysqlnd_ms
| |
  • Provision: masters and slaves
  • Monitor: load, health/failover
  • Balance: clone, merge, split shards
  • Connect and query
| |
MySQL servers

The client view: all the standard MySQL APIs

PECL/mysqlnd_ms is a client-side proxy which tries to hide as much of the complexity of using any kind of MySQL cluster (MySQL Replication, MySQL Cluster, 3rd party, MySQL Fabric sharding/HA) from the developer. This includes tasks such as load balancing, read-write splitting, failover and so forth. In the most basic case it is entirely transparent on the API level (mysqli, PDO_MySQL). Given the appropriate configuration, this is a load balanced connection, the SELECT goes to the slaves, the DROP to the master…

$link = new mysqli("myapp", "user", "password", "db");
$link->query("SELECT 1");
$link->query("SELECT 2");
$link->query("DROP TABLE IF EXISTS test");


What happens is that the mysqli, respectively PDO_MySQL, extensions call functions in the mysqlnd library. PECL/mysqlnd_ms plugs into the mysqlnd library to hook these calls. If, for example, mysqli_connect() tries to open a connection to the host myapp, PECL/mysqlnd_ms captures the call and checks its config for an entry named myapp. Let the entry be for a MySQL Replication cluster. Then, later, when mysqli_query() is executed, the plugin inspects the query and picks a master or slave from the config to run the query on. Connecting to the actual servers is (mostly) transparent from an application user perspective, as you can see from the code example.

userspace – *.php mysqli_connect(…)
inside PHP – ext/mysqlnd/*.c mysqlnd_connect(…)
inside PHP – ext/mysqlnd_ms/*.c connect_hook(…) 1) read server list from file: fopen(file://…)

The PHP manual has all the details, including the pitfalls and why you can overrule any automatic decision the plugin makes.
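For illustration, a minimal plugin configuration file for the myapp entry used above could look like the following sketch (hosts and entry names are placeholders; see the manual for the full format):

```json
{
  "myapp": {
    "master": {
      "master_0": { "host": "localhost", "port": 3306 }
    },
    "slave": {
      "slave_0": { "host": "192.168.2.27", "port": 3306 },
      "slave_1": { "host": "192.168.2.28", "port": 3306 }
    }
  }
}
```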

Fabric: similar, yet very different

The plugin works much the same with MySQL Fabric. The major difference is that instead of reading a list of servers from a local configuration file, the plugin now connects to a Fabric host to get the list. This makes no difference with regard to the application code itself. When exactly the plugin calls Fabric is work in progress.

userspace – *.php mysqli_connect(…)
inside PHP – ext/mysqlnd/*.c mysqlnd_connect(…)
inside PHP – ext/mysqlnd_ms/*.c connect_hook(…) 1) read config for Fabric host list: fopen(file://…)
2) read server list from Fabric: fopen(http://…)

Because Fabric monitors and manages the MySQL servers in a farm, it always reports a fresh snapshot of the servers available. In case a server fails and Fabric decides to replace it, the client (here: PECL/mysqlnd_ms) will learn about it with the next RPC. When using local configuration files, the client will not recognize new or failed servers in the cluster until the configuration file has been deployed. The XML RPC over HTTP call from the client to Fabric to fetch the server list adds overhead, but it ensures fresh information. Choose between runtime overhead and deployment complexity as you like…

Measuring and taming the XML RPC overhead

To help with the choice, we’ve begun to add new performance statistics, which are subject to change.

Related to Fabric command sharding.lookup_servers
fabric_sharding_lookup_servers_success Number of successful RPC calls. A call is considered successful if any Fabric host could be reached, a message has been sent and a reply has been received.
fabric_sharding_lookup_servers_failure Number of failed RPC calls.
fabric_sharding_lookup_servers_time_total Total time spent (ms).
fabric_sharding_lookup_servers_bytes_total Total bytes received.
fabric_sharding_lookup_servers_xml_failure How often the plugin failed to parse the reply from Fabric. Currently, the plugin cannot distinguish between a valid error reply and an erroneous one.

To avoid long wait periods for Fabric, there is also a new timeout setting for the plugin configuration. In case of a timeout, you end up with no server list. Then you cannot do much but the most important thing: tell the impatient user about the problem before he begins hitting the browser’s reload button like crazy.

{"myapp":{"fabric":{"hosts":[{"host":"127.0.0.1","port":8080}],"timeout":2}}}


Setting a timeout has the same effect as setting a timeout for a PHP stream in userland. Judging from a quick glance at the streams code, my uneducated guess is that it sets a timeout for connect and read but not for write. Whatever it does is beyond the control of PECL/mysqlnd_ms.

Use case supported today: sharding

Out of the two major use cases for Fabric, the PHP plugin so far covers only the sharding one (in parts). Fabric has opted for a design where the clients have to pass in shard keys explicitly. The client has to ask Fabric for a list of servers responsible for a given table and key. Then, Fabric replies and the client picks the appropriate server to run the query on. The MySQL manual and various blog postings cover the details of the sharding logic. Johannes gives it a quick walk-through and shows the following PHP code (comments removed):

$c = new mysqli("test", "root", "", "test");

mysqlnd_ms_fabric_select_global($c, "test.fabrictest");
var_dump($c->query("CREATE TABLE fabrictest (id INT NOT NULL)"));

mysqlnd_ms_fabric_select_shard($c, "test.fabrictest", 10);
var_dump($c->query("INSERT INTO fabrictest VALUES (10)"));

mysqlnd_ms_fabric_select_shard($c, "test.fabrictest", 10010);
$r = $c->query("SELECT * FROM fabrictest WHERE id = 10");
var_dump($r->fetch_row());

The two functions mysqlnd_ms_fabric_select_global() and mysqlnd_ms_fabric_select_shard() encapsulate the procedure of asking Fabric for the shards to use for the table test.fabrictest. They also tell Fabric what you want to do: perform DDL operations (here: CREATE TABLE) or manipulate data associated with a shard key.

This short snippet calls Fabric three times. It is not hard to predict that this is a potential bottleneck – use the statistics to measure it. Caching is planned for the future.

Open ends, pitfalls, plugin logic breaks

The Fabric code in the plugin is almost untested. It was developed against a pre-pre-release of Fabric. Statistics are bells and whistles compared to the – long ago – frequently changing XML formats. Meanwhile, Fabric is stabilizing and we can look into a better integration with the rest of the 25k lines of the plugin’s C code.

Here are two examples of details not yet taken care of. Fabric itself does not (yet) bother much about transactions, but PECL/mysqlnd_ms has a transaction stickiness feature that prevents switching servers in the middle of a transaction. The plugin tries to detect transaction boundaries and will not load balance between servers before the current transaction has ended. What should the plugin do in this case?

$c->begin_transaction();
mysqlnd_ms_fabric_select_shard($c, "test.fabrictest", 10);
var_dump($c->query("INSERT INTO fabrictest VALUES (10)"));

mysqlnd_ms_fabric_select_shard($c, "test.fabrictest", 10010);
$r = $c->query("SELECT * FROM fabrictest WHERE id = 10");
var_dump($r->fetch_row());


It is clearly a user error to start a transaction and then switch between shards. What should the plugin do if transaction stickiness is set? Shouldn't we at least warn the user, if we can? I've pushed a first attempt at a warning mode today.

{"myapp":{"fabric":{"hosts":[{"host":"127.0.0.1","port":8080}],"trx_warn_serverlist_changes":1}}}

The Quality-of-Service filter likely needs a couple of tweaks too.

There are just too many cases not tested yet. For example, I would not be surprised if the following code failed without a useful message. In the current implementation, mysqli_connect() does not actually connect to Fabric; the server lists used by the plugin will be empty and it will bark…

$c = new mysqli("fabric", "root", "", "test");
$c->query("SELECT 1");
mysqlnd_ms_fabric_select_global($c, "test.fabrictest");


Certainly, this is a minor issue. Be warned that there could be more and be warned about the RPC overhead.

The plugin has 26752 lines of C code and 47481 lines of .phpt test code. I guess we have to push it to 50kloc until we can release a PECL/mysqlnd_ms 1.6 alpha. Not sure if we ever got a bug report between alpha and GA; in any case, I would not want to change that…

Happy hacking!

@Ulf_Wendel Follow me on Twitter

PS: I've done a mysqlfabric help today using the latest (internal) development version. There was a proper help message, not an error telling me my cry for help was not understood! Everything around Fabric is a fast moving target.

Using MySQL Fabric from any programming language

2014/03/04 - by admin - 5 Comments

MySQL Fabric is a framework for MySQL Replication high availability, automatic failover and sharding. Technically, a MySQL Fabric daemon monitors a set of MySQL servers and takes appropriate actions upon failure. Clients use Fabric aware drivers to learn about failed servers and shards and to distribute queries accordingly. Simple to understand, simple to sell, simple to raise false expectations and simple to fail [, dear Sales]. With the usual blog posts telling only the story of the first three sentences, major parts of the story are passed over in silence.

Development preview = announcement of a vision != ready

Your first challenge will be to find the documentation for the MySQL Fabric development preview. From the documentation overview page, it takes three clicks down to the server-side documentation for Fabric:

  1. MySQL Workbench
  2. MySQL Utilities 1.4+ (not the older one!)
  3. MySQL Fabric

You had better not start your search in the MySQL Reference Manual under High Availability and Scalability. Fabric, a command line utility, is well hidden inside the documentation of a GUI tool. Eventually, you may find a download and attempt to install Fabric using an install prefix.


~/mysql-utilities-1.4.1/install/bin # ./mysqlfabric manage start
~/mysql-utilities-1.4.1/install/bin # Traceback (most recent call last):
  File "./mysqlfabric", line 23, in <module>
    from mysql.fabric.services import (
ImportError: No module named mysql.fabric.services
~/mysql-utilities-1.4.1/install/bin # export PYTHONPATH=/home/nixnutz/ftp/mysql-utilities-1.4.1/install/lib/python2.7/site-packages/

80% of the development community is out, true?

Pretty much all blogs and tutorials claim that Fabric aware drivers must be used. According to the manual, such drivers exist for Python and Java only. This covers an estimated 20% of the software development communities. Let’s cry for help:


~/mysql-utilities-1.4.1/install/bin # ./mysqlfabric help
Usage: %fabric <group> <cmd> [<option> ...] arg ...

mysqlfabric: error: Error (dispatch() takes exactly 3 arguments (1 given)).
Wrong number of parameters were provided for command (manage help).

~/mysql-utilities-1.4.1/install/bin # ./mysqlfabric help help help
Command (help, help) was not found.

Being among the 20% of privileged Python or Java users, you may not bother until you browse the etc/ directory for configuration files. Here you will find something seemingly related to PHP's Doctrine:


~/mysql-utilities-1.4.1/install/bin # ls -la ../etc/mysql/
total 32
drwxr-xr-x 2 nixnutz users  4096 11. Feb 16:34 .
drwxr-xr-x 3 nixnutz users  4096 11. Feb 16:33 ..
-rw-r----- 1 nixnutz users   119 11. Feb 16:34 default_plugins.cnf
-rw-r--r-- 1 nixnutz users   527 11. Feb 16:34 fabric.cfg
-rw-r--r-- 1 nixnutz users 13132 11. Feb 16:34 mysql-fabric-doctrine-1.4.0.zip

There is something for the PHP driver but it is not documented. Why there is something for a PHP application before the driver part is ‘done done’ may make some scratch their heads.

Putting expectations straight and low

Any project should state what is included and what is not. Fabric's vision should be clear and bright. Please, see the slides.

But Fabric is available as a development preview release only. Core functionality exists. The examples from the manual work for me, but no other scenarios do. For example, mysqlfabric group import_topology lacks documentation and mysqlfabric group promote blog_repl fails for me. Likely, the latter cannot parse the GTID histories from my replication setup:


[DEBUG] 1393921605.845434 - Executor-3 - Start executing function: get_num_gtid(('0d4adf79-9329-11e3-8278-000c299b2a06:1-2,\n1113b4b6-9326-11e3-8264-000c299b2a06:8-9:13:40-45:48-54:56-58:60-62:64-66:69-71:73-74:76-78:80-82:84-85:88-90:92-93:95-98:100-101:103-106:109:111:115-118:121:123:127-130:132-135:138-140:143-149:151-153:155-157:159-161:164-166:168-169:171-173:175-177:179-180:183-185:187-188:190-193:195-196:198-201:204:206:210-213:216:218:222-225:227-230:232-234:236-238:241-247:249-251:253-255:257-259:262-264:266-267:269-271:273-275:277-278:281-283:285-286:288-291:293-294:296-299:302:304:308-311:314:316:320-323:325-328:331-337:339-341:343-345:347-349:352-354:356-357:359-361:363-365:367-368:371-373:375-376:378-381:383-384:386-389:392:394:398-401:404:406:410-413:415-418:420-422:424-426:428-430:432-434:436-438:441-447:449-451:453-455:457-459:462-464:466-467:469-471:473-475:477-478:481-483:485-486:488-491:493-494:496-499:502:504:508-511:514:516:520-523:525-528:531-536:539:541:545-549:551-553:555-557:559-561:564-566:568-569:571-573:575-577:579-580:583-584:586-587:589-592:594-595:597-600:603:605:609-612:615:617:621-624:626-629:632-637:640:642:646-650:652-654:656-658:660-662:665-667:669-670:672-674:676-678:680-681:684-685:687-688:690-693:695-696:698-701:704:706:710-713:716:718:722-725:727-732:735-740:743:745:749-753:755-757:759-761:763-765:768-770:772-773:775-777:779-781:783-784:787-789:791-792:794-797:799-800:802-805:808:810:814-817:820:822:826-829:831-834:836-838:840', None), {}).
[DEBUG] 1393921605.845557 - Executor-3 - Error executing function: get_num_gtid.

Developers life without Fabric aware drivers

Fabric aware drivers are part of the vision; they contribute to the developer experience. As you are waiting for things to mature and Fabric aware drivers to enter 'done done' state, you can try to write a Fabric aware application without a special driver. This may be required in the long run as well: MySQL does not provide a driver for each and every language. Also, if you want to use an old version of a driver, or gather some background information about how to debug a driver, knowing what "Fabric aware" means is a must.

A Fabric aware PHP [C/C++, Perl, Ruby, …] application

The Fabric daemon and Fabric aware drivers/applications communicate with each other through XML RPC over HTTP, just like the mysqlfabric command line utility does. The RPC interface can be used, for example, to learn about running and failed servers and their roles. You can look up (logical) server groups, shard mappings, individual shards and so forth. Fabric aware drivers try to do these operations in the background in a way that is mostly transparent for the application. A Fabric aware application needs to do the XML RPC itself.

Fabric aware driver/application communicates with (XML RPC over HTTP) -> Fabric
mysqlfabric command line utility
|
manages
|
Farms of MySQL Servers

The list of RPC commands is not documented, but you can ask the mysqlfabric utility for a list of commands:

# ./mysqlfabric list-commands
group activate                   Activate a group.
group import_topology            Try to figure out the replication topology and import it into the state store.
group deactivate                 Deactivate a group.
group create                     Create a group.
group remove                     Remove a server from a group.
group add                        Add a server into group.
group lookup_servers             Return information on existing server(s) in a group.
group check_group_availability   Check if any server within a group has failed and report health information.
group destroy                    Remove a group.
group demote                     Demote the current master if there is one.
group promote                    Promote a server into master.
group lookup_groups              Return information on existing group(s).
group description                Update group's description.
manage list-commands             List the possible commands.
[...]


Generally speaking, each command listed maps to one XML RPC call. The XML RPC call parameters and – possibly – some help can be obtained using mysqlfabric help <command>.

# ./mysqlfabric help group lookup_groups 
group lookup_groups(group_id): Return information on existing group(s).


Often, this is enough input to craft the XML required for the corresponding XML RPC call:

#  cat lookup_groups.xml
<?xml version="1.0" encoding="iso-8859-1"?>
<methodCall>
  <methodName>group.lookup_groups</methodName>
</methodCall>


Give it a try and POST the XML to the Fabric server, which listens on port 8080 by default, or whatever port you configured during the installation:

# curl  -H 'Content-Type: text/xml' -d @./lookup_groups.xml http://127.0.0.1:8080
<?xml version='1.0'?>
<methodResponse>
<params>
<param>
<value><array><data>
<value><boolean>1</boolean></value>
<value><string></string></value>
<value><array><data>
<value><array><data>
<value><string>blog_repl</string></value>
</data></array></value>
</data></array></value>
</data></array></value>
</param>
</params>
</methodResponse>


The XML RPC was successful and Fabric has returned a list of all configured high availability groups (see also the Fabric quickstart). Compare the XML reply with the reply from the mysqlfabric command line utility. The information is identical but presented differently.

# ./mysqlfabric  group lookup_groups 
Command :
{ success     = True
  return      = [['blog_repl']]
  activities  = 
}


Teach your application to issue an XML RPC call, parse the result and you have a Fabric aware application…
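If you do not want to hand-craft and parse the XML yourself, PHP's (experimental) xmlrpc extension can encode the request and decode the reply for you. A sketch, assuming – as above – a Fabric daemon listening on 127.0.0.1:8080:

```php
<?php
/* Sketch: call group.lookup_groups through PHP's xmlrpc extension
   instead of crafting the XML by hand. Assumes a Fabric daemon
   listening on 127.0.0.1:8080. */
$request = xmlrpc_encode_request("group.lookup_groups", array());

$context = stream_context_create(array('http' => array(
  'method'  => 'POST',
  'header'  => 'Content-Type: text/xml',
  'content' => $request,
)));

$reply  = file_get_contents("http://127.0.0.1:8080/", false, $context);
$result = xmlrpc_decode($reply);
/* $result[0] holds the success flag, $result[2] the group list,
   mirroring the raw XML reply shown above */
var_dump($result);
```

Whether you prefer the extension or SimpleXML is a matter of taste; the extension spares you the XPath gymnastics at the price of an extra dependency.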

Fabric aware application: HA group support

Assume you are using Fabric for nothing but monitoring a simplistic group of three MySQL Replication servers, the case sketched in the Fabric quickstart. There is one master and two slaves, no further constraints or rules exist.

Fabric
monitors, handles failover/switchover, …
|
Fabric high availability group ‘blog_repl’
Master/Primary
127.0.0.1:3309
| |
Slave/Secondary Slave/Secondary
127.0.0.1:3308 127.0.0.1:3310

If an application wants to connect to such a logical Fabric high availability group, it first needs to know the name of the group. Assume you have configured only one group and the group's name 'blog_repl' is known to the application. Then, all a most basic Fabric aware application would want to know from Fabric is:

  • Which servers belong to the high availability group ‘blog_repl’?
  • Which one is the master (writes), which ones are the slaves (reads)?

The Fabric command which answers these questions is:

 # ./mysqlfabric  group lookup_servers blog_repl
Command :
{ success     = True
  return      = [['28acd755-a37b-11e3-ace4-000c299b2a06', '127.0.0.1:3308', False, 'SECONDARY'], ['c3d24cc0-a37a-11e3-ace1-000c299b2a06', '127.0.0.1:3310', False, 'SECONDARY'], ['f159bc2e-a37a-11e3-ace2-000c299b2a06', '127.0.0.1:3309', True, 'PRIMARY']]
  activities  = 
}


Using the help output of the command, you can construct the XML request that a Fabric aware application has to send to get the questions answered:

# cat lookup_servers.xml
<?xml version="1.0" encoding="iso-8859-1"?>
<methodCall>
  <methodName>group.lookup_servers</methodName>
  <params>
    <param>
      <value><string>blog_repl</string></value>
    </param>
  </params>
</methodCall>


The verbose answer from Fabric is below. Your application must parse the XML and extract the relevant information to learn about the servers and their roles in the high availability group ‘blog_repl’:

# curl  -H 'Content-Type: text/xml' -d @./lookup_servers.xml http://127.0.0.1:8080
<?xml version='1.0'?>
<methodResponse>
<params>
<param>
<value><array><data>
<value><boolean>1</boolean></value>
<value><string></string></value>
<value><array><data>
<value><array><data>
<value><string>28acd755-a37b-11e3-ace4-000c299b2a06</string></value>
<value><string>127.0.0.1:3308</string></value>
<value><boolean>0</boolean></value>
<value><string>SECONDARY</string></value>
</data></array></value>
<value><array><data>
<value><string>c3d24cc0-a37a-11e3-ace1-000c299b2a06</string></value>
<value><string>127.0.0.1:3310</string></value>
<value><boolean>0</boolean></value>
<value><string>SECONDARY</string></value>
</data></array></value>
<value><array><data>
<value><string>f159bc2e-a37a-11e3-ace2-000c299b2a06</string></value>
<value><string>127.0.0.1:3309</string></value>
<value><boolean>1</boolean></value>
<value><string>PRIMARY</string></value>
</data></array></value>
</data></array></value>
</data></array></value>
</param>
</params>
</methodResponse>

XML RPC from PHP and parsing the reply

Ignoring all the gory details, the code – here PHP – to fetch the list of servers in a group from Fabric is rather trivial. The example implements a class with three methods: fetch a list of all servers, a list of the read-only slaves, or the read-write master. From here on, the classic pattern of using one connection for read-only requests and one connection for writes could be used. Of course, this is not very comfy. This is where the various Fabric aware drivers try to deliver better options…

<?php
class fabric_ha {
  protected $server;
  protected $group_id;
  protected $xml = '<?xml version="1.0" 
encoding="iso-8859-1"?><methodCall><methodName>group.lookup_servers</methodName><params><param><value><string>%s</string></value></param></params></methodCall>';

  public function __construct($server, $group_id) {
   $this->server = sprintf("http://%s/",$server);
   $this->group_id = $group_id;
  }

  public function getServerList($role = NULL) {
    $xml = sprintf($this->xml, $this->group_id);

    $opts = array('http' =>
      array(
        'method'  => 'POST',
        'header'  => 'Content-type: text/xml',
        'content' => $xml
      )
    );
    $context = stream_context_create($opts);

    $ret = array();
    $result = file_get_contents($this->server, false, $context);
    $xml = new SimpleXMLElement($result);
    $server_list = $xml->xpath('/methodResponse/params/param/value/array/data/value[3]/array/data/value');
    foreach ($server_list as $server_entry) {
      $server = $server_entry->xpath('array/data/value');
      if (!is_null($role) && ($server[3]->xpath('string')[0]->__toString() != $role)) {
        continue;
      } 
      $ret[] = array(
                  'server_uuid' => $server[0]->xpath('string')[0]->__toString(),
                  'server'      => $server[1]->xpath('string')[0]->__toString(),
                  'master'      => $server[2]->xpath('boolean')[0]->__toString(),
      );
    }
    return $ret;
  }

  public function getReadWriteServer() {
    return $this->getServerList('PRIMARY');
  }

  public function getReadOnlyServer() {
    return $this->getServerList('SECONDARY');
  }
}

$f = new fabric_ha("127.0.0.1:8080", 'blog_repl');
var_dump($f->getReadWriteServer());
var_dump($f->getReadOnlyServer());


How do you get to know the XPath expressions I've used? That is an excellent question to ask at the next conference when you meet the designers of Fabric. Whatever, here's the script in action:

# php fabric.php 
array(1) {
  [0]=>
  array(3) {
    ["server_uuid"]=>
    string(36) "f159bc2e-a37a-11e3-ace2-000c299b2a06"
    ["server"]=>
    string(14) "127.0.0.1:3309"
    ["master"]=>
    string(1) "1"
  }
}
array(2) {
  [0]=>
  array(3) {
    ["server_uuid"]=>
    string(36) "28acd755-a37b-11e3-ace4-000c299b2a06"
    ["server"]=>
    string(14) "127.0.0.1:3308"
    ["master"]=>
    string(1) "0"
  }
  [1]=>
  array(3) {
    ["server_uuid"]=>
    string(36) "c3d24cc0-a37a-11e3-ace1-000c299b2a06"
    ["server"]=>
    string(14) "127.0.0.1:3310"
    ["master"]=>
    string(1) "0"
  }
}

Failover: faster than I can type…

If now the master of your MySQL Replication setup fails, Fabric may detect this and promote a slave to become the new master. I've shut down my master on 127.0.0.1:3309 and Fabric has instantly reconfigured the replication setup as illustrated below. On my unloaded notebook, the entire server-side failover takes less time than it takes me to write this sentence.

Fabric
server side failover performed
|
Fabric high availability group ‘blog_repl’
Master/Primary
127.0.0.1:3308
|
Slave/Secondary
127.0.0.1:3310

As soon as the master becomes unavailable, an application will see transaction aborts and connection errors. A Fabric aware application should now breathe for a second, ask Fabric for a fresh list of servers and then begin using the updated list. I've omitted the error handling code and show only the result of fetching a fresh server list.
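The omitted error handling could look roughly like the following sketch, which reuses the fabric_ha class from above. The retried error codes and the credentials are examples, not a definitive list:

```php
<?php
/* Sketch: on connection-level errors, pause, refresh the server
   list from Fabric and retry once against the new master. */
function query_with_failover(mysqli &$c, fabric_ha $fabric, $sql) {
  $res = $c->query($sql);
  /* 2002/2006/2013 are example client error codes for lost connections */
  if ($res === false && in_array($c->errno, array(2002, 2006, 2013))) {
    sleep(1); /* breathe: do not hammer Fabric */
    $masters = $fabric->getReadWriteServer();
    list($host, $port) = explode(":", $masters[0]['server']);
    /* connection passed by reference so the caller adopts the new one */
    $c = new mysqli($host, "root", "", "test", (int)$port);
    $res = $c->query($sql);
  }
  return $res;
}
```

A real application would retry more carefully (bounded retries, backoff) – the point is only that the refresh logic sits in the application when no Fabric aware driver is available.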

nixnutz@linux-dstv:~/src/php-src> php fabric.php 
array(1) {
  [0]=>
  array(3) {
    ["server_uuid"]=>
    string(36) "28acd755-a37b-11e3-ace4-000c299b2a06"
    ["server"]=>
    string(14) "127.0.0.1:3308"
    ["master"]=>
    string(1) "1"
  }
}
array(1) {
  [0]=>
  array(3) {
    ["server_uuid"]=>
    string(36) "c3d24cc0-a37a-11e3-ace1-000c299b2a06"
    ["server"]=>
    string(14) "127.0.0.1:3310"
    ["master"]=>
    string(1) "0"
  }
}


Note that I say your application shall take a breath: throttle Fabric requests. Fabric's slow Python HTTP XML RPC server cannot handle thousands of concurrent requests from all your PHP webservers (development preview…). There's a simple proof for this overload prediction: put Fabric in DEBUG mode and analyze what it does (it runs some 20 SQL statements per XML RPC).
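One way to throttle is a small TTL cache in front of the lookup. A sketch extending the fabric_ha class from above; APC is an assumption here, any shared cache will do:

```php
<?php
/* Sketch: cache Fabric server lists for a few seconds to avoid
   hitting the slow XML RPC endpoint on every request. */
class cached_fabric_ha extends fabric_ha {
  protected $ttl = 10; /* seconds; tune to your failover budget */

  public function getServerList($role = NULL) {
    $key = sprintf("fabric_%s_%s", $this->group_id, $role);
    if (false === ($list = apc_fetch($key))) {
      $list = parent::getServerList($role);
      apc_store($key, $list, $this->ttl);
    }
    return $list;
  }
}
```

The obvious trade-off: a TTL delays noticing a failover. After a connection error, the application should drop the cached entry and ask Fabric again.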

As you can see, Fabric may be of use to you even if there is no Fabric aware driver for your preferred programming language be it PHP, C/C++, Perl, Ruby or whatever else…

Happy hacking!

@Ulf_Wendel Follow me on Twitter

Novice user experience – ramblings

In discussions about Fabric, people sometimes express the hope that using MySQL Replication would become easier. Easier for common, small installations with, say, one master and two slaves. The good news is that Fabric can be used for such tasks, if using MySQL 5.6 or newer only. Fabric depends on MySQL Replication with Global Transaction IDs, introduced with MySQL 5.6:

# ./mysqlfabric group add blog_repl 127.0.0.1:3310 root ''
Procedure :
{ uuid        = f417492c-1a11-460b-87a5-fd8df9d811d8,
  finished    = True,
  success     = False,
  return      = ServerError: Server (0d4adf79-9329-11e3-8278-000c299b2a06) does not have the binary log or gtid enabled.,
  activities  = 
}


This is no big deal: configure MySQL to use GTIDs! The MySQL Utilities will not do this for you; all of them implicitly require some MySQL configuration. But as a first-time user who does not want to invest more than five minutes in his experiment, how will you know which of the 440 server configuration options (MySQL 5.7) to set?

The PDF version of the MySQL manual counts 3700+ pages. In order to profit from the comfort offered by the MySQL Replication Utilities you should preconfigure your MySQL server with the following settings:

  • Page 2054 – log_bin
  • Page 2055 – server_id, innodb_flush_log_at_trx_commit, sync_binlog
  • Page 2073 – gtid_mode, log-slave-updates, enforce-gtid-consistency

But do not miss the following, otherwise you end up in a chicken-and-egg game because some Utilities, including Fabric, use SHOW SLAVE HOSTS on occasion.

  • Page 1606 – SHOW SLAVE HOSTS
  • Page 2101 – report_host
  • Page 2101 – report_port
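Put together, a minimal my.cnf sketch covering the options above. Server ids, log names and ports are examples; adjust per server:

```ini
[mysqld]
server_id                      = 3309       # unique per server
log_bin                        = master-bin
log_slave_updates              = ON
gtid_mode                      = ON
enforce_gtid_consistency      = ON
# durability settings recommended for replication
innodb_flush_log_at_trx_commit = 1
sync_binlog                    = 1
# make the server visible to SHOW SLAVE HOSTS
report_host                    = 127.0.0.1
report_port                    = 3309
```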

Creating a HA setup with Fabric is peanuts and can be done in one minute. But how long does it take to get there?

PHP Memcache access to MySQL 5.7, faster? Redis?

2013/12/12 - by admin - 5 Comments

PHP users can use two client protocols to query MySQL 5.6 and later. Not only standard SQL access but also faster key-value access to InnoDB tables is possible using the Memcache protocol. The MySQL benchmark team reports crazy figures. Of course, on hardware that makes the average PHP meetup visitor roll his eyes and say “yeah, Oracle, *yawn*…”. I’ve repeated my plain PHP benchmarks on an i3 desktop. And, I’ve added Redis to the game.

Short recap

Some ten years ago, colleagues taught me that some 50% of the time processing a simple SELECT column_a, column_b FROM table WHERE pk = <x> is spent on parsing and processing SQL. The actual data access contributes only 50% of the overall run time. Thus, MySQL has always offered some lower storage layer API access through the HANDLER command. Fast forward ten years: you can query MySQL either through SQL, as ever since, or through the Memcache protocol. The latter will bypass the SQL layers, get you directly to the storage and thus be faster. The presentation has the details.

Given how popular read-only queries like SELECT column_a, column_b FROM table WHERE pk = <key> can be, it may be worth scanning your code for them and replacing them with ~2x faster Memcache accesses. PECL/mysqlnd_memcached tries to do this in an automated fashion. It matches your queries against a pattern and tries to replace a key-value style SQL access with a MySQL Memcache access. Nice in theory, but the pattern matching may take so much time that it is not worth it (see also here). Please, run your own tests.

PHP
ext/mysqli, PDO_MYSQL, … PECL/memcached
|
SELECT column_a, column_b FROM table WHERE pk = <key> GET <key>
|
InnoDB SQL table

Even if the automatic mapping from SQL to the Memcache protocol may be too slow, manually replacing some calls certainly is not.
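A manual replacement could look like this sketch using PECL/memcached against the InnoDB Memcache plugin. The port, the container name and the column mapping are assumptions; they depend on your innodb_memcache.containers setup:

```php
<?php
/* Sketch: key-value read through the InnoDB Memcache plugin instead
   of SQL. Assumes the plugin listens on port 11211 of the MySQL host
   and a container "my_container" maps the table's pk and value columns. */
$m = new Memcached();
$m->addServer("127.0.0.1", 11211);

/* SQL equivalent: SELECT column_a, column_b FROM table WHERE pk = 'key100' */
$value = $m->get("@@my_container.key100");
if ($value !== false) {
  /* multiple mapped value columns come back joined by the configured
     delimiter, '|' by default */
  list($column_a, $column_b) = explode("|", $value);
}
```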

MySQL 5.6 Memcache vs. MySQL 5.7 Memcache vs. Memcache vs. SQL

To see whether MySQL 5.7 Memcache access is faster than MySQL 5.6 Memcache access, I've rewritten my PHP benchmark script. The basic procedures are exactly as before. I'm using PHP 5.6.0-dev with PECL/memcached 2.2.0b1 (libmemcached 1.0.16), mysqli, phpiredis. MySQL 5.7.3-m13, MySQL 5.6.15, Memcache 1.4.15, Redis 2.8.2 and PHP have been compiled from source using all defaults. The computer being tortured is a budget i3-2120T CPU @ 2.60GHz, 8GB RAM, Linux RAID-0, ext4, OpenSuse 12.1 desktop PC.

Here are the results for random key/value accesses with a key length of 100 bytes, values of 400 bytes length and 25,000 values in total. All reads have been repeated a couple of times to ensure that runtimes are not too short. PHP, Memcache and MySQL all run on one box. PHP uses heavyweight fork() to spawn workers for concurrent access to the data stores. The graph shows the average number of operations per second as observed by a single PHP worker. If running on the box 1x MySQL 5.7 w. Memcache and…

  • 1x PHP process, PHP script does 21556 ops
  • 2x PHP processes, each PHP does on average 21538 ops
  • 3x PHP processes, each PHP does on average 17816 ops

As expected, the small CPU cannot handle more than 4-8 concurrent PHP processes sending queries as fast as they can: beyond that point, each PHP process will observe a significantly lower number of operations per second. Please note the graph does not show the data stores' view, which would be the total number of answers to all clients per second.

If you prefer a different graph: the benchmark script is at the end of the blog posting :-).

MySQL 5.6 vs. MySQL 5.7 Memcache results

The results confirm prior findings:

  • MySQL 5.7 Memcache InnoDB access to SQL table reaches ~75% of the performance of a cache
  • MySQL 5.7 Memcache access is ~2.2x to ~2.5x faster than SQL
  • MySQL 5.7 seems ~ 20% faster than MySQL 5.6

It is cool to see a SQL system come close to the performance of a cache but the rest of the story is boring: new version, seems a bit faster on small boxes, could be way faster on Facebook-style boxes.

MySQL vs. Memcache vs. Redis

Comparing with Redis is hard but fun. The REmote DIctionary Server is more than a cache storing BLOBs. It handles complex data structures, such as lists or sorted maps. Redis offers neat publish-subscribe messages queues and much more. Still, it can be used as a cache. So can MySQL. Any quick benchmark is doomed to ignore this. So do I.

I am also ignoring the asynchronous API of Redis, which may bear significant tuning potential. Our MySQL Cluster folks love the asynchronous API for their benchmarks…

Redis is single threaded. One Redis instance cannot utilize more than one CPU thread. Redis likes fast CPUs but does not know what to do with a multi-core monster. This effect does not kick in only on crazy 48-way boxes but already on my i3 desktop. MySQL uses all cores and CPU threads it gets; Redis is limited to a single CPU thread. Thus, for a fair comparison, I have to start multiple Redis instances. It is then the application developer's task to pick the appropriate Redis instance.

Core 1
CPU Thread 1 – Redis CPU Thread 2
Core 2
CPU Thread 3 CPU Thread 4

An i3 has two cores and, counting in hyper-threading, four CPU threads. Therefore, I've compared MySQL and Memcache with 1, 2 and 3 Redis instances running. PHP workers are assigned to the Redis instances in a round-robin fashion based on the worker's process id. All instances are loaded with the same data set for the read-only test.
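The worker-to-instance assignment boils down to a few lines. A sketch; the instance list is an example:

```php
<?php
/* Sketch: assign a PHP worker to one of n Redis instances
   round-robin, based on the worker's process id. */
function pick_redis_instance(array $instances, $pid) {
  return $instances[$pid % count($instances)];
}

$instances = array("127.0.0.1:6379", "127.0.0.1:6380", "127.0.0.1:6381");
printf("Worker %d uses %s\n", getmypid(),
       pick_redis_instance($instances, getmypid()));
```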

Core 1
CPU Thread 1 – Redis 1 CPU Thread 2 – Redis 2
Core 2
CPU Thread 3 – Redis 3 CPU Thread 4

Here's the result: MySQL 5.7 InnoDB Memcache beats a single Redis 2.8.2 on read performance – if, and only if, you use Redis inappropriately. If you bite the bullet and, for example, partition your data across many Redis instances, then Redis is faster. However, at least on my desktop, the difference is not in the order of magnitudes, which is quite amazing for a SQL database.

MySQL 5.7 vs. Memcache vs. Redis

Closing ramblings

In the past decade, MySQL has constantly managed to utilize the latest generation of commodity hardware CPUs efficiently. Whether it was 4-core, 8-core or now 48-core (and more) CPUs, we've been there. In the past three years, since HandlerSocket appeared, there has been a hunt for higher and higher benchmark results in terms of queries per second. I understand that MySQL flagship customers demand such performance.

However, when speaking at a local PHP meetup, a 48-core benchmark is worth little more than a good laugh. First, the machines are out of reach: the majority of the audience will use 8-core/32GB max, and some have fewer than three such machines in total. Second, as exciting as it is to learn that MySQL can do 1,000,000 queries/s, the PHP meetup visitor sometimes wants to hear about how his job as a developer becomes easier when using this or that data store.

Again, MySQL read performance is close to cache performance. Wow! Maybe writes could be even faster – please run your own tests. My benchmark script is below. It can do writes as well.

But, what about usability and features that make application development easier? Did we forget about that?

Happy hacking!

@Ulf_Wendel Follow me on Twitter

Benchmark script used

Download Benchmark script (PHP, 21k)

Data massage: how databases have been scaled from one to one million nodes

2013/12/03 - by admin - 0 comments

Despite the CAP theorem, databases have been scaled from one to one million nodes. Globally distributed, highly available and fast. Databases supporting ACID transactions. How? A story and slides from a three hours workshop at the PHP Summit 2013 (Berlin)…

Please note: long, deep dive, quite some theory. And, neither funny NoSQL bashing nor a practical guide to MySQL scaling. Only 10 MySQL slides or so…

As we all know, traditional relational databases have seen massive pressure from NoSQL in recent times. They have been considered inflexible, slow and not designed for the cloud. Or just too expensive, if massive scalability was needed. It was argued the CAP theorem basically means picking "two out of three" of consistency, availability and partition tolerance. Not long ago, Brewer himself said a word or two on the abuse of CAP for buzz.

The talk starts in the '60s and '70s with relational databases before it explores CAP and early NoSQL systems. Two early systems at the opposite ends of the design space CAP allows are Amazon Dynamo and Bigtable. They became blueprints for so many distributed NoSQL systems: Riak, HBase, BigCouch, … Then came Google Spanner (ACID transactions), which is highly available, replicates synchronously between data centers and is fast – how?

Some distributed NoSQL systems innovate at the borderline between distributed systems theory and the theory of parallel and distributed databases. Unless you work with a system at one of the extremes of the CAP design space, such as Riak, this is never much visible to you. Time to look at their algorithms and poke around the question what their findings may mean to relational systems that support ACID transactions.

It turns out that it could be fruitful to learn from each other. Some NoSQL folks seem to dig into distributed SQL query execution and on-disk storage – topics from the theory of (parallel) databases. RDBMS and NoSQL can intersect in topics such as co-location in the presence of partitioning to avoid distributed queries/transactions. RDBMS can try to move towards microkernels and loosely coupled modules – something no classic RDBMS developer would ever do as it may mean longer code paths (= slower)… – and so forth.

Whatever. You can look at NoSQL vs. RDBMS from a "relation vs. JSON", "joins are evil" perspective. Or you dig deeper to see how they scale from one to one million nodes. Unfortunately, you may come across topics that may be new to you as a (PHP) application developer:

  • Overlay networks, distributed hash tables
  • Vector clocks, quorums, CRDT’s
  • FLP impossibility theorem, Paxos protocol family
  • Virtual Synchrony, Atomic Broadcast

If you don’t mind that, you may answer the question whether Spanner proves CAP wrong yourself. The talk will also help you to judge whether the CAP theorem matters at all in practice!

Happy hacking!

@Ulf_Wendel Follow me on Twitter