What if PECL/mysqlnd_ms does not implement failover?

2011/02/24 by admin

Implicit and automatic connection failover done by the driver is broken by design, correct? Should PECL/mysqlnd_ms implement automatic failover? Today Johannes showed me an IRC snippet about TAF (Transparent Application Failover). Maybe, I fail miserably but I don’t see a need for automatic failover in a master slave replication loadbalancing plugin. Let the application do it. In the end, this is also what TAF does – via a callback.

Say you run a query. The master slave load balancing plugin recognizes that it is a read query, a SELECT. The plugin picks a server for query execution, for example, the first of the configured slaves. The plugin sends the query to the slave but the slave is not reachable. Today, PECL/mysqlnd_ms should report an error. The plugin will not try to failover to another server.


/* A function from your database abstraction layer */
function my_query($mysqli, $query, $retry_count = 0) {

 if ($retry_count > 3)
    return false; 

 if (!($ret = $mysqli->query($query))) {
   /* Error: 1053 SQLSTATE: 08S01 (ER_SERVER_SHUTDOWN)  */
   if (1053 == $mysqli->errno) {
     /* ouch, slave is not available, try next... */
     return my_query($mysqli, $query, ++$retry_count);    
   } 
 }

 return $ret;
}

/* myapp refers to a group of master/slave servers */
$mysqli = new mysqli("myapp", "user", "password", "database", "port");
my_query($mysqli, "SELECT * FROM table");

(Remember that with PECL/mysqlnd_ms $mysqli represents a connection pool)

With today’s proof-of-concept your application is responsible for handling the unavailability of a server. Your application must handle database errors anyway: even MySQL can fail, believe it or not. In the example, I catch error 1053 (ER_SERVER_SHUTDOWN) as a possible hint that a slave has become unavailable. If so, I’m running the same query again. The plugin will – depending on the configured load balacing mode – pick another server and, run the query on, say, the second configured slave. A clean, simple design, no spaghetti. The application knows what to do, application decides.

Again, how transparent can (shall) the plugin be?

Once could argue that the plugin is no longer truly transparent. True, the application must be aware of replication being used and the application must have code handling what to do if an automatically selected slave server becomes unavailable. But the plugin cannot be 100% transparent anyway, as explained earlier in PHP: Master slave load balancing in mysqlnd – how transparent?. For example, the plugin cannot track user-defined SQL variables and ensure that those SQL variables have the same value on all configures MySQL servers.

Instead of giving up early we could also implement the above PHP logic (or a similar one) in C and do automatic failover within the plugin. But then, what will happen with this code sequence if there is a need for implicit failover when running the third query? When running the third query it is not allowed to switch the server, one must not break apart the unit of work. Instead of a unit of work constructed with SQL hints , I could have also used a SQL transaction to show my point.


$mysqli->query(sprintf("/*%s*/SET @a=1", MYSQLND_MS_SLAVE_SWITCH));
$mysqli->query(sprintf("/*%s*/SET @a=@a+1", MYSQLND_MS_LAST_USED_SWITCH));
$mysqli->query(sprintf("/*%s*/SELECT @a as _a", MYSQLND_MS_LAST_USED_SWITCH));

Support loves you: a manual full of execeptions…

If automatic failover cannot be handle a situation in a transparent way, it needs to be documented. I fear the list of exceptions will be long. People will be trapped, support will be busy (good for the business!) and in the end the recommendation from support to the customer will be to turn off automatic failover.

Let’s not do it then. Let’s educate the users to do failover themselves. Let’s not have callbacks like with TAF that move error handling (= failover) from where it belongs (in your app, where you run the query) to a callback which you attach afterwards to your app because you do not want to refactor it for its new duties. Let’s just accept that some things must be handled by the application. Such as failover.

I’m curious to read the blog comments… maybe I’ll be called crazy, … maybe some say I’m that wrong its not worth commenting…