Ulf Wendel

PHP: PECL/mysqlnd_ms 1.6 – automatic retry loop for transient errors

PECL/mysqlnd_ms is a client-side load balancing driver plugin for PHP MySQL that aims to increase distribution transparency when using any MySQL based cluster: failover, read-write splitting, abstraction on consistency (e.g. read-your-writes), partitioning/sharding support, … it’s all there. Until a few minutes ago, we had no special handling of transient errors. Sometimes a database server replies “come back in a bit and retry, no need to fail over yet”. And that is what the client should do before giving up. PECL/mysqlnd_ms 1.6 (development version) is capable of hiding the retry loop, which makes it easier to use any existing PHP MySQL application with a cluster of MySQL servers.

Transient (temporary) errors are rarely observed with MySQL Replication but can be seen with MySQL Cluster. MySQL Cluster is an eager (synchronous) update anywhere (multi-master) cluster: all replicas accept reads and writes, and replication is synchronous. See also the slide deck DIY: A distributed database cluster, or: MySQL Cluster for a brief introduction to the distributed database theory relevant to MySQL users (presentation from the International PHP Conference 2013 Spring Edition).

Transient errors

MySQL Cluster scales well for write loads because it features transparent sharding (see slides). It automatically partitions data over multiple replicas. Over time, for example when replicas are added to the cluster, data may be redistributed. Rebalancing is an online operation: it does not lock out clients. Thus, you may observe a temporary error such as:

ERROR 1297 (HY000): Got temporary error 1204 'Temporary
failure, distribution changed' from NDBCLUSTER 


There may be other causes for temporary errors as well. In any case, it’s safe to ignore a 1297/HY000 and retry the command.

The latest versions of MySQL Cluster feature an implicit retry loop before returning the error to the client, if it is believed that your command is not time critical. That means Cluster resends the command for you a couple of times, with a short sleep period in between, before returning control to the client to report the temporary problem. PECL/mysqlnd_ms 1.6 alpha has a similar loop: very basic and experimental. Here’s the idea.

Automatic retry loop

The dream of Andrey when he created PECL/mysqlnd_ms was to make using a cluster transparent. It should be possible to move an application from a single MySQL to a cluster of MySQL servers without code changes. Thus, as a first step, I have opted against offering a callback to decide on errors (like Connector/J does). Instead, it is possible to configure the retry loop in the config file.

The example config snippet instructs the driver plugin to start an implicit command retry loop when there is an error with the error code 1297. It’s possible to configure a list of arbitrary error codes. Whenever 1297 happens, the command is retried up to max_retries = 2 times. Between the retry attempts PECL/mysqlnd_ms 1.6 sleeps for usleep_retry = 100 milliseconds. In an ideal world, the temporary error is gone by the end of the wait loop. In the worst case, the error persists and is forwarded to whatever PHP API you use (mysqli, PDO_MySQL), leaving it to your application to deal with it.

{
  "myapp": {
    [...]
    "transient_error": {
      "mysql_error_codes": [
        1297
      ],
      "max_retries": 2,
      "usleep_retry": 100
    }
  }
}
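
To make the semantics of these settings concrete, here is a rough userland sketch of such a loop. It is not the plugin's code (the plugin does this inside the driver), and the function retry_transient(), its parameters and the fake command are made up for illustration. It assumes, as described above, that only listed error codes are retried, that the number of retries is capped, and that the sleep value is in milliseconds.

```php
<?php
// Illustrative sketch only: retry_transient() is a hypothetical helper,
// not part of PECL/mysqlnd_ms.

/**
 * Run $command and retry it while it reports a transient error code.
 *
 * $command must return [errorCode, result]; error code 0 means success.
 * Returns [errorCode, result, retriesPerformed].
 */
function retry_transient(callable $command, array $transientCodes, int $maxRetries, int $sleepMs): array
{
    $retries = 0;
    [$errno, $result] = $command();

    // Retry only error codes from the list, and at most $maxRetries times.
    while ($errno !== 0 && in_array($errno, $transientCodes, true) && $retries < $maxRetries) {
        usleep($sleepMs * 1000); // PHP's usleep() takes microseconds
        $retries++;
        [$errno, $result] = $command();
    }

    // If the error persists, it is handed back to the caller unchanged.
    return [$errno, $result, $retries];
}

// Fake command standing in for a query: fails twice with 1297, then succeeds.
$calls = 0;
$fake = function () use (&$calls): array {
    $calls++;
    return $calls <= 2 ? [1297, null] : [0, 'ok'];
};

[$errno, $result, $retries] = retry_transient($fake, [1297], 2, 100);
printf("errno=%d result=%s retries=%d\n", $errno, var_export($result, true), $retries);
```

With max_retries = 2 the fake command above just about succeeds; had it failed a third time, the caller would see 1297, exactly like an application without any retry loop.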


Please send us your feature requests: this is a "live report" from the hacking and nothing is set in stone.

You can check whether an implicit retry loop has been performed by inspecting the statistics provided by PECL/mysqlnd_ms.

$stats = mysqlnd_ms_get_stats();
printf("Implicit retries to hide transient errors: %d", 
  $stats['transient_error_retries']);
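
The counter is cumulative, so to learn whether one particular piece of work triggered retries you can compare it before and after. A small sketch, assuming only the transient_error_retries key shown above; retries_during() is a hypothetical helper, and the statistics source is injectable so that the snippet also runs on a machine without the plugin.

```php
<?php
// Sketch: count the implicit retries caused by one piece of work.
// retries_during() is a hypothetical helper, not a plugin function.

function retries_during(callable $work, callable $getStats): int
{
    $before = $getStats()['transient_error_retries'] ?? 0;
    $work();
    $after = $getStats()['transient_error_retries'] ?? 0;

    // Cast defensively in case the counters come back as strings.
    return (int) $after - (int) $before;
}

// With the plugin loaded, use the real statistics function;
// otherwise fall back to a stub that simulates two retries.
$simulated = 0;
$getStats = function_exists('mysqlnd_ms_get_stats')
    ? 'mysqlnd_ms_get_stats'
    : function () use (&$simulated): array {
        return ['transient_error_retries' => $simulated];
    };

$work = function () use (&$simulated): void {
    // In a real application: run a query on a mysqlnd_ms connection here.
    $simulated += 2;
};

printf("Retries during this call: %d\n", retries_during($work, $getStats));
```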

Failover vs. transient error

When talking to a cluster instead of a single machine there are two additional error conditions to handle:

  • Permanent error: replica disappeared, forget about replica – for now: fail over to someone else…
  • Transient error: replica says BRB/BBIAB, retry – replica is synchronizing, data distribution changes, …
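
The two cases can be pictured as a simple decision on the error code. The sketch below is only an illustration of that idea: classify_error(), the action names and the list of "connection lost" codes are assumptions of mine, not the rules PECL/mysqlnd_ms applies internally.

```php
<?php
// Illustrative sketch: the function name, the action constants and the
// code lists are assumptions, not the plugin's actual logic.

const ACTION_RETRY    = 'retry';     // transient: same replica, try again shortly
const ACTION_FAILOVER = 'failover';  // permanent: forget the replica for now
const ACTION_REPORT   = 'report';    // anything else: hand the error to the application

function classify_error(int $errno, array $transientCodes, array $connectionLostCodes): string
{
    if (in_array($errno, $transientCodes, true)) {
        return ACTION_RETRY;
    }
    if (in_array($errno, $connectionLostCodes, true)) {
        return ACTION_FAILOVER;
    }
    return ACTION_REPORT;
}

$transient      = [1297];                   // temporary error reported by MySQL Cluster
$connectionLost = [2002, 2003, 2006, 2013]; // common client-side connection errors (assumed list)

foreach ([1297, 2003, 1064] as $errno) {
    printf("%d => %s\n", $errno, classify_error($errno, $transient, $connectionLost));
}
```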

PECL/mysqlnd_ms applies failover logic whenever it connects to a replica. This can happen – due to lazy connect – not only when a connect() function of any PHP MySQL API is called but also during query(). At the time of writing, the retry loop is not applied for a connect attempt.

The new 1.6 transient error logic handles error conditions on already established connections. At the time of writing, it only covers query(). It’s a safe bet to assume that we will cover all commands before the feature is called stable. Work in progress, comments are welcome.

Happy hacking!

@Ulf_Wendel Follow me on Twitter

PS: The overdue 1.5 stable release is coming soon. We forgot about it, simple as that.
