Ulf Wendel

PHP: How mysqlnd async queries help you with sharding!

Recipe for conference talks: add a buzzword like "sharding" and show how mysqlnd helps you with asynchronous queries. Only two PHP database extensions give you asynchronous queries: mysqlnd and Postgres. Slides (OpenOffice) from the International PHP Conference 2008 have the details about the asynchronous query API in ext/mysqli (using mysqlnd).

Mysqlnd Async Ipc2008
View SlideShare presentation or Upload your own.

Asynchronous, non-blocking IO is an old favourite. Although its an old idea, and it is a well known technology that is supported in many environments, it is still new for PHP database extensions. A year ago, I blogged about parallel and asynchronous queries in mysqlnd. Shortly after we had it implemented but never tested it and therefore never pushed it forward. Parallel queries, using a background thread in PHP, did not turn out to fly. But asynchronous queries have some potential.

synchronous vs. asynchronous

While writing these lines, I am listening to a talk of David Sorria Parra about sharding. Sharding is an attempt to overcome the limitations of single database servers. Its about horizontal scaling over many machines by splitting your schema and distributing load over many machines. Sharding has a huge number of challenges involved: joins, unions, intersections, grouping, selection on groups, aggregation, primary key, referential integrity, (de-)normalization – plenty of things work different if you split your schema over multiple machines.

One thing is for sure about sharding: sometimes you will have to query multiple servers to manually compute, for example, an aggregation. Because there is no parallel execution in PHP, you do all the queries sequentially. Query wait times sum up. Not with mysqlnd: send out all the queries, do not wait for the results, do something useful and fetch results when available.

$all_links = array($shard1, $shard2, $shard3);
$processed = 0;
do {
	$links = $errors = $reject = array();
	foreach ($all_links as $link)
		$links[] = $errors[] = $reject[] = $link;		
	if (0 == ($ready = mysqli_poll($links, $errors, $reject, 1, 0))
		continue;	

	foreach ($links as $k => $link) {
		if ($res = mysqli_reap_async_query($link)) {			
			mysqli_free_result($res);
			$processed++;

		}
	}
} while ($processed < count($all_links));

Judging from the feedback I got during the talk, its time to publish the asynchronous API. Though, we need to cut off some edges. But that is a matter of days or weeks. Its not a matter of months or years. For now, please find more technical details in the slides.

A last quick comment from the conf. After one year of yelling at Andrey, Johannes gave me an experimental mysqlnd branch which allows you to plug-in a PHP Stream filter into the communication between mysqlnd and the MySQL Server. In other words: you could write a transparent proxy and do stuff like query rewriting, caching and so on inside your PHP application. However, that is evil and unstable – let us talk about it next year again…

2 Comments

  1. Just to put oil the fire, I had experimented yesterday with batching of asynchronous queries. What is this? The implementation which Ulf tested doesn’t allow to send second (non)-async query before reaping the result from the previous async query. Well, I modified mysqlnd to overcome this limitation. Why? Because I wanted to pump as much data (INSERTs) as possible to MySQL and then reap the results. What happened?
    If I pump everything and then start reaping there was like 5% performance gain. Not much.
    Here are some results:
    (flag=1-async queries, flag=0 sync)
    Rows in the table 20000
    Flag=0 Total: 3.1503
    Flag=1 Total: 2.8451

    Flag=0 Total: 2.9478
    Flag=1 Total: 2.8593

    As you see. Doesn’t bring much. 20k lines are inserted in a loop. One value at a time.

    Then, I decided to see if multi-value inserts will bring much. It didn’t…Well, looking at how to find where most time is spent, I added one one microtime() in my benching script. …and what a surprise. Pumping data to the MySQL server is extremely quick, with ASYNC! Most of the time is spent on waiting for the result. Let me show you some results:
    (flag=1-async queries, flag=0 sync)
    Rows in the table 40000
    Flag=0 Total: 3.3813 Query=3.3812
    Flag=1 Total: 1.5787 Query=0.1828

    Flag=0 Total: 3.1936 Query=3.1935
    Flag=1 Total: 1.4711 Query=0.1130

    Do you see it? Query time is just 18/338 ~= 5%!!! 5% is consumed for pumping data into MySQL, which was running on /dev/shm, so no IO problems. You see also two different totals, which differ by about or more than 50%.

    I did see even better results than 50%, about 60-70% lower runtime, because there is a sleep of few seconds. Wall clock time wasn’t much different but being able to sleep, means that you can re-use the time for something else…like using second connection for pumping even more data. At the end, using 2 or 3 connections you can pump twice or triple of the data.

    BEWARE: it seems it depends on how big your statements are. Multi-value statements of up to 100 values did work well, but going over lowered the performance. So, take this numbers with a bit of a salt, but know that there is a potential there.

    My script:
    query(“truncate tint”);
    $start = microtime(1);
    for ($i=0; $i query($insert, $flag? MYSQLI_ASYNC:0);
    }
    $mid = microtime(1);
    sleep($sleep);
    if ($flag) {
    for($i=0; $iquery($q)->fetch_object()->count);
    if ($flag) {
    echo “\n”;
    }
    }

    ?>

  2. I have to say that all this looks very promising, any chance to see any of that asynchronous goodness in PDO, even in a hackish way, to work around PDO limitations?