PHP: parallel, background, asynchronous fetch

Given the very low number of issues reported to us, we can assume that the core features of mysqlnd have matured. Therefore, we spend some time discussing parallel, background and asynchronous fetch ideas for mysqlnd. All have been on our brainstorming list from the beginning but it was just too early to even discuss them before these days. None of the ideas is new in any way as a recent discussion on the PHP Internals list on parallel database queries has shown. However, people seem to expect different things from it and are not sure how to implement it. We trapped into this ourself when Andrey wrote a background fetch proof-of-concept patch recently. Back one step: what is this about, what do you want, please comment!

The status quo

PHP scripts are not threaded. A PHP scripts cannot follow two distinct execution paths at the same time, unless a PHP extension provides measures to do so. PHP itself does not support parallel activities. This implies that it is not possible to perform background operations. When issuing a database query PHP waits for the database server to execute the query. PHP is blocked. How long PHP is blocked depends on the database interface. When using MySQL, PHP is either inactive until the MySQL Server has calculated the query result (unbuffered results, Prepared Statements) or until the query result has been calculated and all query results have been transferred to PHP (buffered results). The illustration shows the most common case: mysqli_query() used with default settings which causes a buffered result set.

Buffered fetch in detail

The two main causes for wait situations in this setup are highlighted using a yellow star symbol.

  1. PHP is blocked during the query execution
  2. PHP is blocked until results have been fetched and decoded

Is fetching too fast?

Fetching and decoding results is a very fast operation most of the time. If not, your network connection is the bottleneck, but not PHP. Andrey has tried to optimize the fetch operation using a background thread. We tested a patched version of mysqlnd which features a dedicated IO thread to perform an asynchronous, non-blocking fetch. That means, even for buffered results, mysqli_query() returned immediately after the query execution but before fetching any data. Fetching data was performed by the IO thread in parallel to PHP script execution. Did it help? Not really. Often the IO thread was done before the Zend Engine was able to execute the statement which followed the mysqli_query() call. With other words: fetching is so fast that you will hardly be able to use the time frame of the fetch operation for parallel activities from a PHP programmers view.

Background fetch: too fast?

Judging from the very first attempts with background fetch: the benefit is that one might be able to close a result set on the server a little earlier. However, if its up to PHP to call mysqli_free_result() and the fetch has been performed even before PHP can call any of the mysqli_fetch_*() functions, you gain very little from it. Actually, so far we have not been able to find a benchmark which proves this technique to be a good one.

If you are curious to know how the API could be like, here you go. But as said, it seems not very promising on the first look.

$con = mysqli_connect(...);
	
// returns after query execution and before buffering
$res = mysqli_query($con, 'SELECT ...', MYSQLI_BG_STORE_RESULT);
	
// quite often fetching has been done before we reach this line...
print mysqli_fetch_row($res);
	
mysqli_free_result($res);
mysqli_close($con);

The fun is in the query runtime

A more promising approach is to concentrate on the query runtime. It can range from milliseconds to hours. But take something more typical, say your web site always runs three independent queries with execution times of 500ms, 1000ms and 1500ms each. Let it take 100ms each to process the database results. Using the current synchronous API you’re done after 3300ms and you have spend 3000ms waiting. Using an asynchronous API, you are done after 1600ms and you have spend 1300ms waiting. Put in different numbers and you have different relations between synchronous and asynchronous. However, the potential becomes quite obvious.

t (ms) Synchronous Asynchronous
0 Send query 1 (500ms) Send query 1 (500ms), 2 (1000ms), 3 (1500ms)
500 Process query 1 (500ms) Process query 1 (500ms)
600 Send query 2 (1000ms) Done with query 1 (500ms)
1000   Process query 2 (1000ms)
1100   Done with query 2 (1000ms)
1500   Process query 3 (1500ms)
1600 Process query 2 (1000ms) Finished
1700 Send query 3 (1500ms)  
3200 Process query 3 (1500ms)  
3300 Finished  

The need for distinct connections

The MySQL Client-Server Protocol has no build-in support for asynchronous query execution. The state transition is always SEND - RECEIVE. After sending a query you always have to fetch its result in the very next step. You can send multiple statements at once - mysqli_multi_query(). But this does not free you from having to do the fetch right after the send: SEND(q1, q2) - RECEIVE(r1, r2). Given this, you will understand that an asynchronous PHP API cannot be build around just one connection. Within one connection you simply cannot run queries in parallel. If you try so, you will always stack them up and you will end up with the already existing. You will end up with mysqli_multi_query().

No need for a background thread?

Assuming that the key is in reducing wait times for query results, judging from our first experiences with a background thread for fetching and keeping in mind that simple solutions are the best to start with, there is little need for a sophisticated background thread design. Power users might profit from it, but most users should be perfectly fine with a simple poll()-based approach. Even that might be of use for only few users.

Suggesting an asynchronous API

A possible API consists of a new flag for mysqli_query() and a new function: mixed mysqli_poll(array $result_handles, int $timeout [, array &$invalid_handles]). A new flag MYSQLI_ASYNC makes mysqli_query() return to the PHP script immediately after the query has been send to the MySQL Server. Like ever, mysqli_query() returns a result handle. Using mysqli_poll() the PHP developer can check if a result handle is ready to send data because the corresponding query has been executed on the server. The function takes a list of result handles and returns whichever handle is ready for processing. If no handle is ready after $timeout milliseconds, mysqli_poll() returns false. It might happen that a handle becomes invalid while you wait for the query execution. For example, the network connection might get closed. mysqli_poll() could, this is just a suggestion, indicate this through an optional third parameter which is passed by reference. Its a bit like exec() or system().

Here comes some pseudo-code to demonstrate the suggested API.

// for two parallel queries, we need two connections
$con1 = mysqli_connect(...);
$con2 = mysqli_connect(...);
	
$results = array();
// returns immediately
$results[] = mysqli_query($con1, 'SELECT ...', MYSQLI_ASYNC);
$results[] = mysqli_query($con2, 'SELECT ...', MYSQLI_ASYNC);
	
// housekeeping
$timeout = 100;
$invalid_handles = array();
$results_left = count($results);
$num_timeouts = 0;
	
do {
  // poll result handles for $timeout milliseconds
  // return as soon as a handle is ready for reading
  // report errors through third parameter - similar to exec()
  $res = mysqli_poll($results, $timeout, $invalid_handles);
	
  if (!empty($invalid_handles)) {
    // issues with one of the handles, e.g. connection down
    foreach ($invalid_handles as $key => $handle) {
      printf('Problem with handle %d - %s',
        $key, mysqli_error($handle));
      $results_left--;
    }
  }
	
  if (is_resource($res)) {
    // result handle $res has data
   while ($row = mysqli_fetch_assoc($res))
     print_r($row);
   mysqli_free_result($res);
   $results_left--;
 } else {
   // timeout
   $num_timeouts++;
   if ($num_timeouts > 3)
     break;
 }
	
} while ($results_left > 0);

Ok, where’s the code?

Sorry, no code. First, we are all busy with other assignments these days and it will be a major task to implement an asynchronous API. The blog describes a potential way for buffered “normal queries”. It does not touch unbuffered result sets and Prepared Statements or even multi query. We won’t be able to implement a good solution within days.

Second and foremost: what do you think about it? Please comment.

17 Kommentare to “PHP: parallel, background, asynchronous fetch”

  1. Thorsten schreibt:

    wow, this would be awesome for some of our customer projects! Please implement it!

    -Thorsten

  2. Hans Ramon schreibt:

    You guys from MySQL are awful: Not working on a good sever product only, also taking care of PHP api and it’s community. If you will make this happen, this will be the biggest add on for PHP and MySQL since years.

  3. Administrator schreibt:

    Ok, great. Good to hear back from community.

    It is a feature for only few users. But those users who deal with logical independent queries that have run times beyond a few milliseconds, are likely to see a notable boost as the example runtime calculation shows. However, we need to try it out to be sure. We put some hope in background fetching and the first prototype was sort of disappointing. Lets keep the fingers crossed its not the same with true asynchronous queries.

    So, when? As said earlier: its not on the high-priority list. Sorry! I really hope it does not take again a year before we can get back to this idea. The idea got on our list in November 2006. Now its November 2007. I’m not the person to decide, but I’d be surprised if it takes until November 2008.

  4. Mike schreibt:

    Hans: IT Is the greatest contribution to PHP and Mysql since years :)

  5. Andris schreibt:

    I suggest a simpler API: you give an array of SQLs to a function, mysql_paralell_query(), and it will only return if all queries have finished. The feature you’d like to implement is welcome, but it has to be simple to use.

  6. Administrator schreibt:

    Andris, what you are asking for might already exist: mysqli_multi_query - http://de2.php.net/mysqli_multi_query . Send multiple queries at once, get all result sets at once after all queries have finished (blocking, synchronous).

  7. Andris schreibt:

    I think mysqli_multi_query runs each query one by one on one connection, while mysql_paralell_query would run them in paralell, as you suggest in this article.

  8. Andris schreibt:

    …I see. It’s not exactly the same as you suggest, but almost. You would get all data after the longest query finished.

  9. Administrator schreibt:

    Hi,

    let’s recap for a second. There are two distinct techniques that can easily be mixed up especially when blogging about them with a misleading blog title: parallel, background, asynchronous execution.

    When sending requests parallel execution can happen on the side of the callee (e.g. MySQL Server) and the caller (e.g. PHP). Parallel execution is independent of the question whether the caller is blocked (synchronous) during the execution time of the callee or not (asynchronous).

    Parallel execution on the MySQL Server is possible using multiple connections. On each connection you can execute only one request per time. Thus, there is no parallel query execution within *one* connection with MySQL. And there’s nothing an API can do about that. Sure, MySQL can handle multiple queries in parallel, *but* this requires that you send each query on its own connection…

    Parallel execution in a PHP script is currently not possible. PHP does not feature parallel execution (ok, you can try to emulate using http requests to other scripts, but…). An extension writer can add parallel execution to PHP. An extension, which operates on the C level, can create threads. And the extension can do some work inside the new thread while PHP continues to run in its own thread. Using one additional thread is what I call background execution.

    By asynchronous I mean that a function returns immediately to the caller after sending a request to the callee. At any time the caller can poll the status of its requests and fetch the replies whenever wanted. The execution of requests on the callee does not block the caller.

    Now, put this together: mysqli_multi_query(resource $link, string $queries). This runs on *one* connection. Thus no parallel execution on the MySQL Server. This function *blocks* until all results are available, it is a *synchronous* API.

    And now my suggestion: resource $res1 = mysqli_query(resource $link1, string $query, MYSQLI_ASYNC); resource $res2 = mysqli_query(resource $link1, string $query, ASYNC); . In this example you send two queries to MySQL using two connections. As one uses two connections, MySQL can process the queries in *parallel* (remember: not possible with one connection). In order to be able to send two queries over two distinct lines the funtion call must not block. It must be an *asynchronous* function which returns to PHP immediately (before query execution has finished). If it wouldn’t, you wouldn’t be able to send two queries over two distinct lines at the “same time” and everything would be as its today: synchronous and also not parallel .

    What you might have asked for is a convenience function. Your suggestion is “I suggest a simpler API: you give an array of SQLs to a function, mysql_paralell_query(), and it will only return if all queries have finished.”. This would require a function like: mysqli_parallel_query(array of resources $resources, array of strings $queries). You want it to by *synchronous*. You do not want to have the possibility to do something meaningful while you wait for MySQL.

    That’s ok, that’s fine. However, I’m against it. You can easily write a userland function that does this for you:

    function my_mysqli_parallel_query($resources, $queries, $timeout = 0) {
    $results = array();
    foreach ($resources as $k => $resource)
    $results[$k] = mysqli_query($resource, $queries[$k]);

    $invalid_handles = array();
    $results_left = count($results);
    do {
    $res = mysqli_poll($results, $timeout, $invalid_handles);
    if (!empty($invalid_handles)) {
    // one of the queries has failed
    return false;
    }
    if (is_resource($res)) {
    $results_left–;
    } else {
    // timeout
    return false;
    }
    } while ($results_left > 0);
    return $results;
    }

    As your desired functionality is a subset of what I suggest, I vote for the original proposal. Let’s keep the API small, ext/mysqli has already many, many functions…

    The userland function my_mysqli_query(), however, is an excellent example for discussing the ideas behind the proposal.

    To sum up:

    -1 on API change

    +1 on adding your question/idea as an example to the manual
    +1 on writing more and better blog entries on the ideas/principles

    (This is why I really *appreciate* comments, although I might not agree)

    Ulf

    PS: Forget about ext/mysql. Its old, old, old, old, old, old, old, old, old and old.

  10. Andris schreibt:

    I see your point. I can do something else, while the queries go on. OK. But still a convinience function would be nice. :-)

  11. Sander schreibt:

    I would like such a feature very much! I have been waiting for it for years now…

  12. Sara Golemon schreibt:

    It’s long seemed a short-coming to me that PHP (or any client) should have to wait for a DB server to do it’s indexing and other work to find the answer on another system leaving the client machine idle.

    In fact, I implemented my own custom mysql client (connection / authentication / packet decoding / etc…) in order to get this for a work project… Of course, my implementation was much simpler than this would need to be as I had a defined requirement and was able to leave out several of the more advanced features that a given client may want to use.

    I’ve been wanting to port some of this into PHP for awhile, but (as you said) too many other projects are constantly competing. Let me know if you want a hand on the design and/or implementation.

  13. Administrator schreibt:

    Sara,

    let me point Andrey to you. He’s the one who has the technical excellence to review the design. I’m sure he will love to hear how its done, then try to explain it to me and I in turn will try to explain it to all the other users.

    May I ask what other features around the mysql client connector/driver/library you implemented and/or considered to implement? This would be highly valuable input.

    Ulf

  14. Sara Golemon schreibt:

    Might be a wording misunderstanding, but I didn’t implement anything *aournd* the library. I wrote an alternate implementation. (e.g. libmysqlclient is nowhere in the stack, nor is any odbc connector or anything else stock). My stuff is a home-brew implementation based on the protocol specs.

    Things I didn’t implement:
    * Encryption/Compression - My servers are link-local to the clients, no need for either of these features.
    * 4.0/3.x server support - My backends speak 4.1 protocol, so that’s what I coded to.
    * Multi-query support - My use case is 99% single query selects (and the occasional insert/delete/update)
    * Unbuffered queries - As a quirk of how the I/O library this fits into works, I actually *do* buffer the result set, but it’s buffered in parallel with other requests (which may or may not be going to the same backend)
    * Other features got left out too, though I can’t recall which ones atm

    Of course, this makes the code I have completely useless to libmysqlclient (or ext/mysqlnd) directly as it’s got a narrow set of assumptions, but the basic design is… well… just what you described (plus the technical bits of state management that you kinda glossed over for simplicity).

    Are you guys on IRC at all? We should bump heads… I was in Freenode/#mysql-dev while I was developing my client, but since wandered off. I’ll hang out there again for awhile…

  15. Administrator schreibt:

    Sure we can meet on IRC. See also my email. Maybe we should announce a meeting or use a MySQL university session for brainstorming. I’m very much interested in any input around this…

  16. Steven Roussey schreibt:

    It would be great to dispatch a query and forget about it… logging comes to mind (not all logging uses delayed inserts). It would be great to dispatch a group of queries and not wait.

    Also, and I mentioned this like 10 years ago to Monty, it would be nice to be able to run a query to a live server and a test server and have both run simultaneously so you could load test a setup from a production system without actually caring if the test db died, crashed, or had other issues (which you could test for later).

  17. Johannes schreibt:

    > it would be nice to be able to run a query to a live server and a test
    > server and have both run simultaneously

    Steven, that’s exactly one of the things MySQL proxy does and imo that’s the perfect place for this functionality.