I love the new NoSQL systems: more choices! After years of RDBMS dominance there are hundrets of NoSQL systems offering a wide range of data models, data distribution strategies and interfaces. Polyglot persistence describes the market change. I am most fascinated by document stores: nested data and data distribution go hand-in-hand. Nested data, finally. And, for those who like it: schemaless or even schemafree. Maybe something to learn for MySQL? But their search capabilities… A word or two on SQL (SELECT … FROM … WHERE – SFW) and nested data.
Learn from NoSQL document stores
The classical relational data model requests all data to be in first normal form (1NF):
A relation is in first normal form if the domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain.
Simply put: only scalar values are in the columns of table allowed. Take the popular example of a web application implementing a blog system. Let a blog article consist of an id, a title, the blog content and a list of comments with each of it consisting of an id, an author and a message. Using the classic models two tables are needed: blogposts(id, title, content)
and comments(id, author, message, fk_blogpost_id)
. Because only scalar values can be stored in a table, the list of comments has to go in a second table. A foreign key links the two tables.
blogposts | ||
---|---|---|
id | title | content |
1982 | Searching document stores… | I love the new NoSQL systems: more choices! … |
comments | |||
---|---|---|---|
id | author | message | fk_blogpost_id |
1 | Andrey | Hi Ulf, … | 1982 |
2 | Johannes | Ahoi Ulf, … | 1982 |
A NoSQL document store allows using non-scalar values. Thus, comments belonging to a blogpost can be saved in a nested fashion: blogpost(id, title, content, comments(id, author, message))
. Here’s a citation from 1986 transferring it into the world of relational databases:
…, an NF2 relation is a set of (equally structured) tuples (or records), the fields (attributes) of which contain either atomic values (e.g. numbers, strings) or relations. The latter may be flat or NF2 relations in turn.
As will be shown, the non-first normal form (NF2 or N1NF) data model is not as academic as it may seem on the first look.
blogposts | |||||
---|---|---|---|---|---|
id | title | content | comments | ||
id | author | message | |||
1982 | Searching document stores… | I love the new NoSQL systems: more choices! … | 1 | Andrey | Hi Ulf, … |
2 | Johannes | Ahoi Ulf, … |
Naturally, the 1980s usage examples are not about blog software. Back then, the examples have been around CAD/CAM, forms, scientific and engineering data. The NF2 data model and a modern document store share some advantages:
- Semantically related data is not split into different tables.
- On disk: potentially less joins and disk seeks, thus quicker search. Related: Akiban.
- Distribution (e.g. sharding): potentially less cross-shard queries as basic queries may not spawn multiple nodes. Related: MySQL Cluster condition push-down, …
- Hierarchical structures often found in reports (master-detail view) are matched.
- 1:n relationships are easily modeled
[e]NF2 and JSON stores similarities
Todays NoSQL document stores go beyond yesterdays NF2 data model. Often, JSON respectively BSON is used as a data serialization format. This is a natural choice for a web database. JSON and JavaScript go perfectly together. JavaScript gets more and more popular, for example, to create portable applications for mobile devices. But also other languages support the lightweight data encoding format.
JSON supports scalar and non-scalar data types. The scalar types are: string, number, bool and null. The non-scalar types are array and object. With the exception of an array, counterparts can be found in classic NF2 models, as sketched above. Extended NF2 proposals exist that cover all JSON types. SQL:1999 introduces ARRAY
(as of SQL:2003 multi-dimensional) as well as ROW
. SQL:2003 adds MULTISET
. The similarities between a JSON based document store and SQL based [e]NF2 remain obvious, which is important when getting deeper into algebra (not for me 🙂 !), query languages and such.
Data types | |
---|---|
JSON | SQL |
Scalar / atomic | |
string | SQL:92 – CHARACTER, CHARACTER VARYING, SQL:1999 – CLOB |
number | SQL:92 – NUMERIC, DECIMAL, INTEGER, SMALLINT, FLOAT, REAL, DOUBLE PRECISION |
true, false | SQL:1999 – BOOLEAN (TRUE, FALSE, UNKNOWN) |
null | SQL:86(?) – NULL |
Non-scalar / composite / collection | |
object | SQL:1999 – ROW |
array | SQL:1999 – ARRAY (ordered), SQL:2003 – MULTISET (unordered), see notes below! |
Below is a concrete example for the blog post application data modelling using a JSON document and a SQL:2003 style table. Once you look beyond MySQL, you can find relational databases implementing these or similar features. For example, Oracle has a variable-length VARRAY type, Postgres offers variable-length multidimensional ARRAY at least since 7.1, DB/2 uses…, Informix has…
JSON |
---|
|
SQL:>=2003 |
|
Please note, I have choosen an example which gives the impression that there is a direct mapping between a SQL:2003 table and arbitrary JSON documents. This is not the case. The SQL:2003 collection data types ARRAY
and MULTISET
are typed. Imagine, some comments had a created TIME
field, others had not. Then, two distinct ROW
declarations would be required. For the rest of the post, I’ll ignore the glory schema question.
SQL:1999 "the type monster" versus schemaless
The major difference between a JSON store and a SQL:1999+ relational database is on the role and importance of a schema. Be it a table or a composite type everything in a relational database follows a previously defined schema. The more you walk along the SQL:1999, SQL:2003, SQL:2008 road, the complexer and stricter types (and schemata) get: object orientation, XML, … Not so in many NoSQL document stores: each and every document can have a different schema, schemas are not declared upfront. Simple thing: simple to learn, simple to use yet powerful.
Martin Fowler has published a great presentation discussing schemaless data structures last week. It’s as good as his book NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence (strong buy for beginners).
Like the schemaless nature of a JSON document store or not. Fact is, JSON based document stores have gained significant popularity and market share. Fact is, if your queries match the sweet spot of the data model (queries do not spawn multiple documents), you need sharding, and the system does not mess to badly with schemaless disadvantages (e.g. storage requirements) you get a winning combination.
It is so appealing! BUT the query language…
Speaking of JSON, documents and sharding what could be more logical than doing key lookups (shard key, distributed systems, ….) or using JavaScript expressions to search the data store. As a web developer you do know JavaScript inside-out anyway. Possibly, the search is enriched with MapReduce (sharding, distributed systems… ).
Key lookups are unquestionable no rich query language. Whether the JavaScript expressions available to are rich and powerful, has to be checked on a case by case basis. For example, I have my doubts (english version, german version) about the MongoDB query language. MapReduce is certainly great for distributed systems but only for those. Olaf Bachman, who is managing Google’s European ads abuse engineering efforts, describes in his talk NoNoSQL@Google how his team ended up building SQL-like query capabilities for the various NoSQL solutions used.
One does not have to be a computer specialist to understand SFW basics- Read: sales and marketing can do basic analysis themselves without bothering you, the developer. Virtually ever developers has learned SQL at some point, thus low training costs for you, the manager.
SQL is a declarative language. You say what data you want and not how to access it. The data store will find an access path, an optimizer component can speed it up. Whereas with MapReduce or, for example, a JavaScript based query language, the developer has to implement an access path. Listen to NoNoSQL@Google: are you an expert in developing distributed [MapReduce] programs?
The roots of SQL for [e]NF2
Given the link between a simple JSON centric data model – [e]NF and the unquestioned advantages of SQL, it is time to do a time ride back to the 1980s but skip 1977. We are on the hunt for SQL to query [e]NF2. Let the joy ride begin with the paper Die NF2-Relationenalgebra zur einheitlichen Manipulation externer, konzeptueller und interne Datenstrukturen by Hans-Jörg Schek and Marc H. Scholl published in 1983. The discussion of a relational algebra was quickly followed by prototype systems and query language proposals, for example:
- M.A. Roth, H.F. Korth, and D.S. Batory.
SQL/NF: A Query Language for N1NF Relational Databases, 1985 - P.Pistor, R.Traunmuller.
A Database Language for Sets, Lists and Tables, 1985
Unrelated: MySQLers remember developer meeting in Heidelberg? This became HDBL – Heidelberg Database Language. No kidding. - P.Pistor, F. Andersen.
Designing a Generalized NF2 model with an SQL-type language interface, 1986 - Per-Ã…ke Larson.
The Data Model and Query Language of LauRel, 1988
After being a hot topic in the mid 1980s, research focus seems to have shifted towards object oriented data bases.
Most NF2 algebras propose NEST and UNNEST operations. Most SQL-style language proposals seek to improve database language orthogonality, see also Some principles of good language design: with especial reference to the design of database languages by C.J. Date (1984). Traces of both can be found in SQL:2003.
Nested relations, nested expressions…
The SQL/NF paper notes that a NF2 data structure allows a relation wherever a scalar value can occur in a 1NF data structure (orthogonality) [1]. They continue that SQL has a closure property where the result of any query on one or more relations is itself a relation [2].
SELECT attribute-list | |
FROM relation-list | |
WHERE predicate |
Thus, the language shall allow using a SFW-expression in the attribute-list (because of [1]) and in the relation-list (because of [2]) of a SFW-expression.
blogposts | |||||
---|---|---|---|---|---|
id | title | content | comments_allowed | comments | |
author | message | ||||
1982 | Searching document stores… | I love the new NoSQL systems: more choices! … | true | Andrey | Hi Ulf, … |
Johannes | Ahoi Ulf, … | ||||
2013 | Solution for… | MySQL 5.6 introduces… | true | ||
2012 | … and a Happy New Year! | Picture taken… | false |
Given the above nested blogpost table, the proposal is that we can filter the nested comments using nested SFW:
SELECT title, (SELECT * FROM comments WHERE author = 'Johannes') FROM blogposts
The above query would return all blogposts including those without any comments but show only comments from Johannes, if any. Andreys’ comment on "Searching document stores…" would be excluded. To show only Johannes comments’ and retrieve only those blogposts commented by Johannes one adds a predicate:
SELECT title, (SELECT * FROM comments WHERE author = 'Johannes') FROM blogposts WHERE (SELECT * FROM comments WHERE author = 'Johannes')
The result of this query would be:
title | comments | |
---|---|---|
author | message | |
Searching document stores… | Johannes | Ahoi Ulf, … |
The userfriendliness of the query can be improved removing the need to repeat the nested SFW-expression. The various papers from back then use different shortcuts. Let’s do a modern mash-up using SQL:1999 WITH
style:
WITH author_search AS (SELECT * FROM comments WHERE author = 'Johannes') SELECT title, author_search FROM blogposts WHERE author_search
BTW, an Oracle colleguage (not from MySQL) noted in the current issue of the german computer magazine iX that MySQL lacks WITH
…
I am aware of jumping around between old and new syntax elements. However, I am not after proposing a concrete language – I better leave that to the experts. I try to take a users view on what could possibly be done. The early proposals appeal to me… One more query to conclude the paragraph with SQL:2003 style syntax elements: show all blog posts with together with the first comment, if any.
SELECT title, comments[0] FROM blogposts
SQL:2003 allows accessing ARRAY
elements using an offset, like in many other programming languages.
Nesting and unnesting
To transform nested into flat relations and vice versa NEST and UNNEST operations are required. Formulating them explicitly and introducing new operators might contribute to language clearity. (Please note, that below example uses PNF.)
|
|||||||||||||||||||||||||||||
v NEST / UNNEST ^ | |||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Unnesting is also useful for use of nested relations with function that expect scalar values. What should SUM((price, description), (price, description))
return? Recall that we can have non-scalar wherever scalar values used to be.
SQL:2003 ?
To be honest, I have not written a single SQL query of higher complexity than SELECT 1 FROM DUAL
for years. SELECT 'Hello world'
was I needed for testing PHP stuff. Be gentle with my lame examples inspired by what I found in my bookshelf. My bookshelf has a copy of SQL 1999 und SQL 2003: Objektrelationales SQL, SQLJ und SQL/XML by Can Türker. (Aside and BTW: Can Türker is an advisor of ETZ Zurich Database Research Group together with H.J. Schek.)
For all SQL:2003 examples, I’ll use a slightly modified table definition than shown above. I do not use ARRAY
to match JSON syntax but MULTISET.
CREATE TABLE blogposts ( id INTEGER, title VARCHAR(100), content CHARACTER LARGE OBJECT, show_comments BOOLEAN, comments ROW( author VARCHAR(100), message CHARACTER LARGE OBJECT ) MULTISET );
To get started we are looking for all blogposts commented by Johannes. For each one found, we want all the comments, including Andreys’ comments.
SELECT * FROM blogposts WHERE EXISTS( SELECT * FROM UNNEST(comments) c WHERE c.author = 'Johannes');
Find all blog posts titles together with all comments but show only comments from Johannes, if any:
SELECT title, MULTISET( SELECT c FROM UNNEST(comments) c WHERE c.author = 'Johannes' ) FROM blogposts
Compared to the early SQL/NF style examples the syntax is quite verbose because a type constructor (MULTISET
) us used with the inner query. If I’m not mistaken, such explicit type constructors can also be found in HDBL from 1985/1986. But instead of having to type the eight characters of MULTISET
the constructor names are shorter like <>
, {}
…
SQL:2003 TABLE
constructor
… speaking of constructors. A new TABLE
constructor has been added to SQL:2003 which will become handy in a second. The constructor makes it possible to use a functions return value like a table. For example, there could be a function random_greetings
returning (rnd, greeting)
tuples. By help of TABLE
it can appear in the relation-list of an SFW-expression:
SELECT rnd FROM TABLE(random_greetings(100))
Back to JSON
Time to recap, time to attempt to crack the nut.
More than 30 years ago, the SQL community has looked into breaking out of the limitations of "flat" tables (first normal form). Algebras for "nested" tables (non-first normal form) have been developed, followed by SQL language extension proposals. Some 10 years ago, SQL:2003 has incorporated all or some of the ideas – please, find an expert to elaborate on this.
SQL tables can hold nested data, JSON documents can hold nested. SQL types almost match JSON types. The troublemaker here is a JSON array. A JSON array is a sorted list of values of arbitrary types (types can vary within one array). A SQL array is a sorted list of values of one type.
JSON | SQL |
|
|
---|---|
|
|
Likely, one would have to introduce a new sorted collection type to overcome this gap. Or, would we rather need a loosely typed scalar? Whatever, it should be possible to use (almost) standard SQL:2003 tables to store JSON documents in a "native" way. JSON would be nothing but a fancy data serialization format. Such a nested table could equally well be used to store and output data as XML.
Result: MySQL as a schemafull document store – with a proper query language.
Finally, the schemaless issue
The remaining nut to crack is schemaless. Like it or not, people use it. Tell them it has drawbacks, they still use it and want it. A question to the experts: is it a non-issue for an SFW-expression?
SFW can deal any nested table. Any collection of (schemaless) documents can be mapped into a list of (schemafull) tables. In the worst case the collection is mapped into as many different tables as there are documents. However, SFW can deal with any table.
It is not possible to create a table that can all possible documents of a collection. But, we can create a table and dump all JSON documents in a BLOB
. The SQL:2003 TABLE
constructor can be used to create tables, on which we can perform queries, based on a functions return value. Is this possible:
SELECT id, title FROM TABLE(json(<blob_column>))
Result, if possible: MySQL as a schemaless document store – with a proper query language. I would definetly prefer that over any UDF based approach such as:
SELECT CAST(JSON_DECODE("id", <blob_column>), INT) AS id, JSON_DECODE("title", <blob_column>) AS title FROM <blob_column>
The sketched NF2/SQL:2003 road would allow storing nested data in a schemafull and possibly also schemaless way. Query capabilities would be great. Likely, the BLOB
hack is somewhat hidden from the user.
Happy hacking!
Pingback: Open Sourcing Akiban | The Akiban Blog - Database Flexibility without Compromise
Pingback: MySQL 5.7: SQL functions for JSON overview | Ulf WendelUlf Wendel
Pingback: MySQL Fabric's initial decision on data models and transactions put into perspective | Ulf WendelUlf Wendel