XmlPrime has "Join Optimizations" listed on its feature list. What exactly are these join optimizations and how are they used?
The join optimizations performed by XmlPrime attempt to reduce the algorithmic complexity of complex queries. In theory you do not need to do anything to take advantage of these join optimizations - if your query contains a join-like construct then it should be optimized to a join. However it is good to be aware of the optimizations that are performed so you can understand why a join has not been spotted, or what joins are possible in XmlPrime.
Tuples
Consider the following query:
for $x in 1 to 10
let $y := $x * 2
return $y
An
algebra (at least for our purposes) is a system for modelling a query.
For most of the compilation process we use an algebra in which the previous expression looks something like this: (note that I have made up this syntax for the purpose of explaining the algebra)
For($x)
Range
Constant(1)
Constant(10)
Let($y)
Multiply
VariableReference($x)
Constant(2)
VariableReference($y)
The for expression enumerates over its domain and for each item sets a value of
$x
and evaluates its return expression (in this case the let expression).
When we perform join optimizations we first transform the query to a different algebra (a
tuple algebra). In the tuple algebra, each FLWOR clause is an expression that returns a set of
tuples. A tuple assigns values to a set of variables. For example
for $x in 1 to 10
returns a sequence of 10 tuples. The first one sets the value of $x to 1, the second tuple sets the value of $x to 2 and so on. Our previous example looks like this (again the notation is made up).
Return
Concatenate
For($x)
Range
Constant(1)
Constant(10)
Let($y)
Multiply
VariableReference($x)
Constant(2)
VariableReference($y)
The Concatenate expression above takes each tuple from the input, and then uses it to generate a sequence of tuples from the right hand input. In this case the value of
$x
from each input tuple is used to generate a single tuple that defines
$y
. Each tuple returned from the second input is combined with the tuple from the left input, and the result is combined. In this case for each tuple that defines
$x
we generate a tuple that defines
$y
and return a single tuple that defines both
$x
and
$y
.
Note that this change of algebra is merely a different way of describing the query, so far this hardly affects how the query is run. In practice "returns a tuple" means "sets some variables" and "combining two tuples" means "setting some variables, and then setting some other variables".
In a future version of XmlPrime we intend to eliminate the first algebra entirely and perform all our optimizations with this join algebra. A number of the new expressions added in XQuery 1.1, eg. group by fit a lot more cleanly with the tuple algebra.
We can perform certain optimizations in this algebra that are not possible (or at least very difficult) in the basic algebra above. These optimizations are the join optimizations.
Products
The simplest join optimization performed by XmlPrime is forming
products. A product is constructed when two independent expressions are concatenated. The expressions are
independent if none of the variables defined in the first expression are used in the second expression.
Here is an example:
for $row in /rows/row
for $column in /columns/column
return
<intersection row="{$row}" column="{$column}" />
In tuple algebra, this gets converted to something like the following:
Return
Product
For($row)
Step(child::row)
Step(child::rows)
Root
For($column)
Step(child::column)
Step(child::columns)
Root
Element(intersection)
Sequence
Attribute(row)
VariableReference($row)
Attribute(column)
VariableReference($column)
Below is an excerpt from the query plan, showing this product:
product
for $row in
step
$fs:dot_0/child::rows
child::row
with
for $column in
step
$fs:dot_0/child::columns
child::column
The original query plan looks like the columns would be navigated once for each row retrieved from the document. The product prevents this. When the product is evaluated it stores a list of all the tuples from the right hand side, and then each tuple in the left gets combined with each tuple from the right, without reevaluating the inner for loop. In practical terms, all the assignments to
$y
are recorded and replayed for each value of
$x
.
For this case the result can be mimicked by storing the definition of $y in a variable, but in the general case the product may not be storing just the values of a single for clause, but a whole set of FLWOR clauses.
Joins
A join is formed when a where clause depends on both sides of a product.
Here is an example query with a join:
for $author in doc("bookshelf.xml")/book/author
for $book in doc("bookshop.xml")/book
where $book/author = $author
return $book
Note that the where expression applies a predicate to a sequence of tuples returning only those which match. Since the two for expressions are independent they form a product, and since the where clause depends on both sides of the product, this is a join.
In this form joins are evaluated no differently to products with a where clause. This is known as a
nested loop join. The predicate is applied to the product of the tuples returned from the two for expressions. This gives quadratic performance.
In the query plan a join looks something like the following:
hash join
for $author in doc("bookshelf.xml")/book/author
to
for $book in doc("bookshop.xml")/book
on
$book/author = $author
Hash Join
If he predicate of a join is an equality operator the join then gets converted to a
hash join. The expressions on either side of the equality operator are known as the
keys. When the right (second) list of tuples is constructed, the right key is evaluated for each tuple. The tuples are stored in a Dictionary mapping from a given key value to a list of tuples that match this key (this dictionary is known as the
join index. Once this dictionary is built up then for each left tuple (from the first input), the left key is evaluated. This key is used to look up the matching tuples from the right hand side and and the results are combined and returned. Assuming that only a few tuples from the right hand side match each time then this reduces the complexity from
O(n2) to
O(n) (assuming constant overhead for a dictionary lookup).
There were many complications in implementing such a join, especially in the general case when the arguments have type xs:anyAtomicType*. If two types are incomparable, a type error should be raised. xs:untypedAtomic values need to be converted to the base type of every type that appears in the other key, and used for comparison. numeric types should be compared as if they had been promoted to the lowest base type. Numeric NaN values never match. If keys are sequences they may match more than one key in the index, but all the matching tuples must be returned once, and in the order they were encountered. Because of all these potential issues, there are several implementations of the hash join in XmlPrime which tackle different combinations of these issues. All the implementations have the same
O(n) basic performance characteristics.
Left Outer Joins
A left outer join is similar to a join, but the results are grouped for each input argument. Here is an example of a query with a left outer join:
for $user in doc("users.xml")/user
return
<user name="{$user/@name}">
{
for $comment in doc("comments.xml")/comment
where $comment/@user = $user/@name
return $comment
}
</user>
When converting this query into the join algebra, XmlPrime extracts each seperate FLWOR expression and puts it into a new let clause (with a new variable; in this case we will call it $comments), in order to group all FLWOR clauses together:
for $user in doc("users.xml")/user
let $comments := for $comment in doc("comments.xml")/comment
where $comment/@user = $user/@name
return $comment
return
<user>{$comments}</user>
The following pattern is then spotted as a left outer join.
[ClausesA]
let $a := [ClausesB]
where [Predicate]
return [AggregateExpr]
The above pattern is only a Left outer join if
ClausesA
and
ClausesB
are independent, and if
Predicate
depends
on both
ClausesA
and
ClausesB
. Note that in order to reach this condition some (let) clauses may need to be moved from
ClausesB
to
ClausesA
and some clauses may need to be moved from
ClausesA
and concatenated to the entire expression.
The left outer join is evaluated like a standard join, except for each tuple stored from the right hand side we also compute the aggregate expression (or if the join is evaluated lazily, the variables required to compute the aggregate expression) and then for each left hand tuple we find all the tuples that match it. We get (or compute) the aggregate value for these tuples, concatenate (this is normal sequence concatenation, not tuple concatenation) the aggregate values and then store the result in
$a
.
As with the standard join, both nested-loop and hash variants are supported. Again, in the case of the hash join, assuming only a few tuples match in each case, this reduces an
O(n2) operation to
O(n).
In the query plan a left outer hash join looks something like the following:
left outer hash join
for $user in doc("users.xml")/user
to
for $comment in doc("comments.xml")/comment
on
$comment/@user = $user/@name
aggregate
$comment
as $comments
Group-by
One particularly common form of the left outer join is the group-by. A group by clause is where you want to group all items in a sequence by a particular key. Here is an example which groups a list of books by their author.
let $books := doc("bookshelf.xml")/books/book
for $author in distinct-values($books/@author)
return
<author name="{$author}">{
$books[@author=$author]
}</author>
Using the join optimizations so far described this will optimize to a left outer join (and inventing a couple of variables,
$book
and
$matching-books
):
let $books := doc("bookshelf.xml")/books/book
left outer hash join
for $author in distinct-values($books/@author)
to
for $book in $books
on
$book/@author = $author
aggregate
$book
as
$matching-books
return
<author name="{$author}">{$matching-books}</author>
The group by is formed when you have a left outer join with the following form:
left outer hash join
for $x in distinct-values(keys($sequence))
to
for $y in $sequence
on
$x = key($y)
aggregate
$y
as
$group
In the above plan
keys
and
key
are general expressions depending on the variables shown. A left outer join is formed if applying keys to a sequence is the same as applying key to each item in the sequence and concatenating the result. Also with the current implementation key must return zero or one items.
Determining the condition above (applying keys to a sequence is the same as applying key to each item in the sequence) in general is not that easy, and our implementation is not yet perfect, although this improves with every release.
Note also that in practice the left outer join is liable to have many more clauses (in particular common subexpression elimination introduces a lot of let clauses), and these may have to be moved out of the join in order to get it into this form.
If XmlPrime does manage to manipulate the query to be in this form, and all the conditions are satisfied, then the query plan now looks like this:
for $book in doc("bookshelf.xml")/books/book
group by
$book/@author
aggregate
$book
as
$matching-books
return
<author name="{$author}">{$matching-books}</author>
The group by clause combines the join and the distinct values operation. It builds a join index, storing the aggregate values as it goes. Iterating over this join index then provides the values for the aggregate variable. Over the left outer join, this group by reduces the amount of memory used, and ensures that the input expression is evaluated only once. In the best case this can reduce the memory usage by more than half. This is because in the left outer join, the input expression is used twice and so all its values are stored in a variable. Also an additional hash table is built up to perform the distinct-values operation.
Conclusion
Hopefully this post has shed some light on the various join optimizations that are performed in XmlPrime. By examining the query plan of a query you can see which join optimizations have been applied.
Most of the improvements to join optimizations come from rewriting queries to make them closer to the forms above. If you have a query which you believe should contain one of the above optimazations, but does not then please let us know as we are keen to spot as many different ways of writing a join as possible.