Showing posts with label optimization. Show all posts
Showing posts with label optimization. Show all posts

Tuesday, 25 May 2010

Join optimizations in XmlPrime

XmlPrime has "Join Optimizations" listed on its feature list. What exactly are these join optimizations and how are they used?

The join optimizations performed by XmlPrime attempt to reduce the algorithmic complexity of complex queries. In theory you do not need to do anything to take advantage of these join optimizations - if your query contains a join-like construct then it should be optimized to a join. However it is good to be aware of the optimizations that are performed so you can understand why a join has not been spotted, or what joins are possible in XmlPrime.

Tuples


Consider the following query:

for $x in 1 to 10
let $y := $x * 2
return $y

An algebra (at least for our purposes) is a system for modelling a query.
For most of the compilation process we use an algebra in which the previous expression looks something like this: (note that I have made up this syntax for the purpose of explaining the algebra)

For($x)
Range
Constant(1)
Constant(10)
Let($y)
Multiply
VariableReference($x)
Constant(2)
VariableReference($y)

The for expression enumerates over its domain and for each item sets a value of $x and evaluates its return expression (in this case the let expression).

When we perform join optimizations we first transform the query to a different algebra (a tuple algebra). In the tuple algebra, each FLWOR clause is an expression that returns a set of tuples. A tuple assigns values to a set of variables. For example for $x in 1 to 10 returns a sequence of 10 tuples. The first one sets the value of $x to 1, the second tuple sets the value of $x to 2 and so on. Our previous example looks like this (again the notation is made up).

Return
Concatenate
For($x)
Range
Constant(1)
Constant(10)
Let($y)
Multiply
VariableReference($x)
Constant(2)
VariableReference($y)

The Concatenate expression above takes each tuple from the input, and then uses it to generate a sequence of tuples from the right hand input. In this case the value of $x from each input tuple is used to generate a single tuple that defines $y. Each tuple returned from the second input is combined with the tuple from the left input, and the result is combined. In this case for each tuple that defines $x we generate a tuple that defines $y and return a single tuple that defines both $x and $y.

Note that this change of algebra is merely a different way of describing the query, so far this hardly affects how the query is run. In practice "returns a tuple" means "sets some variables" and "combining two tuples" means "setting some variables, and then setting some other variables".

In a future version of XmlPrime we intend to eliminate the first algebra entirely and perform all our optimizations with this join algebra. A number of the new expressions added in XQuery 1.1, eg. group by fit a lot more cleanly with the tuple algebra.

We can perform certain optimizations in this algebra that are not possible (or at least very difficult) in the basic algebra above. These optimizations are the join optimizations.

Products


The simplest join optimization performed by XmlPrime is forming products. A product is constructed when two independent expressions are concatenated. The expressions are independent if none of the variables defined in the first expression are used in the second expression.

Here is an example:

for $row in /rows/row
for $column in /columns/column
return
<intersection row="{$row}" column="{$column}" />

In tuple algebra, this gets converted to something like the following:

Return
Product
For($row)
Step(child::row)
Step(child::rows)
Root
For($column)
Step(child::column)
Step(child::columns)
Root
Element(intersection)
Sequence
Attribute(row)
VariableReference($row)
Attribute(column)
VariableReference($column)

Below is an excerpt from the query plan, showing this product:

product
for $row in
step
$fs:dot_0/child::rows
child::row
with
for $column in
step
$fs:dot_0/child::columns
child::column

The original query plan looks like the columns would be navigated once for each row retrieved from the document. The product prevents this. When the product is evaluated it stores a list of all the tuples from the right hand side, and then each tuple in the left gets combined with each tuple from the right, without reevaluating the inner for loop. In practical terms, all the assignments to $y are recorded and replayed for each value of $x.

For this case the result can be mimicked by storing the definition of $y in a variable, but in the general case the product may not be storing just the values of a single for clause, but a whole set of FLWOR clauses.

Joins


A join is formed when a where clause depends on both sides of a product.

Here is an example query with a join:

for $author in doc("bookshelf.xml")/book/author
for $book in doc("bookshop.xml")/book
where $book/author = $author
return $book

Note that the where expression applies a predicate to a sequence of tuples returning only those which match. Since the two for expressions are independent they form a product, and since the where clause depends on both sides of the product, this is a join.

In this form joins are evaluated no differently to products with a where clause. This is known as a nested loop join. The predicate is applied to the product of the tuples returned from the two for expressions. This gives quadratic performance.


In the query plan a join looks something like the following:

hash join
for $author in doc("bookshelf.xml")/book/author
to
for $book in doc("bookshop.xml")/book
on
$book/author = $author


Hash Join


If he predicate of a join is an equality operator the join then gets converted to a hash join. The expressions on either side of the equality operator are known as the keys. When the right (second) list of tuples is constructed, the right key is evaluated for each tuple. The tuples are stored in a Dictionary mapping from a given key value to a list of tuples that match this key (this dictionary is known as the join index. Once this dictionary is built up then for each left tuple (from the first input), the left key is evaluated. This key is used to look up the matching tuples from the right hand side and and the results are combined and returned. Assuming that only a few tuples from the right hand side match each time then this reduces the complexity from O(n2) to O(n) (assuming constant overhead for a dictionary lookup).

There were many complications in implementing such a join, especially in the general case when the arguments have type xs:anyAtomicType*. If two types are incomparable, a type error should be raised. xs:untypedAtomic values need to be converted to the base type of every type that appears in the other key, and used for comparison. numeric types should be compared as if they had been promoted to the lowest base type. Numeric NaN values never match. If keys are sequences they may match more than one key in the index, but all the matching tuples must be returned once, and in the order they were encountered. Because of all these potential issues, there are several implementations of the hash join in XmlPrime which tackle different combinations of these issues. All the implementations have the same O(n) basic performance characteristics.

Left Outer Joins


A left outer join is similar to a join, but the results are grouped for each input argument. Here is an example of a query with a left outer join:

for $user in doc("users.xml")/user
return
<user name="{$user/@name}">
{
for $comment in doc("comments.xml")/comment
where $comment/@user = $user/@name
return $comment
}
</user>

When converting this query into the join algebra, XmlPrime extracts each seperate FLWOR expression and puts it into a new let clause (with a new variable; in this case we will call it $comments), in order to group all FLWOR clauses together:

for $user in doc("users.xml")/user
let $comments := for $comment in doc("comments.xml")/comment
where $comment/@user = $user/@name
return $comment
return
<user>{$comments}</user>

The following pattern is then spotted as a left outer join.

[ClausesA]
let $a := [ClausesB]
where [Predicate]
return [AggregateExpr]

The above pattern is only a Left outer join if ClausesA and ClausesB are independent, and if Predicate depends
on both ClausesA and ClausesB. Note that in order to reach this condition some (let) clauses may need to be moved from ClausesB to ClausesA and some clauses may need to be moved from ClausesA and concatenated to the entire expression.

The left outer join is evaluated like a standard join, except for each tuple stored from the right hand side we also compute the aggregate expression (or if the join is evaluated lazily, the variables required to compute the aggregate expression) and then for each left hand tuple we find all the tuples that match it. We get (or compute) the aggregate value for these tuples, concatenate (this is normal sequence concatenation, not tuple concatenation) the aggregate values and then store the result in $a.

As with the standard join, both nested-loop and hash variants are supported. Again, in the case of the hash join, assuming only a few tuples match in each case, this reduces an O(n2) operation to O(n).

In the query plan a left outer hash join looks something like the following:

left outer hash join
for $user in doc("users.xml")/user
to
for $comment in doc("comments.xml")/comment
on
$comment/@user = $user/@name
aggregate
$comment
as $comments


Group-by


One particularly common form of the left outer join is the group-by. A group by clause is where you want to group all items in a sequence by a particular key. Here is an example which groups a list of books by their author.

let $books := doc("bookshelf.xml")/books/book
for $author in distinct-values($books/@author)
return
<author name="{$author}">{
$books[@author=$author]
}</author>

Using the join optimizations so far described this will optimize to a left outer join (and inventing a couple of variables, $book and $matching-books):

let $books := doc("bookshelf.xml")/books/book
left outer hash join
for $author in distinct-values($books/@author)
to
for $book in $books
on
$book/@author = $author
aggregate
$book
as
$matching-books
return
<author name="{$author}">{$matching-books}</author>

The group by is formed when you have a left outer join with the following form:

left outer hash join
for $x in distinct-values(keys($sequence))
to
for $y in $sequence
on
$x = key($y)
aggregate
$y
as
$group

In the above plan keys and key are general expressions depending on the variables shown. A left outer join is formed if applying keys to a sequence is the same as applying key to each item in the sequence and concatenating the result. Also with the current implementation key must return zero or one items.

Determining the condition above (applying keys to a sequence is the same as applying key to each item in the sequence) in general is not that easy, and our implementation is not yet perfect, although this improves with every release.

Note also that in practice the left outer join is liable to have many more clauses (in particular common subexpression elimination introduces a lot of let clauses), and these may have to be moved out of the join in order to get it into this form.

If XmlPrime does manage to manipulate the query to be in this form, and all the conditions are satisfied, then the query plan now looks like this:

for $book in doc("bookshelf.xml")/books/book
group by
$book/@author
aggregate
$book
as
$matching-books
return
<author name="{$author}">{$matching-books}</author>

The group by clause combines the join and the distinct values operation. It builds a join index, storing the aggregate values as it goes. Iterating over this join index then provides the values for the aggregate variable. Over the left outer join, this group by reduces the amount of memory used, and ensures that the input expression is evaluated only once. In the best case this can reduce the memory usage by more than half. This is because in the left outer join, the input expression is used twice and so all its values are stored in a variable. Also an additional hash table is built up to perform the distinct-values operation.

Conclusion


Hopefully this post has shed some light on the various join optimizations that are performed in XmlPrime. By examining the query plan of a query you can see which join optimizations have been applied.

Most of the improvements to join optimizations come from rewriting queries to make them closer to the forms above. If you have a query which you believe should contain one of the above optimazations, but does not then please let us know as we are keen to spot as many different ways of writing a join as possible.

Friday, 18 September 2009

Unboxed evaluation

This is the third in a series of posts about XmlPrime 1.0 optimizations and performance.

In the previous post I outlined the compilation process used by XmlPrime. I did not however explain how the resultant syntax tree is evaluated.

In XmlPrime 0.9 every expression in the tree had several evaluate methods. Each of these methods took a dynamic context which contains everything needed to evaluate an expression, for example the values of the in-scope variables. These methods returned the result of evaluating the expression. The results returned were all of type XdmItem, and so an XdmItem instance had to be constructed for each item returned from an expression.

Boxing is a term that refers to wrapping a value or object in another object in order to pass it to something else. In .NET, boxing usually refers to wrapping a value type to expose it as a reference type, and happens implicitly in most .NET languages, for example when passing a double to a function that takes an object as an argument.

In XmlPrime 0.9 the values returned were boxed as XdmItems in order to return them, and needed to be unboxed when their contents was examined. In XmlPrime 1.0 we have managed to eliminate this boxing and unboxing in almost all cases at runtime.

XmlPrime already had some rudimentary unboxing for XPathNavigator values, and so the improvements are most noticeable on queries that use a lot of atomic values, for example a lot of arithmetic expressions. Take the following query for example, which finds the sum of the first 10 square numbers.

sum(for $x in 1 to 10 return $x * $x)

In XmlPrime 0.9 each number in the input sequence (1 to 10) is boxed. This item is then placed on the stack. Each reference to $x retrieved the variable from the stack. The multiplication operator took two items as arguments and unboxed them to obtain the values to multiply. The values were multiplied and then boxed again and returned. Finally all these returned values were unboxed, their values summed and the result boxed to return to the outside environment.

This is a total of 21 boxing and 30 unboxing operations for such a simple query.

In XmlPrime 1.0 the only place this boxing occurs is to return the final result from the query. As you can imagine the savings quickly add up.

Aside: Entries on the stack can be sequences of items and can be values or closures (expressions which are evaluated when they are first used). For this reason another kind of boxing still needs to take place when placing items on or retrieving items from the stack. Changing this to avoid first boxing all items that are to be placed on the stack proved suprisingly difficult. More information on how we managed this will be given in a future post.

The implementation details


In XmlPrime 1.0 we changed the evaluation processl; this is probably the biggest change since 0.9. The expressions themselves are no longer responsible for evaluation. Instead another compilation pass occurs (which we refer to as code generation) which transforms the syntax tree into an evaluation tree.

Optimized syntax treeCode generation
Evaluation tree


This code generation phase does not generate code in the traditional sense. Indeed this is something we would like to do in the future, in the vein of XslCompiledTransform, but this is not always practical. For now the evaluation tree consists of pre-written evaluators which consist of methods to evaluate the expression.

In XmlPrime 0.9 the base class for all items was called XdmItem, and all values passed around were wrapped in objects derived from XdmItem. All nodes in XmlPrime were represented as XPathNavigator instances, which had to be wrapped in NavigatorValue (which derives from XdmItem) instances in order to pass them around. The expressions had methods to evaluate themselves as lazy sequences (whose values are returned by an IEnumerator instance), or as single or optional XdmItem instances (which avoids the overhead of an enumerator.

Historically the cost of constructing NavigatorValues (which are effectively XPathNavigators boxed as XdmItems) showed up as a significant performance bottleneck, and so we added alternative methods to evaluate every expression that returned XPathNavigator values instead.

In XmlPrime 1.0 we alleviated the problem somewhat by renaming XdmItem to XPathAtomicValue, and deriving this from XPathItem (which is the base class for XPathNavigator). Since XPathItem is now used as the base class of our hierarchy, it means that XPathNavigator instances never need to be boxed.

The performance boost that was gained from unboxing XPathNavigators in XmlPrime 0.9 was huge, so we wanted to bring this performance enhancement to all items, and not just nodes. XmlPrime uses over 20 different types to represent values, so adding all these evaluation methods (remember that expressions can be evaluated to return a single item, zero or one items, or a sequence of items) would have been impractical. If we added any more methods of evaluation (for example eager evaluation) then this would also be fairly difficult.

The evaluation tree consists of objects implementing IEvaluator<T> which provides methods for evaluating a single item, zero or one items, or a lazy sequence of items. The interface is generic in the type of the values returned and so can be used for unboxed evaluation of most expressions. A lot of the evaluator implementations themselves are implemented as generic classes (for example the evaluator for fn:subsequence), which means that unboxed evaluation is used in almost all cases where the static type of an expression allows it.

Another advantage to seperating out the evaluation tree from the syntax tree is that it has lowered the bar for specialized versions of expressions, as now only a new evaluator class needs to be written, rather than an entire (rather complicated) expression class. When adding a new expression class we also had to ensure that it would not prevent any other optimizations. Adding a new evaluator has none of these difficulties. This has meant that we have been able to add more specialized versions of expressions to be used at run time (for example a typeswitch expression for when all the types being tested against are subtypes of node()?) without harming other optimizations.

Tuesday, 15 September 2009

The compilation process

This is the second in a series of posts about performance in XmlPrime 1.0. The previous post in the series can be found here.

In order to better explain some of the optimizations we perform, I will first outline the query compilation process.

XQuery sourceParse
Syntax treeBind & Type Check
Compiled syntax treeOptimize
Optimized syntax tree


Firstly, the XQuery source is parsed to create an syntax tree. The syntax tree is a tree representation of the query, which can be processed further. The syntax tree was constructed mostly following the rules provided in the XQuery 1.0 and XPath 2.0 Formal Semantics specification, although we have started to deviate from the formal semantics to provide better error reporting (by providing more context about where the expression came from) or for increased performance (for example making a general comparison an atomic expression). There is still a lot more we can do in this area (for example not expanding out filter expressions until after type checking has been performed, which reduces the number of expressions we need to create).

The second stage of compilation is binding and type checking. Because of the way namespaces are bound in XQuery we do not necessarily know the namespaces of qualified names as we parse them. Consider the following query:

<e attr1="{$foo:bar}" 
xmlns:foo="http://www.example.org/"
/>

When parsing this query we do not know which namespace the prefix foo refers to until we have parsed the namespace declaration. To avoid parsing the query multiple times, we bind variables and functions in a separate pass. The binding pass works out the in-scope namespaces for any namespace-dependant expressions, and binds variable references and function calls to the variables and functions they refer to.

Once the expression is bound, the static types of all expressions are computed, and where a static type does not match an expected type for an expression, either an error is raised (if the static typing feature is enabled), or an extra run time check is inserted as appropriate.

After binding and type checking, we are left with the compiled syntax tree, which contains all the information we need to run the query.

The final compilation stage is optimization. This stage attempts to transform the compiled syntax tree with the aim to making the end result more efficient. There are many optimizations that XmlPrime performs, too numerous to list here. Since this optimization stage is a large part of what makes XmlPrime so fast, we will outline many of the optimizations we perform in future posts to this blog.

You can see the optimization process in action by using the XmlPrime command line tool. The -e flag indicates that the query plan should be outputted. The query plan is a textual representation of the abstract syntax tree. By using the -c (compile but don't run) and toggling the -o0 (disable optimizations) flag you can see the query plan before and after the optimization pass has taken place.

Friday, 4 September 2009

XmlPrime 1.0 performance enhancements

This is the first post in a series of posts about the performance updates in XmlPrime.

First off, some numbers.

In order to demonstrate the performance changes I have run the 3 samples that we provide, with the command line tool in the XmlPrime 0.9 Technical Preview and the command line tool provided with XmlPrime 1.0. All these tests were run on an x64 machine with the x64 VM (under x86 the performance is much better - we will cover this in a later post).

The times are as reported by the command line tool, so the compile time includes parsing, optimization and compilation; the run time includes the time to evaluate the query and serialize the results.

Raytracer

The times here are from running ppm.xq of the final raytracer at a resolution of 640x480. Here is the command used to run this test:
XQuery -t ppm.xq width=640 height=480
compile(ms)run(ms)
0.91182412186
1.0163307593

Currency

These times were computed with a modified version of currency.svg that takes a source document as a context item, supplied as GBPNoon.xml. This is to avoid network latency from affecting the timings.

The command used to run this query was
XQuery -t -s GBPNoon.xml currencysvg.xq
compile(ms)run(ms)
0.911991344
1.064955

iTunes Artist Graph

The iTunes query used was the one from part 3 of the sample. This because the part 4 query is largely IO bound, as it retrieves a lot of data from Wikipedia.

The command used to run this query is:
XQuery -t iTunes3.xml
compile(ms)run(ms)
0.9145893483
1.08098529

This particular query runs slower in 1.0 than in 0.9 which we were not expecting. The reason behind this is that we changed some of the rules around variable dereferencing when we added support for modules to be compiled seperately. In this case global variables cannot be removed as they are part of the public interface of the query. Variables were being dereferenced in the main module, but their declarations were not being removed. Compounded with the fact that currently all global variables are evaluated eagerly meant that the source document was being validated several times. Suffice it to say after discovering this problem, the dereferencing issue has been fixed, and will be available in the next version available in the next couple of weeks. A run with the latest dev version brings the runtime down to 66884 ms.
[UPDATE (09/09/2009)]: These changes have been incorporated in XmlPrime 1.0.2

So how have we acchieved this performance boost? We have implemented a number of significant optimizations since 0.9. The biggest ones are listed here:
  • Unboxed evaluation of expressions
  • Improved strictness analysis
  • Unboxing variables
  • Static type analysis of recursive functions
  • Strictness and unboxing analysis of function arguments
  • Precompiling the DLL
  • Elimination of some quadratic algorithms during common sub-expression elimination
  • New XDM type system implementation, tighter bound to XML Schema
  • Function specialization
  • Specialization of core expressions
Most of these changes will probably sound quite cryptic, so this shall mark the start of a series of posts on how XmlPrime works under the hood, with in-depth articles on some of the different optimizations performed by XmlPrime.