The XmlPrime Blog

Tuesday, 25 May 2010

Join optimizations in XmlPrime

XmlPrime has "Join Optimizations" listed on its feature list. What exactly are these join optimizations and how are they used?

The join optimizations performed by XmlPrime attempt to reduce the algorithmic complexity of complex queries. In theory you do not need to do anything to take advantage of these join optimizations - if your query contains a join-like construct then it should be optimized to a join. However it is good to be aware of the optimizations that are performed so you can understand why a join has not been spotted, or what joins are possible in XmlPrime.

Tuples

Consider the following query:


for $x in 1 to 10
let $y := $x * 2
return $y

An algebra (at least for our purposes) is a system for modelling a query.
For most of the compilation process we use an algebra in which the previous expression looks something like this: (note that I have made up this syntax for the purpose of explaining the algebra)


For($x)
 Range
   Constant(1)
   Constant(10)
 Let($y)
   Multiply
     VariableReference($x)
     Constant(2)
   VariableReference($y)

The for expression enumerates over its domain and for each item sets a value of $x and evaluates its return expression (in this case the let expression).

When we perform join optimizations we first transform the query to a different algebra (a tuple algebra). In the tuple algebra, each FLWOR clause is an expression that returns a set of tuples. A tuple assigns values to a set of variables. For example for $x in 1 to 10 returns a sequence of 10 tuples. The first one sets the value of $x to 1, the second tuple sets the value of $x to 2 and so on. Our previous example looks like this (again the notation is made up).


Return
 Concatenate
   For($x)
     Range
       Constant(1)
       Constant(10)
   Let($y)
     Multiply
       VariableReference($x)
       Constant(2)
 VariableReference($y)

The Concatenate expression above takes each tuple from the input, and then uses it to generate a sequence of tuples from the right hand input. In this case the value of $x from each input tuple is used to generate a single tuple that defines $y. Each tuple returned from the second input is combined with the tuple from the left input, and the result is combined. In this case for each tuple that defines $x we generate a tuple that defines $y and return a single tuple that defines both $x and $y.

Note that this change of algebra is merely a different way of describing the query, so far this hardly affects how the query is run. In practice "returns a tuple" means "sets some variables" and "combining two tuples" means "setting some variables, and then setting some other variables".

In a future version of XmlPrime we intend to eliminate the first algebra entirely and perform all our optimizations with this join algebra. A number of the new expressions added in XQuery 1.1, eg. group by fit a lot more cleanly with the tuple algebra.

We can perform certain optimizations in this algebra that are not possible (or at least very difficult) in the basic algebra above. These optimizations are the join optimizations.

Products

The simplest join optimization performed by XmlPrime is forming products. A product is constructed when two independent expressions are concatenated. The expressions are independent if none of the variables defined in the first expression are used in the second expression.

Here is an example:


for $row in /rows/row
for $column in /columns/column
return
 <intersection row="{$row}" column="{$column}" />

In tuple algebra, this gets converted to something like the following:


Return
 Product
   For($row)
     Step(child::row)
       Step(child::rows)
         Root
   For($column)
     Step(child::column)
       Step(child::columns)
         Root
 Element(intersection)
   Sequence
     Attribute(row)
       VariableReference($row)
     Attribute(column)
       VariableReference($column)

Below is an excerpt from the query plan, showing this product:


product
 for $row in
   step
     $fs:dot_0/child::rows
     child::row
with
 for $column in
   step
     $fs:dot_0/child::columns
     child::column

The original query plan looks like the columns would be navigated once for each row retrieved from the document. The product prevents this. When the product is evaluated it stores a list of all the tuples from the right hand side, and then each tuple in the left gets combined with each tuple from the right, without reevaluating the inner for loop. In practical terms, all the assignments to $y are recorded and replayed for each value of $x.

For this case the result can be mimicked by storing the definition of $y in a variable, but in the general case the product may not be storing just the values of a single for clause, but a whole set of FLWOR clauses.

Joins

A join is formed when a where clause depends on both sides of a product.

Here is an example query with a join:


for $author in doc("bookshelf.xml")/book/author
for $book in doc("bookshop.xml")/book
where $book/author = $author
return $book

Note that the where expression applies a predicate to a sequence of tuples returning only those which match. Since the two for expressions are independent they form a product, and since the where clause depends on both sides of the product, this is a join.

In this form joins are evaluated no differently to products with a where clause. This is known as a nested loop join. The predicate is applied to the product of the tuples returned from the two for expressions. This gives quadratic performance.

In the query plan a join looks something like the following:


hash join
 for $author in doc("bookshelf.xml")/book/author
to
 for $book in doc("bookshop.xml")/book
on
 $book/author = $author

Hash Join

If he predicate of a join is an equality operator the join then gets converted to a hash join. The expressions on either side of the equality operator are known as the keys. When the right (second) list of tuples is constructed, the right key is evaluated for each tuple. The tuples are stored in a Dictionary mapping from a given key value to a list of tuples that match this key (this dictionary is known as the join index. Once this dictionary is built up then for each left tuple (from the first input), the left key is evaluated. This key is used to look up the matching tuples from the right hand side and and the results are combined and returned. Assuming that only a few tuples from the right hand side match each time then this reduces the complexity from O(n²) to O(n) (assuming constant overhead for a dictionary lookup).

There were many complications in implementing such a join, especially in the general case when the arguments have type xs:anyAtomicType*. If two types are incomparable, a type error should be raised. xs:untypedAtomic values need to be converted to the base type of every type that appears in the other key, and used for comparison. numeric types should be compared as if they had been promoted to the lowest base type. Numeric NaN values never match. If keys are sequences they may match more than one key in the index, but all the matching tuples must be returned once, and in the order they were encountered. Because of all these potential issues, there are several implementations of the hash join in XmlPrime which tackle different combinations of these issues. All the implementations have the same O(n) basic performance characteristics.

Left Outer Joins

A left outer join is similar to a join, but the results are grouped for each input argument. Here is an example of a query with a left outer join:


for $user in doc("users.xml")/user
return
 <user name="{$user/@name}">
 {
   for $comment in doc("comments.xml")/comment
   where $comment/@user = $user/@name
   return $comment
 }
 </user>

When converting this query into the join algebra, XmlPrime extracts each seperate FLWOR expression and puts it into a new let clause (with a new variable; in this case we will call it $comments), in order to group all FLWOR clauses together:


for $user in doc("users.xml")/user
let $comments := for $comment in doc("comments.xml")/comment
                where $comment/@user = $user/@name
                return $comment
return
 <user>{$comments}</user>

The following pattern is then spotted as a left outer join.


[ClausesA]
let $a := [ClausesB]
         where [Predicate]
         return [AggregateExpr]

The above pattern is only a Left outer join if ClausesA and ClausesB are independent, and if Predicate depends
on both ClausesA and ClausesB. Note that in order to reach this condition some (let) clauses may need to be moved from ClausesB to ClausesA and some clauses may need to be moved from ClausesA and concatenated to the entire expression.

The left outer join is evaluated like a standard join, except for each tuple stored from the right hand side we also compute the aggregate expression (or if the join is evaluated lazily, the variables required to compute the aggregate expression) and then for each left hand tuple we find all the tuples that match it. We get (or compute) the aggregate value for these tuples, concatenate (this is normal sequence concatenation, not tuple concatenation) the aggregate values and then store the result in $a.

As with the standard join, both nested-loop and hash variants are supported. Again, in the case of the hash join, assuming only a few tuples match in each case, this reduces an O(n²) operation to O(n).

In the query plan a left outer hash join looks something like the following:


left outer hash join
 for $user in doc("users.xml")/user
to
 for $comment in doc("comments.xml")/comment
on
 $comment/@user = $user/@name
aggregate
 $comment
as $comments

Group-by

One particularly common form of the left outer join is the group-by. A group by clause is where you want to group all items in a sequence by a particular key. Here is an example which groups a list of books by their author.


let $books := doc("bookshelf.xml")/books/book
for $author in distinct-values($books/@author)
return
 <author name="{$author}">{
   $books[@author=$author]
 }</author>

Using the join optimizations so far described this will optimize to a left outer join (and inventing a couple of variables, $book and $matching-books):


let $books := doc("bookshelf.xml")/books/book
left outer hash join
 for $author in distinct-values($books/@author)
to
 for $book in $books
on
 $book/@author = $author
aggregate
 $book
as
 $matching-books
return
 <author name="{$author}">{$matching-books}</author>

The group by is formed when you have a left outer join with the following form:


left outer hash join
 for $x in distinct-values(keys($sequence))
to
 for $y in $sequence
on
 $x = key($y)
aggregate
 $y
as
 $group

In the above plan keys and key are general expressions depending on the variables shown. A left outer join is formed if applying keys to a sequence is the same as applying key to each item in the sequence and concatenating the result. Also with the current implementation key must return zero or one items.

Determining the condition above (applying keys to a sequence is the same as applying key to each item in the sequence) in general is not that easy, and our implementation is not yet perfect, although this improves with every release.

Note also that in practice the left outer join is liable to have many more clauses (in particular common subexpression elimination introduces a lot of let clauses), and these may have to be moved out of the join in order to get it into this form.

If XmlPrime does manage to manipulate the query to be in this form, and all the conditions are satisfied, then the query plan now looks like this:


 for $book in doc("bookshelf.xml")/books/book
group by
 $book/@author
aggregate
 $book
as
 $matching-books
return
 <author name="{$author}">{$matching-books}</author>

The group by clause combines the join and the distinct values operation. It builds a join index, storing the aggregate values as it goes. Iterating over this join index then provides the values for the aggregate variable. Over the left outer join, this group by reduces the amount of memory used, and ensures that the input expression is evaluated only once. In the best case this can reduce the memory usage by more than half. This is because in the left outer join, the input expression is used twice and so all its values are stored in a variable. Also an additional hash table is built up to perform the distinct-values operation.

Conclusion

Hopefully this post has shed some light on the various join optimizations that are performed in XmlPrime. By examining the query plan of a query you can see which join optimizations have been applied.

Most of the improvements to join optimizations come from rewriting queries to make them closer to the forms above. If you have a query which you believe should contain one of the above optimazations, but does not then please let us know as we are keen to spot as many different ways of writing a join as possible.

Tuesday, 11 May 2010

XmlPrime 1.4.0 Released

The release of XmlPrime 1.4.0 has arrived.

The biggest new feature in this version is the ability to write modules as classes in .NET, and call them from XQuery or XPath. Here is an example of a module written in C#


[XdmModule("http://www.mymodule.com/")]
public class MyModule
{
    [XdmFunction("green-bottles")
    public static XPathAtomicValue GreenBottles(XPathAtomicValue count)
    {
        var builder = new StringBuilder();
        for (var x=count.ValueAsDecimal; x>0; x--)
        {
            builder.Append(x);
            builder.Append(" green ");
            builder.Append(x == 1 ? "bottle" : "bottles");
            builder.Append(" sitting on the wall\n")

            builder.Append(x);
            builder.Append(" green ");
            builder.Append(x == 1 ? "bottle" : "bottles");
            builder.Append(" sitting on the wall\n")

            builder.Append("And if one green bottle should accidently fall\n")

            builder.Append("There'll be ");
            builder.Append(x == 1 ? "no" : x - 1);
            builder.Append(" green ");
            builder.Append(x == 2 ? "bottle" : "bottles");
            builder.Append(" sitting on the wall\n")

            if (x != 1)
                builder.Append("\n");
        }

        return XPathAtomicValue.Create(builder.ToString());
    }
}

Which can now be called from XQuery as so:


import module namespace my="http://www.mymodule.com/";

<lyrics>{my:green-bottles(10)}</lyrics>

More detailed information about the improved modules support can be found here.

There have also been a number of performance improvements, in particular to join analysis.

For the full list of new features see the change log.

As usual, you can download your free copy now (for trial or non-commercial use).

Friday, 1 January 2010

Push-based evaluation

The release of XmlPrime 1.2 has introduced eager evaluation. We use the term eager to refer to an expression whose result is consumed in its entirety. Eager evaluation is an evaluation technique that produces the entire result before returning.

Any expression returning a sequence of items was previously evaluated with lazy evaluation. This method of evaluation returns an IEnumerator<T> instance which computes the values of an expression as they are needed, they are "pulled" from the result of the expression.

In XmlPrime 1.2 when an expression is eager (all its values will be read) the expression is evaluated with a push based strategy. Control flow remains with the expression being evaluated, and the results are "pushed" through to the caller, either by populating a list of values, or by passing the result to a callback function that perfoms further processing on each item.

A second form of eager evaluation has also been added. The purpose of this form of evaluation is to stream the result of a query to the serializer. Previously only complete nodes could be returned. This new method splits nodes into a series of events (very similar to the methods on XmlWriter). These are then passed directly to the serializer.

Both the previous methods treat an item as an atomic value. Historically that meant that for any XML data to be returned, a tree had to be constructed. The performance was worst with constructed nodes wrapping input nodes, like the following:


<results>{//result}</results>

The path expression itself returns a sequence of navigators, but in order to incorporate them in a result tree they have to be copied into an internal representation. This copying process effectively doubles the memory usage of the process, and causes a lot more work than is necessary, especially if the result is then going to be written out and discarded. With the new serialized evaluation, the copying is bypassed and the navigators are passed directly to the serializer.

Being able to write the result directly to the serializer is the most obvious benefit of eager evalution. Eager evaluation however is faster in almost all circumstances where all the values are consumed. To illustrate these benefits I will have to go into more depth with the implementation details.

The final pass of compilation transforms the query in to a tree of IEvalauator instances which provide methods to evaluate the expression. The IEvaluator interface provides all the different evaluation methods.

For this example I will consider the evaluation of a sequence of expressions that are concatenated together. This is implemented as an evaluator class in XmlPrime. The lazy evaluation method returns an instance of the following enumerator:


  internal sealed class FlattenSequenceEnumerator<T> : IEnumerator<T>
  {
    private readonly DynamicContext _context;
    private readonly IEvaluator<T>[] _subsequences;
    private ISeekableEnumerator<T> _currentEnumerator;
    private int _nextIndex;

    public FlattenSequenceEnumerator(IEvaluator<T>[] subsequences, 
                                     DynamicContext context)
    {
      _subsequences = subsequences;
      _currentEnumerator = _subsequences[0].EvaluateLazySequence(context);
      _nextIndex = 1;
      _context = context;
    }

    public override T Current
    {
      get
      {
        return _currentEnumerator.Current;
      }
    }

    public override bool MoveNext()
    {
      while (!_currentEnumerator.MoveNext())
      {
        if (_nextIndex >= _subsequences.Length)
          return false;
        _currentEnumerator =
           _subsequences[_nextIndex++].EvaluateLazySequence(_context);
      }

      return true;
    }
  }

Note: The code has been simplified and reformatted for this blog, and some members have been omitted.

Every time an item is retrieved, the MoveNext method and the Current property is retrieved. Both these calls pass execution to the next enumerator in the chain. However the MoveNext method then needs to check if the enumerator has been exhausted, and if so obtain the next enumerator in the sequence. The overhead of concatenating sequences is two virtual function calls and some state checking for every item returned. Whilst this doesn't sound like much remember that this is just a small part of a tree of evaluations and in more complicated expressions (for example the closures to lazy arguments to recursive functions) this can extend to hundreds of virtual calls for each item returned, from overhead from using enumerators.

Admittedly there are tricks that can be used to reduce the number of virtual calls, for example merging the MoveNext and Current method into one function, or introducing a method that allows a new enumerator to be returned, which would avoid the extra calls for the final argument. Whilst these would reduce the overhead, it would not reduce the fact that the overhead grows proportionally to both the complexity of the expression and the number of items returned.

Now let us consider the eager evaluation methods for this unit:


public override void EvaluateEagerSequence(List<T> items,
                                           DynamicContext context)
{
  for (var i=0; i<_evaluators.Length; i++)
    _evaluators[i].EvaluateEagerSequence(items, context);
}

public override void EvaluateEagerSequence(Action<T> action,
                                           DynamicContext context)
{
  for (var i = 0; i < _evaluators.Length; i++)
    _evaluators[i].EvaluateEagerSequence(action, context);
}

The overhead in this case of enumerating the sequence of items is just the cost of enumerating over an array. There are no per-item costs.

So why don't we use this evaluation method all the time? The main advantages of lazy evaluation is that the values are not computed until they are needed, and if they are not needed they are not computed. Lazy evaluation leaves the caller in full control of the program flow, but eager evaluation keeps control of the program flow and only returns it once all the values have been computed.

The costs are inherent in the IEnumerator model, and derive from the fact that the enumerator must be yield control of what it is doing and be able to continue where it left off.

The ability to abort computation from an eager evaluation could be implemented by raising an exception when the last item required has been processed (although this comes at a cost).

The most efficient way to implement the ability to halt and continue processing in eager evaluation is with coroutines. This would combine all the benifits of both eager and lazy evaluation. Support for coroutines have recently been added to Mono. They are not supported by the Microsoft .NET runtime however and will not be in the near future so it may be many years before we can try this approach.

Tuesday, 22 December 2009

XmlPrime 1.2 Released

XmlPrime 1.2 has just been released.

The major new features in this release are push-based evaluation.

Queries can now be evaluated directly to an XmlWriter or directly serialized to a TextWriter or Stream. This prevents an immediate result tree from being constructed, which reduces memory usage, and allows the result to be returned sooner, and quicker.

All expressions returning sequences can now be evaluated via push-based mechanisms. This method of evaluation is more efficient, but only where the entire sequence is consumed. This optimization is applied automatically to any expression that is consumed eagerly.

Expect a post next week with a more detailed analysis of this feature.

See the full change log for more information about the changes in this version, or download the new version now.

Thursday, 17 December 2009

Avoiding StackOverflowExceptions

This post describes several situations that can cause StackOverflowExceptions in XmlPrime, and how to avoid them.

Look at the following fairly innocuous XQuery function:


declare function local:fact($x)
{
 if ($x = 0)
 then 1
 else $x * local:fact($x-1)
};

As you probably expect this function works fine for small values of $x, but as $x grows large then the function will overflow the stack. The stack is overflowed far more quickly than a direct C# implementation of this function, as a function call is made for each expression. In this case the call stack contains one call to prepare an (XmlPrime) stack frame, one call to bind the arguments, one call to evaluate the if expression, and finally a call to evaluate the multiplication operator before recursing. This means that the stack will overflow quite quickly.

So, what can we do about this?

The trick to solving this problem is an optimization called tail recursion. If a function returns the result of a function call as a value, then the runtime can evaluate this new function call in place of the old one, without growing the call stack. Let me try to demonstrate what I mean by this. The factorial function we defined above is not tail recursive, as the value from the recursive function call is not returned directly, but is multiplied by $x after it is returned. We can rewrite the function in order to make it tail recursive, by adding an accumulating argument:


declare function local:fact2($x, $accum)
{
 if ($x = 0)
 then $accum
 else local:fact($x - 1, $accum * $x)
};

declare function local:fact($x)
{
 local:fact($x, 1)
};

Note that the value of the recursive call is now returned directly, this means that the tail-recursion optimization can be applied. Evaluating this function is now a lot more like evaluating the following C# function:


int Fact(int x)
{
 int accum = 1;
 while (x != 0)
 {
   x = x-1;
   accum = accum * x;
 }
 return accum;
}

As you can see, evaluated in this way the function should no longer trigger a stack overflow exception. Unfortunately though, the function had to be manually rewritten in order to take advantage of this optimization. In the future we would like to try to make this transformation automatically, to widen the set of functions in which tail recursion can be applied.

Does this solve our problem? For this case, yes, but things can get more complicated.

Consider writing our own fold function. fold takes a function f that takes two arguments and returns an accumulated value. fold applies this function to a sequence by applying it to current result and each item in turn to build up a final result. For example if we fold the addition function over a sequence we get the sum of the sequence, and if we fold the multiply function over a sequence we get the product.

In XQuery 1.1, fold can be written like this:


declare function local:fold($func as function(item()*, item()) as item()*,
                            $seq as item()*,
                            $accum as item()*)
{
 if (empty($seq))
 then $accum
 else local:fold($func, subsequence($seq, 2), $func($accum, $seq[1]))
};

However with XQuery 1.0, which does not support higher-order functions, we have to fake it by passing around a node to represent the function, and writing an apply function that applies it:


declare function local:apply($func as node(), 
                             $accum as item()*,
                             $arg as item())
{
 typeswitch ($func)
 case element(add) return $accum + $arg
 case element(multiply) return $accum * $arg
 case element(reverse) return ($arg, $accum)
 ...
 default error((), "Function not recognised")
};

declare function local:fold($func as node(),
                            $seq as item()*,
                            $accum as item()*)
{
 if (empty($seq))
 then $accum
 else local:fold($func,
                 subsequence($seq, 2),
                 local:apply($func, $accum, $seq[1]))
};

local:fold is tail recursive and so we should never get a stack overflow, right?

Unfortanately, this is not the case. The problem occurs because local:apply (or $func in the XQuery case) is not nescassarily strict in the accumulator argument. In the XQuery 1.0 case this is because we might hit the default clause in the local:apply function, and in the XQuery 1.1 case we might have a function like:


declare function last($accum as item()*, $item as item()) as item*
{
 $item
};

which when folded over a sequence returns the last item.

Aside: In the XQuery 1.0 version, if you are to call fold with a statically known function, then XmlPrime may specialize the fold function and create a new version inlining the function application. This simplifies the function enough to identify that the argument is strict. This may not always be the case though, so it shouldn't be relied on.

Since $accum may not be evaluated, it is defined lazily. So rather than applying the function and passing the value to the recursive function, a closure is created and the value is computed later.

Consider evaluating the function call


local:fold(<add />, (1,2,3,4,5), 0)

After the first iteration the closure for $accum looks like:


local:apply(<add />, 0, 1)

The next iteration applies the value 2 to $accum, and so the closure for $accum looks like


local:apply(<add />,
  local:apply(<add />, 0, 1), 2)

And when the function finally returns $accum it returns a closure that evaluates


local:apply(<add />,
  local:apply(<add />,
    local:apply(<add />,
      local:apply(<add />,
        local:apply(<add />, 0, 1), 2), 3), 4), 5)

When $seq is quite large, computing the result of this closure will result in a stack overflow error.

So how can we fix this?

In order to fix this we need to ensure that the argument for the recursive call is evaluated fully. In the general case there is no easy way to fix this beyond ensuring that the argument is used entirely in every eventuality. In order to facilitate this we have added a new xmlprime:eager extension expression that will be available in the upcoming XmlPrime 1.1 release.

To force the argument to be fully evaluated we can annotate the expression like this:


declare namespace xp = "http://www.xmlprime.com/"

declare function local:fold($func as node(),
                            $seq as item()*,
                            $accum as item()*)
{
 if (empty($seq))
 then (# xmlprime:eager #) { $accum }
 else local:fold($func,
                 subsequence($seq, 2),
                 local:apply($func,
                             (# xmlprime:eager #) { $accum },
                             $seq[1])))
};

Now $accum is evaluated fully in both branches. This forces the strictness analysis to identify $accum as a strict variable and hence the function becomes a simple tail recursive function and is more resiliant to stack overflow errors.

This fold function still has a problem that can cause stack overflow errors. The $seq argument is evaluated lazily (as each iteration only looks at the first value of $seq). Whilst technically the function is strict in $seq (as it does inspect every value eventaully), the current analysis engine is not capable of spotting this. It is impossible to always determine the strictness of an argument with 100% certainty as this can be specialized to the halting problem. The value of $seq ends up in a closure that looks something like this:


subsequence(
  subsequence(
    subsequence(
      subsequence(
        subsequence(..., 2), 2), 2), 2), 2)

This will eventually cause stack overflow errors.

This case of using the first item of a sequence and recursing with the remainder of the sequence is an important and common one in functional programs. In a future version of XmlPrime we would like to spot this case, read the first few items from the enumerator, and recurse by passing the same enumerator to the next iteration. This would avoid building a closure to evaluate the subsequence operation, and avoid caching the values returned and so should drastically improve performance of this kind of function (probably by a factor of about O(n) where n is the depth of recursion). Unfortunately this feature has not yet been implemented, and so this problem enforces a limit on the number of items in a sequence that can be processed in this manner.

You may be wondering why it is necessary to put the eager extension on the then branch of the if expression. Consider the following query, which reverses a sequence:


local:fold(<reverse />, $seq, ())

This query behaves properly and avoids the problem. However, lets make a simple modification to the query, to return the $nth last item in $seq:


local:fold(<reverse />, $seq, ())[$n]

For large values of $n removing the xmlprime:eager around the then clause this would cause the query to stack-overflow. This is because we do not always use all of the result of the function, and so the variable is evaluated lazily. This causes a tree of enumerators that perform the concatenation to be produced, and can cause a stack overflow when iterating over the sequence.

This case is another case that we would like to fix in a future version of XmlPrime. To fix this we would need to implement a custom MoveNext method that rather than returning a boolean value would return a new IEnumerator to use to enumerate the rest of the sequence (or null if we are at the end of the sequence). This would avoid pushing a new (C#) stack frame for each concatenation in a lazy result.

Monday, 2 November 2009

Using XmlPrime to access SQL databases

One of the uses of XQuery is to query relational databases. This posting is going to demonstrate how one might use XmlPrime to access a SQL database.

To access various data sources you can write a collection resolver implementing the ICollectionResolver interface. To do this you need to implement the following method:

public IXPathNavigable[] Resolve(AnyUri uri,
                                 DocumentSet documentSet,
                                 XmlNameTable nameTable,
                                 XsltWhitespaceSettings whitespaceSettings)

First thing we need is a URI format to indicate we are connecting to the database, for example:

mssql://username:password@server/database/table

We need to convert this URI to a SQL connection string, this can be done as follows:

var connectionString = new StringBuilder();
connectionString.Append("Server=" + uri.Host + ";");
var split = uri.Path.Split('/');
var table = split[2];
connectionString.Append("Database=" + split[1] + ";");
var userInfo = uri.UserInfoItems;
if (userInfo == null)
   connectionString.Append("Trusted_Connection=True;");
else
{
   connectionString.Append("User ID=" + userInfo[0] + 
                           ";Password="+userInfo[1] + ";");
}

We can now connected to the database and generate some XML nodes for our collection. The XML format is going to be:

<row column1="value1" column2="value2"/>
<row column1="value3" column2="value4"/>

We now open the database and get the data from the specified table:

using (var connection = new SqlConnection(connectionString.ToString()))
{
   connection.Open();
   var command = new SqlCommand("SELECT * FROM " + table, connection);
   var collection = new List<IXPathNavigable>();
   var doc = new XmlDocument(nameTable);
   using (var reader = command.ExecuteReader())
   {
      while (reader.Read())
      {
         ...
      }
   }
   return collection.ToArray();
}

The XML nodes need to be created with the same name table as the query is using, hence an XmlDocument is created with the name table passed into the Resolve method.

Next we need some code to put in the while loop to generate the XML elements for each row. A certain degree of mapping is required between SQL field types and XML types. This code shows a subset of possible field types for demonstration purposes.

var row = doc.CreateElement(table);
for(var i = 0;i<reader.FieldCount;i++) {
   if (!reader.IsDBNull(i))
   {
      var type = reader.GetFieldType(i);
      XPathAtomicValue value;
      if(type == typeof(Boolean))
         value = XPathAtomicValue.Create(reader.GetBoolean(i));
      else if(type = typoeof(Byte[])) {
         var length = reader.GetBytes(i, 0, null, 0, 0);
         var bytes = new byte[length];
         reader.GetBytes(i, 0, bytes, 0, (int)length);
         value = XPathAtomicValue.CreateBase64Binary(bytes);
      }
      else if (type == typeof(DateTime))
         value = XPathAtomicValue.Create(
                   new DateTimeWithZone(reader.GetDateTime(i), (TimeSpan?)null)
                 );
      else if (type == typeof(Int32))
         value = XPathAtomicValue.Create(reader.GetInt32(i));
      else if (type == typeof(String))
         value = XPathAtomicValue.Create(reader.GetString(i));
      else
         // TODO: similar methods will need to be implemented
         // for other SQL field types
         throw new NotImplementedException();
      var attr = d.CreateAttribute(reader.GetName(i));
      attr.Value = value.Value;
      row.Attributes.Append(attribute);
   }
}
collection.Add(row);

So the whole class will be:

using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Text;
using System.Xml;
using System.Xml.XPath;

namespace xp.Connectors
{
   public class SimpleSqlCollectionResolver : ICollectionResolver
   {
      public IXPathNavigable[] Resolve(AnyUri uri,
                                 DocumentSet documentSet,
                                 XmlNameTable nameTable,
                                 XsltWhitespaceSettings whitespaceSettings)
      {
         if (uri != null && uri.Scheme == "mssql" && uri.Path != null)
         {
            var connectionString = new StringBuilder();
            connectionString.Append("Server=" + uri.Host + ";");
            var split = uri.Path.Split('/');
            var table = split[2];
            connectionString.Append("Database=" + split[1] + ";");
            var userInfo = uri.UserInfoItems;
            if (userInfo == null)
               connectionString.Append("Trusted_Connection=True;");
            else
            {
               connectionString.Append("User ID=" + userInfo[0] +
                                       ";Password="+userInfo[1] + ";");
            }

            using (var connection = new SqlConnection(connectionString.ToString()))
            {
               connection.Open();
               var command = new SqlCommand("SELECT * FROM " + table, connection);
               var collection = new List<IXPathNavigable>();
               var doc = new XmlDocument(nameTable);
               using (var reader = command.ExecuteReader())
               {
                  while (reader.Read())
                  {
                     var row = doc.CreateElement(table);
                     for (var i = 0; i < reader.FieldCount; i++)
                     {
                        if (!reader.IsDBNull(i))
                        {
                           var type = reader.GetFieldType(i);
                           XPathAtomicValue value;
                           if (type == typeof(Boolean))
                              value = XPathAtomicValue.Create(reader.GetBoolean(i));
                           else if (type == typeof(Byte[]))
                           {
                              var length = reader.GetBytes(i, 0, null, 0, 0);
                              var bytes = new byte[length];
                              reader.GetBytes(i, 0, bytes, 0, (int)length);
                              value = XPathAtomicValue.CreateBase64Binary(bytes);
                           }
                           else if (type == typeof(DateTime))
                              value = XPathAtomicValue.Create(new DateTimeWithZone(reader.GetDateTime(i), (TimeSpan?)null));
                           else if (type == typeof(Int32))
                              value = XPathAtomicValue.Create(reader.GetInt32(i));
                           else if (type == typeof(String))
                              value = XPathAtomicValue.Create(reader.GetString(i));
                           else
                              throw new NotImplementedException();
                           var attr = doc.CreateAttribute(reader.GetName(i));
                           attr.Value = value.Value;
                           row.Attributes.Append(attr);
                        }
                     }
                     collection.Add(row);
                  }
               }
               return collection.ToArray();
            }
         }
         return null;
      }
   }
}

This class can now be passed into a DocumentSet which can then be set on the DynamicContextSettings of a query.

var documentSet = new DocumentSet(nameTable,null,
                                       new SimpleSqlCollectionResolver(),
                                       null);
var dynamicSettings = new DynamicContextSettings {DocumentSet = documentSet};

We can then run queries accessing the database by using:

collection('mssql://SERVER/Northwind/Employees')

or if we are accessing multiple tables from the same database a nice syntax is

declare base-uri "mssql://SERVER/Northwind/";
collection('Employees')

Of course, the performance of this will be appalling compared to native SQL queries. In a future release of XmlPrime we aim to offer native SQL support with a number of the XQuery operations mapped onto SQL operations and run on the database, rather than loading whole tables as XML documents.

In another post we'll look at implementing ICollectionTypeResolver so that our query optimizer starts taking advantage of the field static type information when compiling the query.

Wednesday, 28 October 2009

Variables and Strictness analysis in XmlPrime 1.0

In the last post I discussed the different ways in which variables are placed on the stack in XmlPrime 0.9, and the strictness analysis that is performed in order to determine which of these methods is most appropriate for each variable.

In this post I will explain how this mechanism has been expanded for XmlPrime 1.0.

In 0.9 variables were considered either to be eager, or lazy. For XmlPrime 1.0 we redefined these terms slightly. A variable is strict if it is always evaluated at least in part. A variable is eager if whenever it is evaluated it is evaluated in its entirety. If a variable is not eager, it is lazy. This provides us with a middle ground of a non-strict eager variable. $x in the code sample below is an example of a non-strict eager variable.


if (...)
then
  sum($x)
else
  ()

These variables are placed on the stack in closures, but when they are needed they are evaluated in their entirety, and the value on the stack is replaced with the strict version. By evaluating the variable in its entirety, the extra layer of caching that was required for lazy variables in XmlPrime 0.9 is not needed.

Both of these conditions propagate throughout the query in a similar way so the only significant barrier for development was annotating all the built in expressions and functions to more accurately describe how they might use their arguments. This may depend on how the function itself is evaluated.

The biggest change to variables in XmlPrime 1.0 is that they can now be unboxed. Recall that unboxing is one of new optimizations in XmlPrime 1.0. In XmlPrime 0.9, all entries on the stack were instances of IEnumerable<T> for items. In order for unboxing to have a significant effect we needed to extend the stack to allow unboxed values. In XmlPrime 1.0 all entries on the stack implement IStackEnumerable which provides IEnumerable<T> implementations for all types that are used in XmlPrime (although in most cases most of the GetEnumerator methods throw an exception).

Implementations of all the types of closure are generic and can be used to store lists or closures of any unboxed type. The implementations for XdmItem
also allow for evaluation as appropriate unboxed types.

Whilst individual items need not be boxed, the value as a whole still needs to be wrapped in a suitable stack entry. This can be mitigated in some cases, for example variables defined in for expressions now share the same stack entry object for every item in their definition.

As with unboxed expressions, a variable can only be unboxed if its dynamic type is not needed to be preserved anywhere that the variable is used. The strictness analysis was extended with a new property - a variable is unboxable if its type will never be inspected. The strictness analysis pass was extended to propagate this property as well, and so all the information required to identify which type of closure is to be created at runtime is described by the static type of the variable, and these three properties. The combination of these three properties is known as the evaluation strategy.

The techniques above fairly thoroughly cover variables introduced by let clauses. Local variables are also introduced by function arguments, and so a lot of work has gone into these for XmlPrime 1.0.

In XmlPrime 0.9 all function arguments were evaluated eagerly. Clearly this is not always beneficial. The strictness analysis rules have been extended to be applied to user defined function arguments.

The first challenge that is that the evaluation strategy of an argument depends on the evaluation strategy of the function itself. The effect of the strict property is easy to determine - if a function is not strict (might not be evaluated) then neither is its arguments. As it turns out strictness and eagerness of a function will never affect the unboxability of its arguments and vice versa. The strictness rules are applied twice to every function - once assuming that it is evaluated eagerly and unboxed, and once assuming that it is evaluated lazily and boxed. The evaluation strategy for each argument in each case is stored, and these can be combined to determine the evaluation strategy of each argument wherever the function is called.

The other big challenge for computing the evaluation strategy for function arguments comes with recursive (or mutually recursive functions). How do we determine whether $y is strict in the following case for example?


declare function local:f($x, $y)
{
  if ($x = 0)
  then $y
  else f($x - 1, $y)
}

The solution to this problem is to assume that every argument to every recursive function is strict, lazy and not unboxable. After the strictness analysis has been applied, if any of the evaluation strategies have been updated then the analysis is run again for any function that calls it. The analysis is repeated until the stratgies do not change. This algorithm ensures that optimal evaluation strategies are always selected.

Another improvement that has been applied to function calls is that if a function calls another function, then often it is passing local variables as arguments. Using the model described above, these arguments are either fully evaluated, or a lazy closure is created wrapping the previous definition. If a variable is fully evaluated then this causes an extra copy of the variables value to be stored in memory every time the function recurses. Adding an extra lazy thunk causes an extra level of caching to be added to the variable each time it recurses. This latter effect can compound causing successive recursions of a function perform increasingly badly. XmlPrime adds a simple check for this case during code generation - if a local variable is passed as an argument, then the stack entry is simply copied into the new stack frame.

This case (setting a variable to be the value of another variable) only occurs for function arguments, as otherwise it would have been rewritten by the optimizer. Another case that is eliminated by the optimizer is local variables whose definitions are values. When a function is called with a constant argument, the stack entry can be constructed ahead of time, and so no boxing is needed to make the call.

There are still many ways in which argument passing could be much improved. One case is for lazy variables whose definitions are only inspected once (this can only occur in function arguments, as otherwise the variable would have been eliminated). Currently the values are cached as the variable is evaluated, but this is unnecessary if the variable will never be inspected again. Look out for improvements like this in future versions of XmlPrime.

The XmlPrime Blog

Tuesday, 25 May 2010

Join optimizations in XmlPrime

Tuples

Products

Joins

Hash Join

Left Outer Joins

Group-by

Conclusion

Tuesday, 11 May 2010

XmlPrime 1.4.0 Released

Friday, 1 January 2010

Push-based evaluation

Tuesday, 22 December 2009

XmlPrime 1.2 Released

Thursday, 17 December 2009

Avoiding StackOverflowExceptions

Monday, 2 November 2009

Using XmlPrime to access SQL databases

Wednesday, 28 October 2009

Variables and Strictness analysis in XmlPrime 1.0

About XmlPrime

Blog Archive

Tuesday, 25 May 2010

Tuples

Products

Joins

Hash Join

Left Outer Joins

Group-by

Conclusion

Tuesday, 11 May 2010

Friday, 1 January 2010

Tuesday, 22 December 2009

Thursday, 17 December 2009

Monday, 2 November 2009

Wednesday, 28 October 2009

About XmlPrime

Subscribe

Blog Archive