Tuesday, 26 July 2016

.NET Core and the state of XML

We had an enquiry regarding using XmlPrime with .NET Core.  We were quite disappointed to find that porting isn't going to be easy.

System.Xml

If I've understood the code correctly, there are some worrying changes in System.Xml.  There are minor niggles, such as the disappearance of Close(), but that's not a major headache.  Some members relating to System.Xml.Schema have gone, which means it will be difficult to support schema-aware processing (more below).

The loss of XmlResolver as a public class is most definitely in the category of "blocking issue".  XmlResolver is now internal and appears to have a single concrete implementation which can only resolve URLs to files on the file system.  It's the mechanism we use for resolving DTDs, XSDs, XQuery modules, XSLT files and probably much more besides.

Admittedly, it's always been a bit of a frustrating class.  The "role" argument never got used.  It would have been nice to be informed whether a resolver was resolving a PUBLIC or SYSTEM identifier, or whether it was resolving a DTD or an XSLT file included from another transform.
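
To make concrete what a public resolver buys us, here is a minimal Python sketch of the idea rather than the .NET API itself.  The class InMemoryResolver, its method names and the example URIs are all hypothetical.  It resolves a relative reference against a base URI and then serves the document from memory instead of the file system, which is exactly what a file-only resolver cannot do.

```python
import io
from urllib.parse import urljoin


class InMemoryResolver:
    """Hypothetical resolver that serves documents from a dictionary."""

    def __init__(self, documents):
        self._documents = dict(documents)

    def resolve_uri(self, base_uri, relative_uri):
        # Turn a relative reference into an absolute URI.
        return urljoin(base_uri, relative_uri)

    def get_entity(self, absolute_uri, role=None):
        # `role` is where a caller could say what is being resolved
        # (a DTD, an included stylesheet, ...); this sketch ignores it.
        try:
            return io.BytesIO(self._documents[absolute_uri])
        except KeyError:
            raise FileNotFoundError(absolute_uri) from None


# Assumed example data: a stylesheet that exists only in memory.
resolver = InMemoryResolver(
    {"http://example.com/xsl/common.xsl": b"<xsl:stylesheet/>"}
)
uri = resolver.resolve_uri("http://example.com/xsl/main.xsl", "common.xsl")
stream = resolver.get_entity(uri, role="xslt-include")
```

The same shape covers DTDs, schemas and XQuery modules: only the dictionary lookup would change, for example to a database or an embedded resource.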

As an aside, there have been other annoying breaking changes in .NET 4.5.2 in this area, such as the case of the mysteriously disappearing external entities.


System.Xml.XPath

I had thought that this was largely intact until I discovered that XPathItem's default constructor is now internal.  That means we can't derive XPathAtomicValue, XPathFunctionItem, XPathMap and XPathArray, the classes representing the XPath and XQuery data model items.

If we can't use XPathItem as a base class, that pretty much rules out using XPathNavigator, which in turn pretty much rules out using System.Xml.XPath at all as things stand.

System.Xml.Schema

This is gone.  In a sense that's a good thing, as it gives us a clean slate to produce a fully compliant XSD 1.0 and XSD 1.1 implementation, but that's no small task.  In the meantime, we could implement just enough to support non-schema-aware processing.

System.Xml.Xsl

This is gone too, and its absence was the main reason for us wanting to port XmlPrime.

Conclusions

In summary, there seems to have been quite a bit of damage done to .NET's XML handling, and unless I'm mistaken or some of the issues given above are addressed, it will be a while before we can port to .NET Core.



XslCompiledTransform: Speed versus Correctness

Performance was a key requirement in the development of XmlPrime 3.0.  The benchmark for XSLT performance on the .NET Framework is of course the built-in System.Xml.Xsl.XslCompiledTransform class, which superseded the old XslTransform class.

First off, it's worth pointing out the obvious difference.  XmlPrime is an XSLT 2.0 processor, whereas XslCompiledTransform only supports XSLT 1.0.  This means we can only compare the two processors executing XSLT 1.0 stylesheets, which requires XmlPrime to operate in "backwards-compatibility" mode, as defined by the XSLT 2.0 specification.  Operating in this mode means that we are not on a level playing field.

Numeric operations

In XSLT 1.0, all numbers are double-precision floating point.  In XSLT 2.0, numbers may be integers, decimals or floating point numbers.  To meet the conformance requirements of XSLT 2.0, we use System.Decimal to represent both integer and decimal types.  Mathematical operations can therefore be a little slower than using System.Double.  For a fairer comparison, we could update the XSLT 1.0 stylesheet to XSLT 2.0 and ensure that all numeric literals are written as xs:double values.
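
The difference between the two numeric models is easy to see in a few lines.  The following is only an analogy: Python's float and decimal.Decimal stand in for System.Double and System.Decimal, and nothing here is XmlPrime code.

```python
from decimal import Decimal

# XSLT 1.0 style: every number is an IEEE 754 double.
double_sum = 0.1 + 0.2

# XSLT 2.0 style: the literals 0.1 and 0.2 are xs:decimal values.
decimal_sum = Decimal("0.1") + Decimal("0.2")

print(double_sum == 0.3)              # False: binary rounding error
print(decimal_sum == Decimal("0.3"))  # True: decimal arithmetic is exact here
```

The exact answer comes at a cost, since decimal arithmetic is done in software rather than by the floating point hardware.  In XPath 2.0 a literal only becomes an xs:double when it is written with an exponent, for example 1e-1 rather than 0.1.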

String operations

The implementation of string operations in XslCompiledTransform is wrong.  Just try

<xsl:value-of select="substring('&#x1f4a9;', 1, 1)" />

The input is a string containing a single Unicode code point, U+1F4A9.  Because that code point lies outside the Basic Multilingual Plane, the string is represented by a System.String instance of length two, consisting of the two System.Char values U+D83D and U+DCA9, which form the surrogate pair.

The correct response for the above code is to output &#x1f4a9; to the result document.  Instead, XslCompiledTransform outputs half of the surrogate pair.  I suspect this is a case where performance outweighs correctness.  The string handling functions such as string-length, substring-before, substring-after and substring are all considerably simpler if one ignores the requirements of the specification.
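
The two ways of counting can be reproduced outside .NET.  This is a small Python sketch, not the code of either processor: Python strings are indexed by code point, so the UTF-16 view that System.String uses has to be built by hand.

```python
text = "\U0001F4A9"  # one Unicode code point, U+1F4A9

# View the string as UTF-16 code units, which is how System.String stores it.
utf16 = text.encode("utf-16-be")
code_units = [
    int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)
]

print(len(text))                     # 1 (code points)
print([hex(u) for u in code_units])  # ['0xd83d', '0xdca9'] (code units)

# substring(text, 1, 1) counted in code points keeps the whole character.
by_code_point = text[0:1]

# The same call counted in code units keeps only the high surrogate,
# which is not a valid character on its own.
by_code_unit = code_units[0:1]
```

A conforming implementation has to walk the string and treat each surrogate pair as one character, which is where the extra cost comes from.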

Access to Internals

It is most frustrating when you come across a really useful method or property of a class, only to find that it is internal.  One such case is the UniqueId property of XPathNavigator.  This property is exactly what you would want to call to implement the fn:generate-id function efficiently.  However, because it is internal, it is just out of reach (without doing something nasty).

Type Inference

Suppose an XSLT 1.0 stylesheet contains the code

<xsl:template name="add">
  <xsl:param name="a" /> 
  <xsl:param name="b" />
  <xsl:value-of select="$a + $b" />
</xsl:template>

The compiler can't be sure what will be supplied for parameters a and b.  We might expect them to be numeric, but the caller could equally well pass in a node set and a boolean value.  A clever compiler might check all usages of template add and discover that it is always called with numeric values, thus avoiding any runtime checks.

XSLT 2.0 permits a stylesheet to be invoked by specifying an initial template.  In that case, we don't have a "closed world" over which to run a static analysis.  That is, we can't determine statically that add will always be supplied with numeric values.  We could of course compile a specialized version of this template.  This is something XmlPrime may do in the future.
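
The closed-world argument can be sketched in a few lines of Python.  This is a toy model under stated assumptions, not how XmlPrime or XslCompiledTransform actually work: the function can_skip_runtime_checks, the call-site tuples and the type labels are all made up for illustration.

```python
NUMBER = "number"


def can_skip_runtime_checks(call_sites, template, open_world):
    """Decide whether `template` may assume numeric parameters.

    call_sites is a list of (template_name, {param_name: static_type}).
    open_world is True when the template can be invoked from outside
    the stylesheet, for example as the initial template.
    """
    if open_world:
        # An external caller could supply anything, so no conclusion
        # can be drawn from the call sites we happen to see.
        return False
    calls = [args for name, args in call_sites if name == template]
    return all(t == NUMBER for args in calls for t in args.values())


# Assumed example: two calls to "add", both passing numbers.
sites = [
    ("add", {"a": NUMBER, "b": NUMBER}),
    ("add", {"a": NUMBER, "b": NUMBER}),
]

print(can_skip_runtime_checks(sites, "add", open_world=False))  # True
print(can_skip_runtime_checks(sites, "add", open_world=True))   # False
```

A specialized copy of the template would sidestep the problem: internal callers known to pass numbers use the fast copy, while the general version stays available for everyone else.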

Final Thoughts

In our benchmarking, we've seen cases where XmlPrime will outpace XslCompiledTransform on XSLT 1.0 stylesheets.  There are also cases where XslCompiledTransform is a clear winner, and this is due at least in part to the differences discussed above.  When time permits, it would be interesting to convert some of the XSLT 1.0 benchmarks to XSLT 2.0 with an eye to extracting the best possible performance from an XSLT 2.0 processor.

As ever, the advice is to measure performance for your use case and put a little effort into updating stylesheets where needed.