XSLT Unicode Horror

Different Java XSLT implementations handle Unicode characters differently. Here is test code that parses a UTF-8 XML document into a DOM document and then serializes it using a transformer.

  // source, parserClass and transformerClass are defined elsewhere in the test
  System.out.println("    SOURCE:  " + source);
  // Parse the XML string into a DOM document with the chosen parser
  DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance(parserClass, TestUnicode.class.getClassLoader());
  Document document = builderFactory.newDocumentBuilder().parse(new InputSource(new StringReader(source)));
  // Serialize the DOM back to text with the chosen transformer (no stylesheet, so an identity transform)
  TransformerFactory transformerFactory = TransformerFactory.newInstance(transformerClass, TestUnicode.class.getClassLoader());
  StringWriter writer = new StringWriter();
  transformerFactory.newTransformer().transform(new DOMSource(document), new StreamResult(writer));
  System.out.println("    RESULT:  " + writer.toString());

I tested the following transformers:

  • Xalan 2.7.1:

    • org.apache.xalan.processor.TransformerFactoryImpl
    • org.apache.xalan.xsltc.trax.TransformerFactoryImpl
    • org.apache.xalan.xsltc.trax.SmartTransformerFactoryImpl
  • Sun-Xalan (an internal transformer factory present in Sun JDK 5 and 6):

    • com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl
  • Saxon 8.7:

    • net.sf.saxon.TransformerFactoryImpl
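
The list above can be run in one pass. What follows is only a minimal sketch, not the original test class: the names TransformerSweep, serialize and sweep are made up for illustration, and it assumes the JDK's default parser is good enough (the parser choice is left out). Factories that are not on the classpath are reported rather than treated as failures, so the sketch also runs on a plain JDK without the Xalan or Saxon jars.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.TransformerFactoryConfigurationError;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical harness; class and method names are illustrative only.
public class TransformerSweep {

    // Factory class names taken from the list above.
    static final String[] FACTORIES = {
        "org.apache.xalan.processor.TransformerFactoryImpl",
        "org.apache.xalan.xsltc.trax.TransformerFactoryImpl",
        "org.apache.xalan.xsltc.trax.SmartTransformerFactoryImpl",
        "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl",
        "net.sf.saxon.TransformerFactoryImpl"
    };

    // Parse the source with the default JDK parser (an assumption of this
    // sketch), then serialize the DOM with an identity transform.
    static String serialize(TransformerFactory factory, String source) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(source)));
        StringWriter writer = new StringWriter();
        factory.newTransformer().transform(new DOMSource(document), new StreamResult(writer));
        return writer.toString();
    }

    // Run the same source through every factory, keeping the list order.
    static Map<String, String> sweep(String source) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : FACTORIES) {
            try {
                TransformerFactory factory =
                        TransformerFactory.newInstance(name, TransformerSweep.class.getClassLoader());
                results.put(name, serialize(factory, source));
            } catch (TransformerFactoryConfigurationError e) {
                // Thrown when the factory class cannot be found or instantiated.
                results.put(name, "(not available on this classpath)");
            } catch (Exception e) {
                results.put(name, "(failed: " + e + ")");
            }
        }
        return results;
    }

    public static void main(String[] args) {
        String source = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>&#119967;</foo>";
        for (Map.Entry<String, String> entry : sweep(source).entrySet()) {
            System.out.println("TRANSFORMER: " + entry.getKey());
            System.out.println("    SOURCE:  " + source);
            System.out.println("    RESULT:  " + entry.getValue());
        }
    }
}
```

What each line prints depends on which jars and which JDK are in use, so no expected output is given here.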

Here are the results for the MATHEMATICAL SCRIPT CAPITAL D character, 𝒟 (U+1D49F, decimal 119967):

  TRANSFORMER: org.apache.xalan.processor.TransformerFactoryImpl
      SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>𝒟</foo>
      RESULT:  <?xml version="1.0" encoding="UTF-8"?><foo>&#55349;&#56479;</foo>
      SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>&#119967;</foo>
      RESULT:  <?xml version="1.0" encoding="UTF-8"?><foo>&#55349;&#56479;</foo>
  TRANSFORMER: org.apache.xalan.xsltc.trax.TransformerFactoryImpl
      SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>𝒟</foo>
      RESULT:  <?xml version="1.0" encoding="UTF-8"?><foo>&#55349;&#56479;</foo>
      SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>&#119967;</foo>
      RESULT:  <?xml version="1.0" encoding="UTF-8"?><foo>&#55349;&#56479;</foo>
  TRANSFORMER: org.apache.xalan.xsltc.trax.SmartTransformerFactoryImpl
      SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>𝒟</foo>
      RESULT:  <?xml version="1.0" encoding="UTF-8"?><foo>&#55349;&#56479;</foo>
      SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>&#119967;</foo>
      RESULT:  <?xml version="1.0" encoding="UTF-8"?><foo>&#55349;&#56479;</foo>
  TRANSFORMER: com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl
      SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>𝒟</foo>
      RESULT:  <?xml version="1.0" encoding="UTF-8" standalone="no"?><foo>&#119967;</foo>
      SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>&#119967;</foo>
      RESULT:  <?xml version="1.0" encoding="UTF-8" standalone="no"?><foo>&#119967;</foo>
  TRANSFORMER: net.sf.saxon.TransformerFactoryImpl
      SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>𝒟</foo>
      RESULT:  <?xml version="1.0" encoding="UTF-8"?><foo>𝒟</foo>
      SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>&#119967;</foo>
      RESULT:  <?xml version="1.0" encoding="UTF-8"?><foo>𝒟</foo>

Or, summarized in a table:

  Transformer                 Input: 𝒟            Input: &#119967;
  Xalan 2.7.1                 &#55349;&#56479;    &#55349;&#56479;
  Sun-Xalan (Sun JDK 1.5+)    &#119967;           &#119967;
  Saxon 8.7                   𝒟                   𝒟
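
The two numbers in the Xalan row are not arbitrary: 55349 and 56479 are the UTF-16 surrogate code units (0xD835 and 0xDC9F) that Java uses internally to represent code point 119967. A small sketch using only java.lang.Character shows the relationship. The class and method names are made up for illustration.

```java
// Hypothetical helper; only Character.toChars and Character.toCodePoint
// are real JDK calls.
public class SurrogateCheck {

    // Split a code point into its UTF-16 code units, as plain ints.
    static int[] toSurrogates(int codePoint) {
        char[] units = Character.toChars(codePoint);
        int[] out = new int[units.length];
        for (int i = 0; i < units.length; i++) {
            out[i] = units[i];
        }
        return out;
    }

    // Combine a high and a low surrogate back into one code point.
    static int fromSurrogates(int high, int low) {
        return Character.toCodePoint((char) high, (char) low);
    }

    public static void main(String[] args) {
        int[] units = toSurrogates(119967);
        System.out.println(units[0] + " " + units[1]);     // prints "55349 56479"
        System.out.println(fromSurrogates(55349, 56479));  // prints "119967"
    }
}
```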

The results were the same regardless of the parser implementation (Xerces or Saxon).

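Xalan’s handling of supplementary characters (those above U+FFFF, such as 𝒟) seems to be seriously flawed. &#55349;&#56479; is the UTF-16 surrogate pair for 𝒟 written out as two separate character references. Surrogate code points are not valid XML characters, so both the Xerces and Saxon parsers will throw SAXParseException when trying to parse documents that contain them.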
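
The parse failure is easy to reproduce. This is a minimal sketch that uses the JDK's default DocumentBuilderFactory rather than a specific Xerces or Saxon build, which is an assumption of the example. The class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical check; it uses whatever parser the JDK provides by default.
public class SurrogateReferenceCheck {

    // Returns true if the XML parses, false if the parser rejects it.
    static boolean parses(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Xalan-style output: two references to surrogate code points.
        System.out.println(parses("<foo>&#55349;&#56479;</foo>")); // prints "false"
        // A single reference to the real code point.
        System.out.println(parses("<foo>&#119967;</foo>"));        // prints "true"
    }
}
```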


3 Responses to “XSLT Unicode Horror”

  1. Gareth Barber

    Hey man

    What did you do to fix this? Any workarounds or solutions?

  2. Dragisa Krsmanovic

    The workaround is to use either Saxon or the internal JDK implementation. Xalan seems to be broken.

  3. Jason

    Thanks for posting this. You’ve put together a nice summary. I am having the same problem now.

