XSLT Unicode Horror

May 18th, 2010 — 10:50pm

Different Java XSLT implementations handle UTF-8 characters differently. Here is test code that parses UTF-8 XML into a DOM document and then serializes it using a transformer.

    // source, parserClass and transformerClass are supplied by the surrounding test loop
    System.out.println("    SOURCE:  " + source);
    // Parse the XML string into a DOM document with the chosen parser implementation
    DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance(parserClass, TestUnicode.class.getClassLoader());
    Document document = builderFactory.newDocumentBuilder().parse(new InputSource(new StringReader(source)));
    // Serialize the DOM back to a string with an identity transformer from the chosen factory
    TransformerFactory transformerFactory = TransformerFactory.newInstance(transformerClass, TestUnicode.class.getClassLoader());
    StringWriter writer = new StringWriter();
    transformerFactory.newTransformer().transform(new DOMSource(document), new StreamResult(writer));
    System.out.println("    RESULT:  " + writer.toString());
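For readers who want to try the round trip without Xalan or Saxon on the classpath, here is a minimal self-contained sketch. It is not the original test class: the `RoundTrip` class and `roundTrip` method are hypothetical names, and it assumes the JDK's default parser and transformer factories rather than the `parserClass` and `transformerClass` arguments used above. What it prints depends on the transformer your JDK ships.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class RoundTrip {
    // Parses the XML string into a DOM document, then serializes it back
    // with an identity transformer (hypothetical helper, JDK defaults only).
    static String roundTrip(String source) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(source)));
        StringWriter writer = new StringWriter();
        TransformerFactory.newInstance()
                .newTransformer()
                .transform(new DOMSource(document), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        // U+1D49F is written as Java escapes so the source file encoding does not matter.
        String literal = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>\uD835\uDC9F</foo>";
        String reference = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>&#119967;</foo>";
        System.out.println("    RESULT:  " + roundTrip(literal));
        System.out.println("    RESULT:  " + roundTrip(reference));
    }
}
```

To compare implementations as in the original test, the two `newInstance()` calls would be swapped for the two-argument overloads that take a factory class name and a class loader.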

I tested the following transformers:

  • Xalan 2.7.1:

    • org.apache.xalan.processor.TransformerFactoryImpl
    • org.apache.xalan.xsltc.trax.TransformerFactoryImpl
    • org.apache.xalan.xsltc.trax.SmartTransformerFactoryImpl
  • Sun-Xalan (an internal transformer factory present in Sun JDK 5 and 6):

    • com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl
  • Saxon 8.7:

    • net.sf.saxon.TransformerFactoryImpl

Here are the results for the Mathematical Script Capital D character (U+1D49F): 𝒟

    TRANSFORMER: org.apache.xalan.processor.TransformerFactoryImpl
        SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>𝒟</foo>
        RESULT:  <?xml version="1.0" encoding="UTF-8"?><foo>&#55349;&#56479;</foo>
        SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>&#119967;</foo>
        RESULT:  <?xml version="1.0" encoding="UTF-8"?><foo>&#55349;&#56479;</foo>
    TRANSFORMER: org.apache.xalan.xsltc.trax.TransformerFactoryImpl
        SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>𝒟</foo>
        RESULT:  <?xml version="1.0" encoding="UTF-8"?><foo>&#55349;&#56479;</foo>
        SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>&#119967;</foo>
        RESULT:  <?xml version="1.0" encoding="UTF-8"?><foo>&#55349;&#56479;</foo>
    TRANSFORMER: org.apache.xalan.xsltc.trax.SmartTransformerFactoryImpl
        SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>𝒟</foo>
        RESULT:  <?xml version="1.0" encoding="UTF-8"?><foo>&#55349;&#56479;</foo>
        SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>&#119967;</foo>
        RESULT:  <?xml version="1.0" encoding="UTF-8"?><foo>&#55349;&#56479;</foo>
    TRANSFORMER: com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl
        SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>𝒟</foo>
        RESULT:  <?xml version="1.0" encoding="UTF-8" standalone="no"?><foo>&#119967;</foo>
        SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>&#119967;</foo>
        RESULT:  <?xml version="1.0" encoding="UTF-8" standalone="no"?><foo>&#119967;</foo>
    TRANSFORMER: net.sf.saxon.TransformerFactoryImpl
        SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>𝒟</foo>
        RESULT:  <?xml version="1.0" encoding="UTF-8"?><foo>𝒟</foo>
        SOURCE:  <?xml version="1.0" encoding="UTF-8"?><foo>&#119967;</foo>
        RESULT:  <?xml version="1.0" encoding="UTF-8"?><foo>𝒟</foo>

Or, summarized in a table:

  Transformer                 Input: 𝒟            Input: &#119967;
  Xalan 2.7.1                 &#55349;&#56479;    &#55349;&#56479;
  Sun-Xalan (Sun JDK 1.5+)    &#119967;           &#119967;
  Saxon 8.7                   𝒟                   𝒟
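The two numbers in the Xalan output are not arbitrary: 55349 and 56479 are the UTF-16 surrogate pair (0xD835, 0xDC9F) for code point 119967 (U+1D49F). The sketch below shows the arithmetic; the `SurrogateDemo` class and `toSurrogates` method are hypothetical names used only for illustration.

```java
class SurrogateDemo {
    // Splits a supplementary code point (above U+FFFF) into its
    // UTF-16 high and low surrogate values.
    static int[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;      // 20 significant bits remain
        int high = 0xD800 + (offset >> 10);    // top 10 bits
        int low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
        return new int[] { high, low };
    }

    public static void main(String[] args) {
        int codePoint = 119967; // U+1D49F, Mathematical Script Capital D
        int[] pair = toSurrogates(codePoint);
        System.out.println(pair[0] + " " + pair[1]); // prints "55349 56479"

        // Cross-check against the JDK's own conversion.
        char[] chars = Character.toChars(codePoint);
        System.out.println((int) chars[0] + " " + (int) chars[1]); // prints "55349 56479"
    }
}
```

This suggests Xalan serializes each Java `char` of the string separately instead of combining the pair back into one code point before writing the reference.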

The results were the same regardless of the parser implementation (Xerces or Saxon).

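Xalan’s handling of UTF-8 multi-byte characters, at least of those outside the Basic Multilingual Plane such as this one, seems to be seriously flawed. It writes the character as two numeric references, &#55349; and &#56479;, one for each half of its UTF-16 surrogate pair. Surrogate code points are not legal XML characters, so both the Xerces and Saxon parsers throw a SAXParseException when trying to parse documents that contain them.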
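The rejection is easy to reproduce. The following sketch assumes the JDK's default parser (a bundled Xerces derivative) rather than the standalone Xerces and Saxon parsers used in the test; `SurrogateReferenceCheck` and `isRejected` are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class SurrogateReferenceCheck {
    // Returns true if the default JDK parser refuses the document
    // with a SAXParseException, false if it parses cleanly.
    static boolean isRejected(String xml) throws Exception {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return false;
        } catch (SAXParseException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        // Xalan's output: one character reference per surrogate half.
        System.out.println(isRejected("<foo>&#55349;&#56479;</foo>"));
        // A single reference to the real code point.
        System.out.println(isRejected("<foo>&#119967;</foo>"));
    }
}
```

With the default parser, the first document should be rejected and the second accepted, because the XML `Char` production excludes the surrogate range 0xD800 to 0xDFFF but includes 0x10000 to 0x10FFFF.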
