{"id":131,"date":"2010-05-18T22:50:36","date_gmt":"2010-05-19T05:50:36","guid":{"rendered":"http:\/\/www.dragishak.com\/?p=131"},"modified":"2021-02-09T10:23:51","modified_gmt":"2021-02-09T18:23:51","slug":"xslt-unicode-horror","status":"publish","type":"post","link":"https:\/\/www.dragishak.com\/?p=131","title":{"rendered":"XSLT Unicode Horror"},"content":{"rendered":"<p>Different Java XSLT implementations handle UTF-8 characters differently. Here is test code that parses UTF-8 XML into a DOM document and then serializes it using a transformer.<\/p>\n<pre class=\"brush: java; title: ; notranslate\" title=\"\">\nSystem.out.println(&quot;    SOURCE:  &quot; + source);\nDocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance(parserClass, TestUnicode.class.getClassLoader());\nDocument document = builderFactory.newDocumentBuilder().parse(new InputSource(new StringReader(source)));\nTransformerFactory transformerFactory = TransformerFactory.newInstance(transformerClass, TestUnicode.class.getClassLoader());\nStringWriter writer = new StringWriter();\ntransformerFactory.newTransformer().transform(new DOMSource(document), new StreamResult(writer));\nSystem.out.println(&quot;    RESULT:  &quot; + writer.toString());\n<\/pre>\n<p>I tested the following transformers:<\/p>\n<ul style=\"list-style: none;\">\n<li><a href=\"http:\/\/xml.apache.org\/xalan-j\/\">Xalan<\/a> 2.7.1:\n<ul style=\"list-style: none;\">\n<li style=\"font-size: 90%;\">org.apache.xalan.processor.TransformerFactoryImpl<\/li>\n<li style=\"font-size: 90%;\">org.apache.xalan.xsltc.trax.TransformerFactoryImpl<\/li>\n<li style=\"font-size: 90%;\">org.apache.xalan.xsltc.trax.SmartTransformerFactoryImpl<\/li>\n<\/ul>\n<\/li>\n<li><a href=\"http:\/\/java.sun.com\/webservices\/docs\/1.4\/jaxp\/ReleaseNotes.html#manual\">Sun-Xalan<\/a> (an internal transformer factory present in Sun JDK 5 and 6):\n<ul style=\"list-style: none;\">\n<li style=\"font-size: 
90%;\">com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl<\/li>\n<\/ul>\n<\/li>\n<li><a href=\"http:\/\/saxon.sourceforge.net\/\">Saxon<\/a> 8.7:\n<ul style=\"list-style: none;\">\n<li style=\"font-size: 90%;\">net.sf.saxon.TransformerFactoryImpl<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Here are the results for <a href=\"http:\/\/www.fileformat.info\/info\/unicode\/char\/1d49f\/index.htm\">Mathematical Script Capital D<\/a> character: <span style=\"font-size: 200%;\">\ud835\udc9f<\/span><\/p>\n<pre class=\"brush: xml; title: ; notranslate\" title=\"\">\nTRANSFORMER: org.apache.xalan.processor.TransformerFactoryImpl\nSOURCE:  &lt;!--?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?--&gt;\ud835\udc9f\nRESULT:  &lt;!--?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?--&gt;\ufffd\ufffd\nTRANSFORMER: org.apache.xalan.xsltc.trax.TransformerFactoryImpl\nSOURCE:  &lt;!--?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?--&gt;\ud835\udc9f\nRESULT:  &lt;!--?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?--&gt;\ufffd\ufffd\nTRANSFORMER: org.apache.xalan.xsltc.trax.SmartTransformerFactoryImpl\nSOURCE:  &lt;!--?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?--&gt;\ud835\udc9f\nRESULT:  &lt;!--?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?--&gt;\ufffd\ufffd\nTRANSFORMER: com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl\nSOURCE:  &lt;!--?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?--&gt;\ud835\udc9f\nRESULT:  &lt;!--?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; standalone=&quot;no&quot;?--&gt;\ud835\udc9f\nTRANSFORMER: net.sf.saxon.TransformerFactoryImpl\nSOURCE:  &lt;!--?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?--&gt;\ud835\udc9f\nRESULT:  &lt;!--?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?--&gt;\ud835\udc9f\n<\/pre>\n<p>Or, summarized in a table:<\/p>\n<table style=\"border: 1px solid black; border-collapse: collapse;\">\n<tbody>\n<tr>\n<th style=\"border: 
1px solid black;\">&nbsp;<\/th>\n<th style=\"border: 1px solid black; padding: 4px; text-align: center;\">\ud835\udc9f<\/th>\n<th style=\"border: 1px solid black; padding: 4px; text-align: center;\">&amp;#119967;<\/th>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid black; padding: 0 4px 0 4px; font-weight: bold;\">Xalan 2.7.1<\/td>\n<td style=\"border: 1px solid black; padding: 4px 6px 4px 6px; text-align: center;\">&amp;#55349;&amp;#56479;<\/td>\n<td style=\"border: 1px solid black; padding: 4px 6px 4px 6px; text-align: center;\">&amp;#55349;&amp;#56479;<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid black; padding: 0 4px 0 4px; font-weight: bold;\">Sun-Xalan (Sun JDK 1.5+)<\/td>\n<td style=\"border: 1px solid black; padding: 4px 6px 4px 6px; text-align: center;\">&amp;#119967;<\/td>\n<td style=\"border: 1px solid black; padding: 4px 6px 4px 6px; text-align: center;\">&amp;#119967;<\/td>\n<\/tr>\n<tr>\n<td style=\"border: 1px solid black; padding: 0 4px 0 4px; font-weight: bold;\">Saxon 8.7<\/td>\n<td style=\"border: 1px solid black; padding: 4px 6px 4px 6px; text-align: center;\">\ud835\udc9f<\/td>\n<td style=\"border: 1px solid black; padding: 4px 6px 4px 6px; text-align: center;\">\ud835\udc9f<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The results were the same regardless of the parser implementation, <a href=\"http:\/\/xerces.apache.org\/xerces2-j\/\">Xerces<\/a> or <a href=\"http:\/\/saxon.sourceforge.net\/\">Saxon<\/a>.<\/p>\n<p><a href=\"http:\/\/xml.apache.org\/xalan-j\/\">Xalan&#8217;s<\/a> handling of characters outside the Basic Multilingual Plane (four bytes in UTF-8) seems to be seriously flawed. 
<code>&amp;#55349;&amp;#56479;<\/code> are references to the UTF-16 surrogate halves U+D835 and U+DC9F, which are not valid XML characters, and both <a href=\"http:\/\/xerces.apache.org\/xerces2-j\/\">Xerces<\/a> and <a href=\"http:\/\/saxon.sourceforge.net\/\">Saxon<\/a> parsers throw a SAXParseException when trying to parse documents that contain them.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Different Java XSLT implementations handle UTF-8 characters differently. Here is test code that parses UTF-8 XML into a DOM document and then serializes it using a transformer. System.out.println(&quot; SOURCE: &quot; + source); DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance(parserClass, TestUnicode.class.getClassLoader()); Document document = builderFactory.newDocumentBuilder().parse(new InputSource(new StringReader(source))); TransformerFactory transformerFactory = TransformerFactory.newInstance(transformerClass, TestUnicode.class.getClassLoader()); StringWriter writer = new StringWriter(); transformerFactory.newTransformer().transform(new 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[15],"tags":[29,42,39,41,43,38,40,36,37],"class_list":["post-131","post","type-post","status-publish","format-standard","hentry","category-software","tag-java","tag-jdk","tag-saxon","tag-unicode","tag-utf-8","tag-xalan","tag-xerces","tag-xsl","tag-xslt"],"_links":{"self":[{"href":"https:\/\/www.dragishak.com\/index.php?rest_route=\/wp\/v2\/posts\/131","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.dragishak.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.dragishak.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.dragishak.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.dragishak.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=131"}],"version-history":[{"count":119,"href":"https:\/\/www.dragishak.com\/index.php?rest_route=\/wp\/v2\/posts\/131\/revisions"}],"predecessor-version":[{"id":424,"href":"https:\/\/www.dragishak.com\/index.php?rest_route=\/wp\/v2\/posts\/131\/revisions\/424"}],"wp:attachment":[{"href":"https:\/\/www.dragishak.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=131"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.dragishak.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=131"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.dragishak.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=131"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}