Archive for May 2010


Solr DisMax parser and stop words

May 25th, 2010 — 11:49pm

If you want to use the DisMax parser in Solr, you need to be careful about how you index the fields that DisMax will search.

If you mix fields that filter out stop words (plain text) with fields that do not (like author names), even simple queries can return no results.

By default, DisMax returns only results that contain all the words from your query string. If your query contains a stop word, as in “ants of madagascar”, the stop word “of” might not be found in any of the fields: it does not occur in author names and it is filtered out of the article body, so you get zero results.
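
To make the failure concrete, here is a toy model of the matching described above. It is not Solr code: the stop word list, the two analyzers, the class and method names, and the sample document are all assumptions made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy model of "every query word must match in some field"; not Solr's implementation. */
class StopWordMismatchSketch {
    // Hypothetical stop word list applied only by the body field's analyzer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("of", "the", "a"));

    /** Lowercases and splits on whitespace. */
    static String[] tokenize(String text) {
        return text.trim().toLowerCase(Locale.ROOT).split("\\s+");
    }

    /** Body analyzer: tokenizes and drops stop words. */
    static Set<String> analyzeBody(String text) {
        Set<String> terms = new HashSet<>(Arrays.asList(tokenize(text)));
        terms.removeAll(STOP_WORDS);
        return terms;
    }

    /** Author analyzer: tokenizes and keeps every word. */
    static Set<String> analyzeAuthor(String text) {
        return new HashSet<>(Arrays.asList(tokenize(text)));
    }

    /** Counts the query words found in at least one of the two fields. */
    static int matchedWords(String query, String body, String author) {
        Set<String> bodyTerms = analyzeBody(body);
        Set<String> authorTerms = analyzeAuthor(author);
        int matched = 0;
        for (String word : tokenize(query)) {
            // A stop word is never in bodyTerms, so it can only match via the author field.
            if (bodyTerms.contains(word) || authorTerms.contains(word)) {
                matched++;
            }
        }
        return matched;
    }

    /** True only if every query word matched somewhere (the "all words required" default). */
    static boolean matchesAll(String query, String body, String author) {
        return matchedWords(query, body, author) == tokenize(query).length;
    }

    public static void main(String[] args) {
        String body = "Ants of Madagascar are diverse";
        String author = "Jane Smith"; // hypothetical author
        System.out.println(matchesAll("ants of madagascar", body, author)); // false
        System.out.println(matchesAll("ants madagascar", body, author));    // true
    }
}
```

In this model the document clearly is about ants of Madagascar, yet the three-word query fails because “of” survives only in the query side of the author field, where nothing can match it.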

Possible workarounds:

  • Relax the Minimum Match (mm) requirement.
    Downside: Lowering mm will increase the number of results. An mm of 50% on “ants madagascar” will return all documents that contain “ants” and all documents that contain “madagascar”.
  • Do not filter out stop words.
    Downside: Your index can get large and you might get a large number of less relevant results.
  • Use other indexing schemes like N-Grams.
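
As a rough sketch of the first workaround, the snippet below builds a DisMax request URL with a relaxed mm. The parameter names defType, q, qf and mm are Solr request parameters; the host, the select path, and the field names body and author are assumptions for illustration only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Builds a DisMax query URL; host, path and field names are hypothetical. */
class DisMaxUrlSketch {
    static String buildUrl(String baseUrl, String query, String fields, String mm)
            throws UnsupportedEncodingException {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("defType", "dismax"); // select the DisMax parser
        params.put("q", query);          // raw user query
        params.put("qf", fields);        // fields DisMax searches
        params.put("mm", mm);            // minimum match requirement

        StringBuilder url = new StringBuilder(baseUrl).append('?');
        boolean first = true;
        for (Map.Entry<String, String> p : params.entrySet()) {
            if (!first) {
                url.append('&');
            }
            // URL-encode values so spaces and the percent sign survive transport.
            url.append(p.getKey()).append('=').append(URLEncoder.encode(p.getValue(), "UTF-8"));
            first = false;
        }
        return url.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(buildUrl("http://localhost:8983/solr/select",
                "ants of madagascar", "body author", "50%"));
    }
}
```

Note that the percent sign in the mm value has to be URL-encoded, which is easy to miss when typing the URL by hand.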

This article explains the details.

Also see this and this discussion.


XSLT Unicode Horror

May 18th, 2010 — 10:50pm

Different Java XSLT implementations handle UTF-8 characters differently. Here is test code that parses UTF-8 XML into a DOM document and then serializes it using a transformer.

// source is the XML text; parserClass and transformerClass name the factory classes under test.
System.out.println("    SOURCE:  " + source);
// Parse the XML string into a DOM document with the chosen parser.
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance(parserClass, TestUnicode.class.getClassLoader());
Document document = builderFactory.newDocumentBuilder().parse(new InputSource(new StringReader(source)));
// Serialize the DOM back to a string with an identity transformer from the chosen factory.
TransformerFactory transformerFactory = TransformerFactory.newInstance(transformerClass, TestUnicode.class.getClassLoader());
StringWriter writer = new StringWriter();
transformerFactory.newTransformer().transform(new DOMSource(document), new StreamResult(writer));
System.out.println("    RESULT:  " + writer.toString());

I tested the following transformers:

  • Xalan 2.7.1:
    • org.apache.xalan.processor.TransformerFactoryImpl
    • org.apache.xalan.xsltc.trax.TransformerFactoryImpl
    • org.apache.xalan.xsltc.trax.SmartTransformerFactoryImpl
  • Sun-Xalan (an internal transformer factory present in Sun JDK 5 and 6):
    • com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl
  • Saxon 8.7:
    • net.sf.saxon.TransformerFactoryImpl

Here are the results for the Mathematical Script Capital D character (U+1D49F): 𝒟

TRANSFORMER: org.apache.xalan.processor.TransformerFactoryImpl
SOURCE:  <?xml version="1.0" encoding="UTF-8"?>𝒟
RESULT:  <?xml version="1.0" encoding="UTF-8"?>&#55349;&#56479;
TRANSFORMER: org.apache.xalan.xsltc.trax.TransformerFactoryImpl
SOURCE:  <?xml version="1.0" encoding="UTF-8"?>𝒟
RESULT:  <?xml version="1.0" encoding="UTF-8"?>&#55349;&#56479;
TRANSFORMER: org.apache.xalan.xsltc.trax.SmartTransformerFactoryImpl
SOURCE:  <?xml version="1.0" encoding="UTF-8"?>𝒟
RESULT:  <?xml version="1.0" encoding="UTF-8"?>&#55349;&#56479;
TRANSFORMER: com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl
SOURCE:  <?xml version="1.0" encoding="UTF-8"?>𝒟
RESULT:  <?xml version="1.0" encoding="UTF-8" standalone="no"?>&#119967;
TRANSFORMER: net.sf.saxon.TransformerFactoryImpl
SOURCE:  <?xml version="1.0" encoding="UTF-8"?>𝒟
RESULT:  <?xml version="1.0" encoding="UTF-8"?>𝒟

Or, summarized in a table:

Transformer                 Input 𝒟             Input &#119967;
Xalan 2.7.1                 &#55349;&#56479;    &#55349;&#56479;
Sun-Xalan (Sun JDK 1.5+)    &#119967;           &#119967;
Saxon 8.7                   𝒟                   𝒟
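
The numbers in these results are related: 119967 is the code point of 𝒟, while 55349 and 56479 are the two UTF-16 code units (a surrogate pair) that Java uses internally to store it. A short sketch using only standard java.lang.Character methods shows the relationship; the class and method names are made up for this example.

```java
/** Shows how code point U+1D49F relates to its UTF-16 surrogate pair. */
class SurrogatePairSketch {
    /** Returns the UTF-16 code units of a code point as decimal values. */
    static int[] utf16Units(int codePoint) {
        char[] units = Character.toChars(codePoint);
        int[] values = new int[units.length];
        for (int i = 0; i < units.length; i++) {
            values[i] = units[i];
        }
        return values;
    }

    public static void main(String[] args) {
        int[] units = utf16Units(0x1D49F);
        System.out.println(units[0] + " " + units[1]); // 55349 56479

        // Combining the pair again yields the original code point.
        int codePoint = Character.toCodePoint((char) 55349, (char) 56479);
        System.out.println(codePoint); // 119967
    }
}
```

A serializer that writes a character reference per code unit instead of per code point would produce exactly the pair seen in the Xalan output.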

The results were the same regardless of the parser implementation (Xerces or Saxon).

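Xalan’s handling of characters outside the Basic Multilingual Plane seems to be seriously flawed. &#55349;&#56479; are the two UTF-16 surrogate halves of the character written as separate character references. Surrogate code points are not valid XML characters, and both the Xerces and Saxon parsers will throw SAXParseException when trying to parse documents that contain them.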
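
The rejection can be checked with the parser bundled in the JDK. This is a minimal sketch, assuming the default DocumentBuilderFactory (which I have not compared against the standalone Xerces or Saxon parsers tested above); the element name d and the class name are arbitrary.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

/** Checks whether the JDK's default XML parser accepts a given document. */
class SurrogateReferenceCheck {
    /** Returns true if the text parses, false if the parser throws SAXParseException. */
    static boolean parses(String xml) throws Exception {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            // The default error handler may also print the fatal error to stderr.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Single reference to the real code point.
        System.out.println(parses("<d>&#119967;</d>"));
        // Two references to the surrogate halves, as emitted by Xalan.
        System.out.println(parses("<d>&#55349;&#56479;</d>"));
    }
}
```

The first document should parse, while the second should be rejected, which means Xalan’s output cannot be read back by a conforming parser.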

