Skip to content

Commit

Permalink
Merge pull request #2150 from sparklemotion/2130-libxml2-html-strict-…
Browse files Browse the repository at this point in the history
…mode

HTML "strict" mode parsing

---

**What problem is this PR intended to solve?**

There are quite a few things being addressed in this PR, but the most significant bit is consistently obeying the `recover` parse option for HTML documents. As #2130 points out, the CRuby/libxml2 implementation was completely ignoring the `recover` parse option (also know as the `strict` option) for HTML documents. This PR makes the `recover` option behave identically for HTML documents as it does for XML documents.

Related, though, the JRuby implementation was incorrectly applying the `recover` parse option in HTML documents, instead using `noerror` and `nowarning` to determine whether to raise an exception. This has been brought in line with the behavior described above.

Also related, the `EncodingReader` class which detects encoding in HTML documents was silently swallowing document syntax errors if they occurred in the first "chunk" read from an IO object. This has also been fixed.

This PR also introduces some smaller changes:
- make JRuby and CRuby implementations consistent in how they handle comparing `XML::Node` to an `XML::Document` using the `<=>` operator
- introduce minitest-reporters to the main test suite
- skip some irrelevant tests, and restructure others to be faster
- fix the annoying JRuby encoding test that fails on some Java JDKs
- eating away at the edges of code formatting as I touched some of these rarely-touched files

**Have you included adequate test coverage?**

Yes.


**Does this change affect the behavior of either the C or the Java implementations?**

Yes, this changes the behavior of the `strict` (a.k.a. `norecover`) parse option in parsing HTML documents. I've chosen to classify this as a "bugfix that happens to change behavior that people may be depending upon" and so am introducing potentially breaking behavior into a minor release. Apologies to anyone who's inconvenienced by that.
  • Loading branch information
flavorjones authored Dec 29, 2020
2 parents 6035b0c + 771164d commit 519de9e
Show file tree
Hide file tree
Showing 15 changed files with 734 additions and 599 deletions.
20 changes: 16 additions & 4 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -106,15 +106,18 @@ See note below about CVE-2020-26247 in the "Changed" subsection entitled "XML::S

### Fixed

* HTML Parsing in "strict" mode (i.e., the `RECOVER` parse option not set) now correctly raises a `XML::SyntaxError` exception. Previously the value of the `RECOVER` bit was being ignored by CRuby and was misinterpreted by JRuby. [[#2130](https://github.com/sparklemotion/nokogiri/issues/2130)]
* The CSS `~=` operator now correctly handles non-space whitespace in the `class` attribute. commit e45dedd
* The switch to turn off the CSS-to-XPath cache is now thread-local, rather than being shared mutable state. [[#1935](https://github.com/sparklemotion/nokogiri/issues/1935)]
* The Node methods `add_previous_sibling`, `previous=`, `before`, `add_next_sibling`, `next=`, `after`, `replace`, and `swap` now correctly use their parent as the context node for parsing markup. These methods now also raise a `RuntimeError` if they are called on a node with no parent. [[nokogumbo#160](https://github.com/rubys/nokogumbo/issues/160)]
* [JRuby] XML::Schema XSD validation errors are captured in `XML::Schema#errors`. These errors were previously ignored.
* [JRuby] Standardize reading from IO like objects, including StringIO. [[#1888](https://github.com/sparklemotion/nokogiri/issues/1888), [#1897](https://github.com/sparklemotion/nokogiri/issues/1897)]
* [JRuby] Comparison of Node to Document with `Node#<=>` now matches CRuby/libxml2 behavior.
* [CRuby] Syntax errors are now correctly captured in `Document#errors` for short HTML documents. Previously the SAX parser used for encoding detection was clobbering libxml2's global error handler.
* [CRuby] Fixed installation on AIX with respect to `vasprintf`. [[#1908](https://github.com/sparklemotion/nokogiri/issues/1908)]
* [CRuby] On some platforms, avoid symbol name collision with glibc's `canonicalize`. [[#2105](https://github.com/sparklemotion/nokogiri/issues/2105)]
* [CRuby] Fixed Nokogumbo integration which broke in the v1.11.0 release candidates. [[#1788](https://github.com/sparklemotion/nokogiri/issues/1788)] (Thanks, [@stevecheckoway](https://github.com/stevecheckoway)!)
* [Windows Visual C++] Fixed compiler warnings and errors. [[#2061](https://github.com/sparklemotion/nokogiri/issues/2061), [#2068](https://github.com/sparklemotion/nokogiri/issues/2068)]
* [JRuby] Standardize reading from IO like objects, including StringIO. [[#1888](https://github.com/sparklemotion/nokogiri/issues/1888), [#1897](https://github.com/sparklemotion/nokogiri/issues/1897)]
* [JRuby] XML::Schema XSD validation errors are captured in `XML::Schema#errors`. These errors were previously ignored.
* [CRuby] Fixed Nokogumbo integration which broke in the v1.11.0 release candidates. [[#1788](https://github.com/sparklemotion/nokogiri/issues/1788)] (Thanks, [@stevecheckoway](https://github.com/stevecheckoway)!)
* [JRuby] Fixed document encoding regression in v1.11.0 release candidates. [[#2080](https://github.com/sparklemotion/nokogiri/issues/2080), [#2083](https://github.com/sparklemotion/nokogiri/issues/2083)] (Thanks, [@thbar](https://github.com/thbar)!)


Expand All @@ -139,6 +142,15 @@ Please note that this security fix was pushed into a new minor version, 1.11.x,
More information and instructions for enabling "trusted input" behavior in v1.11.0.rc4 and later is available at the [public advisory](https://github.com/sparklemotion/nokogiri/security/advisories/GHSA-vr8q-g5c7-m54m).


#### HTML parser now obeys the `strict` or `norecover` parsing option

(Also noted above in the "Fixed" section) HTML Parsing in "strict" mode (i.e., the `RECOVER` parse option not set) now correctly raises a `XML::SyntaxError` exception. Previously the value of the `RECOVER` bit was being ignored by CRuby and was misinterpreted by JRuby.

If you're using the default parser options, you will be unaffected by this fix. If you're passing `strict` or `norecover` to your HTML parser call, you may be surprised to see that the parser now fails to recover and raises a `XML::SyntaxError` exception. Given the number of HTML documents on the internet that libxml2 would consider to be ill-formed, this is probably not what you want, and you can omit setting that parse option to restore the behavior that you have been relying upon.

Apologies to anyone inconvenienced by this breaking bugfix being present in a minor release, but I felt it was appropriate to introduce this fix because it's straightforward to fix any code that has been relying on this buggy behavior.
#### `VersionInfo`, the output of `nokogiri -v`, and related constants
This release changes the metadata provided in `Nokogiri::VersionInfo` which also affects the output of `nokogiri -v`. Some related constants have also been changed. If you're using `VersionInfo` programmatically, or relying on constants related to underlying library versions, please read the detailed changes for `Nokogiri::VersionInfo` at [#2139](https://github.com/sparklemotion/nokogiri/issues/2139) and accept our apologies for the inconvenience.
Expand Down Expand Up @@ -300,7 +312,7 @@ This CVE's public notice is [#1915](https://github.com/sparklemotion/nokogiri/is
* [MRI] Address a memory leak when unparenting a DTD. [[#1784](https://github.com/sparklemotion/nokogiri/issues/1784)] (Thanks, [@stevecheckoway](https://github.com/stevecheckoway)!)
* [MRI] Use RbConfig::CONFIG instead of ::MAKEFILE_CONFIG to fix installations that use Makefile macros. [[#1820](https://github.com/sparklemotion/nokogiri/issues/1820)] (Thanks, [@nobu](https://github.com/nobu)!)
* [JRuby] Decrease large memory usage when making nested XPath queries. [[#1749](https://github.com/sparklemotion/nokogiri/issues/1749)]
* [JRuby] Fix how custom XPath function namespaces are inferred to be less naive. [#1890, #2148]
* [JRuby] Fix how custom XPath function namespaces are inferred to be less naive. [[#1890](https://github.com/sparklemotion/nokogiri/issues/1890), [#2148](https://github.com/sparklemotion/nokogiri/issues/2148)]
* [JRuby] Clarify exception message when custom XPath functions can't be resolved.
* [JRuby] Fix failing tests on JRuby 9.2.x
* [JRuby] Fix default namespaces in nodes reparented into a different document [[#1774](https://github.com/sparklemotion/nokogiri/issues/1774)]
Expand Down
8 changes: 7 additions & 1 deletion C_CODING_STYLE.rdoc
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,13 @@ feel free to update whitespace in the surrounding code.

This style can be automatically applied by running:

astyle --indent=spaces=2 --style=1tbs --keep-one-line-blocks $(ack -f --type=cpp --type=cc ext/nokogiri)
astyle --indent=spaces=2 \
--style=1tbs --keep-one-line-blocks \
--unpad-paren --pad-header --pad-oper --pad-comma \
--align-pointer=name \
--break-return-type --attach-return-type-decl \
--max-code-length=120 \
$(ack -f --type=cpp --type=cc ext/nokogiri)


= FUNCTION DECLARATION:
Expand Down
2 changes: 1 addition & 1 deletion ext/java/nokogiri/XmlNode.java
Original file line number Diff line number Diff line change
Expand Up @@ -720,7 +720,7 @@ public IRubyObject compare(ThreadContext context, IRubyObject other) {
// Do not touch this if, if it's not for a good reason.
if (node.getNodeType() == Node.DOCUMENT_NODE ||
otherNode.getNodeType() == Node.DOCUMENT_NODE) {
return context.runtime.newFixnum(-1);
return context.runtime.newFixnum(1);
}

try{
Expand Down
85 changes: 48 additions & 37 deletions ext/java/nokogiri/internals/HtmlDomParserContext.java
Original file line number Diff line number Diff line change
Expand Up @@ -35,9 +35,11 @@
import static nokogiri.internals.NokogiriHelpers.getNokogiriClass;
import static nokogiri.internals.NokogiriHelpers.isNamespace;
import static nokogiri.internals.NokogiriHelpers.stringOrNil;

import nokogiri.HtmlDocument;
import nokogiri.NokogiriService;
import nokogiri.XmlDocument;
import nokogiri.XmlSyntaxError;

import org.apache.xerces.xni.Augmentations;
import org.apache.xerces.xni.QName;
Expand Down Expand Up @@ -65,7 +67,7 @@
*/
public class HtmlDomParserContext extends XmlDomParserContext {

public HtmlDomParserContext(Ruby runtime, IRubyObject options) {
public HtmlDomParserContext(Ruby runtime, IRubyObject options) {
this(runtime, runtime.getNil(), options);
}

Expand All @@ -74,15 +76,6 @@ public HtmlDomParserContext(Ruby runtime, IRubyObject encoding, IRubyObject opti
java_encoding = NokogiriHelpers.getValidEncoding(encoding);
}

@Override
protected void initErrorHandler() {
if (options.strict) {
errorHandler = new NokogiriStrictErrorHandler(options.noError, options.noWarning);
} else {
errorHandler = new NokogiriNonStrictErrorHandler4NekoHtml(options.noError, options.noWarning);
}
}

@Override
protected void initParser(Ruby runtime) {
XMLParserConfiguration config = new HTMLConfiguration();
Expand All @@ -106,7 +99,7 @@ protected void initParser(Ruby runtime) {

@Override
public void setEncoding(String encoding) {
super.setEncoding(encoding);
super.setEncoding(encoding);
}

/**
Expand All @@ -118,6 +111,24 @@ public void enableDocumentFragment() {
setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment", true);
}

@Override
public XmlDocument parse(ThreadContext context, RubyClass klass, IRubyObject url) {
XmlDocument xmlDoc = super.parse(context, klass, url);

// let's be consistent in how we handle RECOVER and NORECOVER (a.k.a. STRICT)
// https://github.com/sparklemotion/nokogiri/issues/2130
if (!options.recover && errorHandler.getErrors().size() > 0) {
XmlSyntaxError xmlSyntaxError = XmlSyntaxError.createXMLSyntaxError(context.runtime);
String exceptionMsg = String.format("%s: '%s'",
"Parser without recover option encountered error or warning",
errorHandler.getErrors().get(0));
xmlSyntaxError.setException(new Exception(exceptionMsg));
throw xmlSyntaxError.toThrowable();
}

return xmlDoc;
}

@Override
protected XmlDocument wrapDocument(ThreadContext context, RubyClass klass, Document document) {
HtmlDocument htmlDocument = new HtmlDocument(context.runtime, klass, document);
Expand Down Expand Up @@ -190,32 +201,32 @@ private ElementValidityCheckFilter(NokogiriErrorHandler errorHandler) {

// element names from xhtml1-strict.dtd
private static String[][] element_names = {
{"a", "abbr", "acronym", "address", "area"},
{"b", "base", "basefont", "bdo", "big", "blockquote", "body", "br", "button"},
{"caption", "cite", "code", "col", "colgroup"},
{"dd", "del", "dfn", "div", "dl", "dt"},
{"em"},
{"fieldset", "font", "form", "frame", "frameset"},
{}, // g
{"h1", "h2", "h3", "h4", "h5", "h6", "head", "hr", "html"},
{"i", "iframe", "img", "input", "ins"},
{}, // j
{"kbd"},
{"label", "legend", "li", "link"},
{"map", "meta"},
{"noframes", "noscript"},
{"object", "ol", "optgroup", "option"},
{"p", "param", "pre"},
{"q"},
{}, // r
{"s", "samp", "script", "select", "small", "span", "strike", "strong", "style", "sub", "sup"},
{"table", "tbody", "td", "textarea", "tfoot", "th", "thead", "title", "tr", "tt"},
{"u", "ul"},
{"var"},
{}, // w
{}, // x
{}, // y
{} // z
{"a", "abbr", "acronym", "address", "area"},
{"b", "base", "basefont", "bdo", "big", "blockquote", "body", "br", "button"},
{"caption", "cite", "code", "col", "colgroup"},
{"dd", "del", "dfn", "div", "dl", "dt"},
{"em"},
{"fieldset", "font", "form", "frame", "frameset"},
{}, // g
{"h1", "h2", "h3", "h4", "h5", "h6", "head", "hr", "html"},
{"i", "iframe", "img", "input", "ins"},
{}, // j
{"kbd"},
{"label", "legend", "li", "link"},
{"map", "meta"},
{"noframes", "noscript"},
{"object", "ol", "optgroup", "option"},
{"p", "param", "pre"},
{"q"},
{}, // r
{"s", "samp", "script", "select", "small", "span", "strike", "strong", "style", "sub", "sup"},
{"table", "tbody", "td", "textarea", "tfoot", "th", "thead", "title", "tr", "tt"},
{"u", "ul"},
{"var"},
{}, // w
{}, // x
{}, // y
{} // z
};

private static boolean isValid(final String name) {
Expand Down
Loading

0 comments on commit 519de9e

Please sign in to comment.