path: root/HTMLparser.c
Commit message | Author | Age | Files | Lines
* error: Limit number of parser errorsNick Wellnhofer2022-12-271-0/+5
| Reporting errors is expensive and some abusive test cases can generate an error for each invalid input byte. This causes the parser to spend most of its time on error handling. Limit the number of errors and warnings to 100.
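The cap described above can be sketched as a counter checked before each report. This is only an illustration: `DemoCtxt`, `demoReportError` and `DEMO_MAX_ERRORS` are hypothetical names, not libxml2's actual fields or functions; only the limit of 100 comes from the commit message.

```c
#include <assert.h>
#include <stdio.h>

#define DEMO_MAX_ERRORS 100 /* limit taken from the commit message */

/* Hypothetical stand-in for a parser context; not libxml2's struct. */
typedef struct {
    int nbErrors;  /* errors actually reported */
    int nbSkipped; /* errors suppressed after the limit was hit */
} DemoCtxt;

/* Returns 1 if the error was reported, 0 if it was suppressed. */
static int demoReportError(DemoCtxt *ctxt, const char *msg) {
    assert(ctxt != NULL);
    if (ctxt->nbErrors >= DEMO_MAX_ERRORS) {
        /* Cheap path: skip formatting and callbacks entirely. */
        ctxt->nbSkipped++;
        return 0;
    }
    ctxt->nbErrors++;
    fprintf(stderr, "error: %s\n", msg);
    return 1;
}
```

With this shape, input that triggers one error per byte pays the full reporting cost at most 100 times.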
* Remove hacky heuristic from b2dc5675e94aa6b5557ba63f7d66b0f08dd17e4dAlex Richardson2022-12-011-0/+1
| Checking whether the context is close to the parent context by hardcoding 250 is not portable (I noticed tests were failing on Morello since the value is 288 there due to pointers being 128 bits). Instead we should ensure that the XML_VCTXT_USE_PCTXT flag is not set in cases where the user data is not actually a parser context (or ideally add a separate field, but that would be an ABI break).
|
| From what I can see in the source, XML_VCTXT_USE_PCTXT is only set if the userData field points to a valid context, and if this is not the case the flag should be cleared when changing userData rather than relying on the offset between the two. Looking at the history, I think d7cb33cf44aa688f24215c9cd398c1a26f0d25ff fixed most of the need for this workaround, but it looks like there are a few more locations that need updating. This commit changes two more places to set/clear/copy the XML_VCTXT_USE_PCTXT flag, so this heuristic should not be needed anymore.
|
| I've also dropped two = NULL assignments in xmllint since they are not needed after a call to memset(). There was also an uninitialized vctxt.flags (and other fields) in `xmlShellValidate()`, which I've fixed by adding a memset() call.
* Avoid creating an out-of-bounds pointer by rewriting a checkAlex Richardson2022-12-011-1/+1
| Creating a pointer beyond one past the end of an object is undefined behaviour in C. While this code is unlikely to be miscompiled, UBSan on a CHERI-enabled system reported that an out-of-bounds pointer was being created.
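The commit message does not show the check that was rewritten, so the following is only a generic sketch of the pattern, with a hypothetical helper `demoHasBytes`. The idea is to compare lengths rather than form `cur + need`, which may point further than one past the end of the buffer.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns 1 if at least `need` bytes remain in [cur, end), else 0.
 *
 * The tempting form `cur + need <= end` computes a pointer that can lie
 * beyond one past the end of the buffer, which is undefined behaviour even
 * if the pointer is never dereferenced. Subtracting two pointers into the
 * same buffer is always well defined.
 */
static int demoHasBytes(const unsigned char *cur, const unsigned char *end,
                        size_t need) {
    assert(cur <= end);
    return need <= (size_t) (end - cur);
}
```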
* html: Improve parsing of nested listsNick Wellnhofer2022-11-301-2/+0
| Allow ul/ol as immediate children of ul/ol. This is more in line with the HTML5 spec.
|
| Fixes #447.
* html: Fix htmlInitAutoClose documentationNick Wellnhofer2022-11-271-4/+1
|
* html: Fix check for end of comment in push parserNick Wellnhofer2022-11-201-6/+14
| Make sure to reset checkIndex. Handle case where "--" or "--!" is at the end of the buffer. Fix "avail" check in htmlParseOrTryFinish.
* parser: Rewrite push parser boundary checksNick Wellnhofer2022-11-201-51/+16
| Remove inaccurate xmlParseCheckTransition check. Remove non-incremental xmlParseGetLasts check.
|
| Add functions that check for several boundary constructs more accurately, keeping track of progress in ctxt->checkIndex.
|
| Fixes #439.
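The role of a saved scan position such as ctxt->checkIndex in the two push parser commits above can be illustrated with a small sketch. `DemoPushState` and `demoLookupCommentEnd` are hypothetical, and the real lookup functions handle more cases (for example "--!>"). The sketch only shows resuming a search where the previous one stopped, keeping a trailing "--" for the next call, and resetting the index on success.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical push-parser state; only the saved scan offset matters here. */
typedef struct {
    size_t checkIndex; /* offset where the previous scan stopped */
} DemoPushState;

/*
 * Looks for "-->" in buf[0..len). Returns the offset of the match, or -1 if
 * more data is needed. Each call resumes at checkIndex, so the total work
 * stays linear as the buffer grows chunk by chunk.
 */
static long demoLookupCommentEnd(DemoPushState *st, const char *buf,
                                 size_t len) {
    size_t i = st->checkIndex;

    assert(i <= len);
    for (; i + 3 <= len; i++) {
        if (memcmp(buf + i, "-->", 3) == 0) {
            st->checkIndex = 0; /* reset for the next construct */
            return (long) i;
        }
    }
    /*
     * No match yet. The last two bytes may be the start of "-->", so the
     * next scan must begin at them rather than at the end of the buffer.
     */
    st->checkIndex = i;
    return -1;
}
```

Forgetting the reset would make the next construct start scanning at a stale offset, which is the kind of bug the first of the two commits mentions.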
* Remove or annotate char castsNick Wellnhofer2022-09-011-2/+2
|
* Don't use sizeof(xmlChar) or sizeof(char)Nick Wellnhofer2022-09-011-7/+7
|
* Remove explicit integer castsNick Wellnhofer2022-09-011-10/+5
| Remove explicit integer casts as final operation
|
| - in assignments
| - when passing arguments
| - when returning values
|
| Remove casts
|
| - to the same type
| - from certain range-bound values
|
| The main motivation is that these explicit casts don't change the result of operations and only render UBSan's implicit-conversion checks useless. Removing these casts allows UBSan to detect cases where truncation or sign-changes occur unexpectedly.
|
| Document some explicit casts as truncating and add a few missing ones.
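The motivation can be illustrated with three hypothetical helpers (none of them from libxml2). All three return the same value for in-range input. The difference is whether Clang's `-fsanitize=implicit-conversion` checks get a chance to see a narrowing conversion.

```c
#include <assert.h>
#include <limits.h>

/*
 * Explicit cast as the final operation: same result as the implicit
 * conversion, but the sanitizer treats the truncation as intentional
 * and stays silent.
 */
static int demoNarrowExplicit(unsigned long n) {
    return (int) n;
}

/*
 * No cast: the conversion is implicit, so an implicit-conversion check
 * can report a value that does not fit into int.
 */
static int demoNarrowImplicit(unsigned long n) {
    return n;
}

/*
 * A cast that is kept on purpose, with the range assumption stated
 * next to it.
 */
static int demoNarrowChecked(unsigned long n) {
    assert(n <= (unsigned long) INT_MAX);
    return (int) n; /* value is known to fit, nothing is truncated */
}
```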
* Make xmlNewSAXParserCtxt take a const sax handlerNick Wellnhofer2022-09-011-3/+5
| Also improve documentation.
* Consolidate private header filesNick Wellnhofer2022-08-261-2/+6
| Private functions were previously declared
|
| - in header files in the root directory
| - in public headers guarded with IN_LIBXML
| - in libxml.h
| - redundantly in source files that used them.
|
| Consolidate all private header files in include/private.
* Deprecate internal parser functionsNick Wellnhofer2022-08-251-0/+6
|
* Deprecate old HTML SAX APINick Wellnhofer2022-08-251-0/+4
|
* Introduce xmlNewSAXParserCtxt and htmlNewSAXParserCtxtNick Wellnhofer2022-08-241-27/+33
| Add API functions to create a parser context with a custom SAX handler without having to mess with ctxt->sax manually.
* Don't mess with parser options in htmlParseDocumentNick Wellnhofer2022-08-241-2/+1
| Don't set ctxt->html. This member should already be initialized.
|
| Set ctxt->linenumbers in htmlCtxtUseOptions like the XML parser does.
* Remove useless call to htmlDefaultSAXHandlerInitNick Wellnhofer2022-08-241-2/+0
| This function is already called from xmlInitParser.
* Remove htmlDefaultSAXHandler from non-SAX1 buildNick Wellnhofer2022-08-221-0/+2
| This matches long-standing behavior of the XML counterpart.
* Don't initialize SAX handler in htmlReadMemoryNick Wellnhofer2022-08-221-3/+0
| The SAX handler is already initialized when creating the parser context.
* Fix htmlReadMemory mixing up XML and HTML functionsNick Wellnhofer2022-08-221-1/+1
| Also see fe6890e2.
* Don't use default SAX handler to report unrelated errorsNick Wellnhofer2022-08-221-5/+0
|
* Fix HTML parser with threads and --without-legacyNick Wellnhofer2022-08-221-7/+4
| If the legacy functions are disabled, the default "V1" HTML SAX handler isn't initialized in threads other than the main thread. htmlInitParserCtxt would later use the empty V1 SAX handler, resulting in NULL documents.
|
| Change htmlInitParserCtxt to initialize the HTML SAX handler by calling xmlSAX2InitHtmlDefaultSAXHandler. This removes the ability to change the default handler but is more in line with the XML parser, which initializes the SAX handler by calling xmlSAXVersion, ignoring the V1 default handler.
|
| Fixes #399.
* Use xmlStrlen in *CtxtReadDocNick Wellnhofer2022-08-201-5/+2
| xmlStrlen handles buffers larger than INT_MAX more gracefully.
* Fix xmlCtxtReadDoc with encodingNick Wellnhofer2022-08-201-13/+4
| xmlCtxtReadDoc used to create an input stream involving xmlNewStringInputStream. This would create a stream without an input buffer, causing problems with encodings (see #34). After commit aab584dc3, an error was returned even with UTF-8 encodings which happened to work before.
|
| Make xmlCtxtReadDoc call xmlCtxtReadMemory which doesn't suffer from these issues. Also fix htmlCtxtReadDoc.
|
| Fixes #397.
* Skip incorrectly opened HTML commentsNick Wellnhofer2022-08-021-60/+85
| Commit 4fd69f3e fixed handling of '<' characters not followed by an ASCII letter. But a '<!' sequence followed by invalid characters should be treated as a bogus comment and skipped.
|
| Fixes #380.
* Reduce indentation in HTMLparser.cNick Wellnhofer2022-08-021-199/+197
| No functional change.
* Also reset nsNr in htmlCtxtResetNick Wellnhofer2022-07-281-0/+2
|
* Prevent integer-overflow in htmlSkipBlankChars() and xmlSkipBlankChars()David Kilzer2022-04-111-1/+2
| * HTMLparser.c:
| (htmlSkipBlankChars):
| * parser.c:
| (xmlSkipBlankChars):
| - Cap the return value at INT_MAX.
| - The commit range that OSS-Fuzz listed for the fix didn't make any changes to xmlSkipBlankChars(), so it seems like this issue may still exist.
|
| Found by OSS-Fuzz Issue 44803.
* Deprecate module init and cleanup functionsNick Wellnhofer2022-03-061-0/+3
| These functions shouldn't be part of the public API. Most init functions are only thread-safe when called from xmlInitParser. Global variables should only be cleaned up by calling xmlCleanupParser.
* Remove unneeded #includesNick Wellnhofer2022-03-041-13/+0
|
* htmlParseComment: handle abruptly-closed commentsMike Dalessio2022-03-021-0/+11
| See guidance provided on abruptly-closed comments here:
|
| https://html.spec.whatwg.org/multipage/parsing.html#parse-error-abrupt-closing-of-empty-comment
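In the HTML specification linked above, an empty comment is abruptly closed when "<!--" is followed directly by ">" or "->". A sketch of that check, using a hypothetical helper `demoAbruptCommentClose` that is not part of libxml2:

```c
#include <assert.h>
#include <stddef.h>

/*
 * `s` points just past "<!--" in a NUL-terminated buffer. Returns the
 * number of bytes that close the comment abruptly (1 for ">", 2 for
 * "->"), or 0 if the comment has to be parsed normally.
 */
static int demoAbruptCommentClose(const char *s) {
    assert(s != NULL);
    if (s[0] == '>')
        return 1; /* "<!-->" */
    if (s[0] == '-' && s[1] == '>')
        return 2; /* "<!--->" */
    return 0;
}
```

In both abrupt cases the result is an empty comment, and parsing continues after the ">".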
* Don't check for standard C89 headersNick Wellnhofer2022-03-021-4/+1
| Don't check for
|
| - ctype.h
| - errno.h
| - float.h
| - limits.h
| - math.h
| - signal.h
| - stdarg.h
| - stdlib.h
| - string.h
| - time.h
|
| Stop including non-standard headers
|
| - malloc.h
| - strings.h
* Fix recovery from invalid HTML start tagsNick Wellnhofer2022-02-221-23/+21
| Only try to parse a start tag if there's a '<' followed by an ASCII letter. This is more in line with HTML5 and the old behavior in recovery mode.
|
| Emit a literal '<' if the following character is invalid.
|
| Fixes #101.
| Fixes #339.
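The rule stated above is easy to express as a predicate. `demoStartsTag` is a hypothetical helper for illustration only. It covers start tags, not end tags or "<!" constructs.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns 1 if `s` (NUL-terminated) begins with '<' followed by an ASCII
 * letter, which is the condition for attempting to parse a start tag.
 * Otherwise the '<' would be emitted as literal text.
 */
static int demoStartsTag(const char *s) {
    assert(s != NULL);
    if (s[0] != '<')
        return 0;
    return (s[1] >= 'a' && s[1] <= 'z') || (s[1] >= 'A' && s[1] <= 'Z');
}
```

For input such as "a < 3", the predicate is false at the '<', so the character stays part of the text content.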
* Remove elfgcchack.hNick Wellnhofer2022-02-201-2/+0
| The same optimization can be enabled with -fno-semantic-interposition since GCC 5. clang has always used this option by default.
* Rework validation context flagsNick Wellnhofer2022-02-201-1/+1
| Use a bitmask instead of magic values to
|
| - keep track whether the validation context is part of a parser context
| - keep track whether xmlValidateDtdFinal was called
|
| This allows adding additional flags later.
|
| Note that this deliberately changes the name of a public struct member, assuming that this was always private data never to be used by client code.
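A bitmask of this kind might look like the sketch below. The `DEMO_` constants, `DemoValidCtxt` and `demoSetUserData` are hypothetical and do not reproduce libxml2's definitions. The point is that independent facts get independent bits, and that a bit describing userData is updated whenever userData changes.

```c
#include <assert.h>
#include <stddef.h>

/* One bit per fact, so further flags can be added without new fields. */
#define DEMO_VCTXT_DTD_VALIDATED (1u << 0) /* final DTD validation ran */
#define DEMO_VCTXT_USE_PCTXT     (1u << 1) /* userData is a parser context */

/* Hypothetical validation context; not libxml2's struct layout. */
typedef struct {
    void *userData;
    unsigned int flags;
} DemoValidCtxt;

/*
 * Changes userData and keeps the USE_PCTXT bit in sync with it, leaving
 * all other bits untouched.
 */
static void demoSetUserData(DemoValidCtxt *v, void *data, int isParserCtxt) {
    assert(v != NULL);
    v->userData = data;
    if (isParserCtxt)
        v->flags |= DEMO_VCTXT_USE_PCTXT;
    else
        v->flags &= ~DEMO_VCTXT_USE_PCTXT;
}
```

Keeping the bit in sync at every assignment removes the need to guess from pointer offsets whether userData is a parser context.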
* Also register HTML document nodesNick Wellnhofer2022-02-011-0/+2
| Fixes #196.
* Fix htmlReadFd, which was using a mix of xml and html context functionsFinn Barber2022-01-161-5/+7
|
* Fix memory leak in xmlFreeParserInputBufferDavid King2022-01-161-0/+1
| Found by Coverity.
|
| https://bugzilla.redhat.com/show_bug.cgi?id=1938806
* Different approach to fix quadratic behavior in HTML push parserNick Wellnhofer2022-01-101-1/+13
| The old approach introduced a regression, see issue #312 and the previous commit.
|
| Disable code that tries to recover from invalid start tags. This only affects "recovery" mode. Add a comment outlining a better fix in accordance with the HTML5 spec.
* Fix regression when parsing invalid HTML tags in push modeNick Wellnhofer2022-01-101-24/+4
| Revert part of commit 173a0830 that changed behavior when parsing malformed start tags with the push parser. This reintroduces quadratic behavior in recovery mode which will be worked around in the next commit.
|
| Fixes #312.
* Fix regression parsing public IDs literals in HTMLNick Wellnhofer2022-01-101-1/+1
| Fix regression introduced when reworking htmlParsePubidLiteral in commit 93ce33c2.
|
| Fixes #318.
* Fix htmlTagLookupNick Wellnhofer2021-05-061-2/+2
| Fix regression introduced with b25acce8. Some users like libxslt may call the HTML output functions on documents with uppercase tag names, so we must keep case-insensitive string comparison.
|
| Fixes #248.
* Fix duplicate xmlStrEqual calls in htmlParseEndTagNick Wellnhofer2021-03-041-6/+4
|
* Speed up htmlCheckAutoCloseNick Wellnhofer2021-03-041-136/+280
| Switch to binary search.
* Speed up htmlTagLookupNick Wellnhofer2021-03-041-7/+13
| Switch to binary search.
|
| This is the first time bsearch is used in the libxml2 code base. But it's a standard library function since C89 and should be portable.
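A lookup of this kind with the standard `bsearch` could be sketched as follows. The table, `DemoElemDesc` and the helper names are hypothetical and far smaller than the real element table. The comparison is case-insensitive because a later fix (listed further up this log) notes that callers may pass uppercase tag names.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical element descriptor; the real table has many more fields. */
typedef struct {
    const char *name;
    int isInline;
} DemoElemDesc;

/* bsearch requires the table to be sorted by the comparison key. */
static const DemoElemDesc demoTags[] = {
    {"a", 1}, {"body", 0}, {"div", 0}, {"li", 0},
    {"p", 0}, {"span", 1}, {"ul", 0}
};

/* Portable ASCII case-insensitive comparison (strcasecmp is not C89). */
static int demoCaseCmp(const char *a, const char *b) {
    for (;; a++, b++) {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);
        if (ca != cb || ca == 0)
            return ca - cb;
    }
}

/* bsearch callback: key is the tag name, member is a table entry. */
static int demoCompareTag(const void *key, const void *member) {
    assert(key != NULL && member != NULL);
    return demoCaseCmp((const char *) key,
                       ((const DemoElemDesc *) member)->name);
}

/* Returns the descriptor for `tag`, or NULL if the tag is unknown. */
static const DemoElemDesc *demoTagLookup(const char *tag) {
    if (tag == NULL)
        return NULL;
    return bsearch(tag, demoTags, sizeof(demoTags) / sizeof(demoTags[0]),
                   sizeof(demoTags[0]), demoCompareTag);
}
```

Compared with a linear scan, the number of string comparisons per lookup drops from proportional to the table size to its logarithm.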
* Revert "Improve HTML fuzzer stability"Nick Wellnhofer2021-02-221-4/+0
| This reverts commit de1b51eddcc17fd7ed1bbcc6d5d7d529407dfbe2.
* Improve HTML fuzzer stabilityNick Wellnhofer2021-02-221-0/+4
| Call htmlInitAutoClose during fuzzer initialization to fix stability issue. Leave a note concerning problems with this function.
* Fix slow parsing of HTML with encoding errorsNick Wellnhofer2021-02-201-2/+16
| Under certain circumstances, the HTML parser would try to guess and switch input encodings multiple times, leading to slow processing of documents with encoding errors. The repeated scanning of the input buffer when guessing encodings could even lead to quadratic behavior.
|
| The code in htmlCurrentChar probably assumed that if there's an encoding handler, it is guaranteed to produce valid UTF-8. This holds true in general, but if the detected encoding was "UTF-8", the UTF8ToUTF8 encoding handler simply invoked memcpy without checking for invalid UTF-8. This still must be fixed, preferably by not using this handler at all.
|
| Also leave a note that switching encodings twice seems impossible to implement correctly. Add a check when handling UTF-8 encoding errors in htmlCurrentChar to avoid this situation, even if encoders produce invalid UTF-8.
|
| Found by OSS-Fuzz.
* Fix infinite loop in HTML parser introduced with recent commitsNick Wellnhofer2021-02-071-1/+2
| Check for XML_PARSER_EOF to avoid an infinite loop introduced with recent changes to the HTML push parser.
|
| Found by OSS-Fuzz.
* use new htmlParseLookupCommentEnd to find comment endsMike Dalessio2020-12-161-9/+37
| Note that the caret in error messages generated during comment parsing may have moved by one byte.
|
| See guidance provided on incorrectly-closed comments here:
|
| https://html.spec.whatwg.org/multipage/parsing.html#parse-error-incorrectly-closed-comment