path: root/parserInternals.c
Commit message | Author | Age | Files | Lines
* parser: Deprecate more internal functions | Nick Wellnhofer | 2023-04-26 | 1 | -0/+8
|
* parser: Fix regression in memory pull parser with encoding | Nick Wellnhofer | 2023-04-19 | 1 | -1/+11
| Revert another change from commit 98840d40. Decode the whole buffer when reading from memory and switching to the initial encoding. Add some comments about potential improvements.
* parser: Fix regression when switching input encodings | Nick Wellnhofer | 2023-04-13 | 1 | -4/+12
| Revert some changes from commit 98840d40. WebKit/Chromium can actually switch from ISO-8859-1 to UTF-16 in the middle of parsing. This is a bad idea, but we have to keep supporting this use case.
* parser: Don't grow push parser buffers | Nick Wellnhofer | 2023-04-12 | 1 | -0/+3
| This should fix a short-lived regression when push parsing with encodings.
* parser: Halt parser if switching encodings fails | Nick Wellnhofer | 2023-03-30 | 1 | -0/+2
| Avoids buffer overread in htmlParseHTMLAttribute. Found by OSS-Fuzz.
* parser: Fix buffer overread in xmlDetectEBCDIC | Nick Wellnhofer | 2023-03-26 | 1 | -1/+2
| Short-lived regression found by OSS-Fuzz.
* parser: Grow input buffer earlier when reading characters | Nick Wellnhofer | 2023-03-21 | 1 | -2/+2
| Make more bytes available after invoking CUR_CHAR or NEXT.
* parser: Rework EBCDIC code page detection | Nick Wellnhofer | 2023-03-21 | 1 | -108/+76
| To detect EBCDIC code pages, we used to switch the encoding twice and had to be very careful not to decode data after the XML declaration before the second switch. This relied on a hard-coded expected size of the XML declaration and was complicated and unreliable.
|
| Now we convert the first 200 bytes to EBCDIC-US and parse the encoding declaration manually.
* parser: Rework shrinking of input buffers | Nick Wellnhofer | 2023-03-21 | 1 | -14/+2
| Don't try to grow the input buffer in xmlParserShrink. This makes sure that no memory allocations are made and the function always succeeds.
|
| Remove unnecessary invocations of SHRINK. Invoke SHRINK at the end of DTD parsing loops. Shrink before growing.
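A minimal sketch of an allocation-free shrink, for illustration only. The struct, the function name and the KEEP_BEHIND value are hypothetical and do not mirror libxml2's actual xmlParserShrink. The point is that discarding consumed bytes needs only a memmove inside the existing buffer, so the operation cannot fail.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical input window: base <= cur <= end, all within one allocation. */
struct input {
    char *base;             /* start of buffered data */
    char *cur;              /* current parse position */
    char *end;              /* one past the last buffered byte */
    unsigned long consumed; /* bytes already discarded from the front */
};

/* Assumed number of already-parsed bytes kept behind cur (e.g. for error context). */
#define KEEP_BEHIND 4

/* Discard consumed bytes in place. No allocation, so this always succeeds. */
static void input_shrink(struct input *in) {
    size_t used = (size_t)(in->cur - in->base);
    size_t drop, remaining;

    if (used <= KEEP_BEHIND)
        return;                         /* nothing worth discarding */

    drop = used - KEEP_BEHIND;
    remaining = (size_t)(in->end - in->base) - drop;

    /* Regions overlap, so memmove rather than memcpy. */
    memmove(in->base, in->base + drop, remaining);
    in->cur -= drop;
    in->end -= drop;
    in->consumed += drop;               /* overflow handling omitted in this sketch */
}
```

Because the function only moves bytes and adjusts pointers, callers need no error path for it, which is the property the commit message describes.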
* parser: More fixes to xmlParserGrow | Nick Wellnhofer | 2023-03-16 | 1 | -20/+5
| xmlHaltParser must be called after reporting an error. Switch to xmlBufSetInputBaseCur.
* malloc-fail: Fix buffer overread when reading from input | Nick Wellnhofer | 2023-03-15 | 1 | -36/+25
| Found by OSS-Fuzz, see #344.
* parser: Fix short-lived regression causing infinite loops | Nick Wellnhofer | 2023-03-14 | 1 | -9/+40
| Fix 3eb6bf03. We really have to halt the parser, so the input buffer gets reset.
* parser: Deprecate some parser input functions | Nick Wellnhofer | 2023-03-13 | 1 | -0/+2
|
* parser: Stop calling xmlParserInputShrink | Nick Wellnhofer | 2023-03-13 | 1 | -0/+57
| Introduce xmlParserShrink which takes a parser context to simplify error handling.
* malloc-fail: Fix null deref in xmlParserInputShrink | Nick Wellnhofer | 2023-03-13 | 1 | -0/+7
| Found by OSS-Fuzz.
* parser: Stop calling xmlParserInputGrow | Nick Wellnhofer | 2023-03-12 | 1 | -10/+60
| Introduce xmlParserGrow which takes a parser context to simplify error handling.
* malloc-fail: Fix null deref if growing input buffer fails | Nick Wellnhofer | 2023-01-24 | 1 | -0/+6
| Also add some error checks. Found with libFuzzer, see #344.
* parser: Fix integer overflow of input ID | Nick Wellnhofer | 2022-12-22 | 1 | -1/+6
| Applies a patch from Chromium. Also stop incrementing input ID of subcontexts. This isn't necessary.
|
| Fixes #465.
* entities: Stop counting entities | Nick Wellnhofer | 2022-12-21 | 1 | -1/+0
| This was only used in the old version of xmlParserEntityCheck.
* entities: Rework entity amplification checks | Nick Wellnhofer | 2022-12-21 | 1 | -2/+6
| This commit implements robust detection of entity amplification attacks, better known as the "billion laughs" attack.
|
| We now limit the size of the document after substitution of entities to 10 times the size before expansion. This guarantees linear behavior by definition. There already was a similar check before, but the accounting of "sizeentities" (size of external entities) and "sizeentcopy" (size of all copies created by entity references) wasn't accurate. We also need saturation arithmetic since we're historically limited to "unsigned long" which is 32-bit on many platforms.
|
| A maximum of 10 MB of substitutions is always allowed. This should make use cases like DITA work which have caused problems in the past.
|
| The old checks based on the number of entities were removed. This is accounted for by adding a fixed cost to each entity reference.
|
| Entity amplification checks are now enabled even if XML_PARSE_HUGE is set. This option is mainly used to allow larger text nodes. Most users were unaware that it also disabled entity expansion checks.
|
| Some of the limits might be adjusted later. If this change turns out to affect legitimate use cases, we can add a separate parser option to disable the checks.
|
| Fixes #294.
| Fixes #345.
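The accounting described in this commit message can be sketched roughly as follows. This is an illustrative sketch, not libxml2's code: all names are hypothetical, the factor 10 and the 10 MB allowance come from the message above, and the exact byte value of the allowance and the fixed per-reference cost are assumptions.

```c
#include <limits.h>

/* Factor from the commit message: expanded size may be at most 10x the input. */
#define MAX_AMPLIFICATION 10UL
/* "10 MB" is always allowed; the exact byte count used here is an assumption. */
#define ALLOWED_EXPANSION (10UL * 1000UL * 1000UL)
/* Fixed cost charged per entity reference; the value is an assumption. */
#define ENTITY_FIXED_COST 20UL

/* Saturating addition: clamps at ULONG_MAX instead of wrapping around,
 * which matters where unsigned long is only 32 bits wide. */
static unsigned long sat_add(unsigned long a, unsigned long b) {
    return (a > ULONG_MAX - b) ? ULONG_MAX : a + b;
}

/* Charge one entity reference: the copied size plus a fixed cost. */
static unsigned long account_reference(unsigned long expanded,
                                       unsigned long entity_size) {
    return sat_add(expanded, sat_add(entity_size, ENTITY_FIXED_COST));
}

/* Returns 1 if the expansion is excessive, 0 otherwise.
 * Dividing the expanded size avoids overflowing input_size * 10. */
static int amplification_exceeded(unsigned long input_size,
                                  unsigned long expanded) {
    if (expanded <= ALLOWED_EXPANSION)
        return 0;                       /* small documents always pass */
    return (expanded / MAX_AMPLIFICATION) > input_size;
}
```

Bounding the output by a constant multiple of the input is what makes the total work linear in the input size, and the fixed cost per reference covers the case of many references to tiny or empty entities.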
* parser: Fix progress check when parsing character data | Nick Wellnhofer | 2022-11-21 | 1 | -1/+1
| Skip over zero bytes to guarantee progress. Short-lived regression.
* parser: Fix 'consumed' accounting when switching encodings | Nick Wellnhofer | 2022-11-20 | 1 | -0/+1
|
* io: Fix a few integer overflows in I/O statistics | Nick Wellnhofer | 2022-11-20 | 1 | -4/+12
| There are still many places where arithmetic on "consumed" stats isn't checked for overflow, affecting platforms with a 32-bit long type.
* io: Rearrange code in xmlSwitchInputEncodingInt | Nick Wellnhofer | 2022-11-20 | 1 | -104/+96
| No functional change.
* io: Remove xmlInputReadCallbackNop | Nick Wellnhofer | 2022-11-20 | 1 | -1/+2
| In some cases, for example when using encoders, the read callback was set to NULL, in other cases it was set to xmlInputReadCallbackNop. xmlGROW only tested for xmlInputReadCallbackNop, resulting in errors when parsing large encoded content from memory.
|
| Always use a NULL callback for memory buffers to avoid ambiguities.
|
| Fixes #262.
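The "one convention" idea can be illustrated with a small sketch. The types and function names below are hypothetical stand-ins, not libxml2's xmlParserInputBuffer API. With a NULL callback as the only marker for memory buffers, a single test is enough and no caller can miss a second sentinel value.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical read callback: fills buffer, returns bytes read. */
typedef int (*read_cb)(void *context, char *buffer, int len);

/* Hypothetical input buffer; a NULL readcallback means "memory buffer". */
struct input_buffer {
    void *context;
    read_cb readcallback;
};

/* Single, unambiguous test for memory buffers. */
static int is_memory_buffer(const struct input_buffer *buf) {
    return buf->readcallback == NULL;
}

/* Try to fetch more data; a memory buffer has nothing more to read. */
static int grow_input(struct input_buffer *buf, char *dest, int len) {
    if (is_memory_buffer(buf))
        return 0;
    return buf->readcallback(buf->context, dest, len);
}

/* Example callback for the sketch: always delivers len bytes of 'x'. */
static int read_fill(void *context, char *buffer, int len) {
    (void)context;
    memset(buffer, 'x', (size_t)len);
    return len;
}
```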
* io: Check for memory buffer early in xmlParserInputGrow | Nick Wellnhofer | 2022-11-13 | 1 | -4/+4
|
* Remove or annotate char casts | Nick Wellnhofer | 2022-09-01 | 1 | -4/+4
|
* Remove explicit integer casts | Nick Wellnhofer | 2022-09-01 | 1 | -11/+11
| Remove explicit integer casts as final operation
|
| - in assignments
| - when passing arguments
| - when returning values
|
| Remove casts
|
| - to the same type
| - from certain range-bound values
|
| The main motivation is that these explicit casts don't change the result of operations and only render UBSan's implicit-conversion checks useless. Removing these casts allows UBSan to detect cases where truncation or sign-changes occur unexpectedly.
|
| Document some explicit casts as truncating and add a few missing ones.
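A small illustration of the motivation, using hypothetical function names. Both functions return the same value, since converting an int to unsigned char is well defined in C. The difference is only in what a sanitizer such as Clang's -fsanitize=implicit-conversion can observe: the explicit cast marks the truncation as intentional, while the implicit conversion remains visible to the check.

```c
/* Explicit cast as the final operation: the truncation is treated as
 * intentional, so an implicit-conversion check has nothing to report. */
static unsigned char narrow_explicit(int value) {
    return (unsigned char) value;
}

/* Same result without the cast: the conversion is implicit, so an
 * implicit-conversion check can flag values that do not fit. */
static unsigned char narrow_implicit(int value) {
    return value;
}
```

For an input such as 0x141, both variants yield 0x41, which is why dropping the redundant cast does not change behavior while restoring the diagnostic.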
* Make xmlNewSAXParserCtx take a const sax handler | Nick Wellnhofer | 2022-09-01 | 1 | -4/+5
| Also improve documentation.
* Consolidate private header files | Nick Wellnhofer | 2022-08-26 | 1 | -2/+5
| Private functions were previously declared
|
| - in header files in the root directory
| - in public headers guarded with IN_LIBXML
| - in libxml.h
| - redundantly in source files that used them.
|
| Consolidate all private header files in include/private.
* Mark more functions setting globals as deprecated | Nick Wellnhofer | 2022-08-24 | 1 | -0/+4
|
* Mark more parser functions as deprecated | Nick Wellnhofer | 2022-08-24 | 1 | -1/+16
| No compiler warnings generated yet.
* Introduce xmlNewSAXParserCtxt and htmlNewSAXParserCtxt | Nick Wellnhofer | 2022-08-24 | 1 | -8/+55
| Add API functions to create a parser context with a custom SAX handler without having to mess with ctxt->sax manually.
* Use xmlStrlen in xmlNewStringInputStream | Nick Wellnhofer | 2022-08-20 | 1 | -1/+1
| xmlStrlen handles buffers larger than INT_MAX more gracefully.
* Create stream with buffer in xmlNewStringInputStream | Nick Wellnhofer | 2022-08-20 | 1 | -4/+11
| Create an input stream with a buffer in xmlNewStringInputStream. Otherwise, switching encodings won't work.
|
| See #34.
* Clean up encoding switching code | Nick Wellnhofer | 2022-04-02 | 1 | -127/+23
| - Remove xmlSwitchToEncodingInt which was basically just a wrapper around xmlSwitchInputEncodingInt.
| - Simplify xmlSwitchEncoding.
| - Improve error handling in xmlSwitchInputEncodingInt.
| - Deprecate xmlSwitchInputEncoding.
* Fix calls to deprecated init/cleanup functions | Nick Wellnhofer | 2022-03-29 | 1 | -1/+1
| Only use xmlInitParser/xmlCleanupParser.
* Avoid arithmetic on freed pointers | Nick Wellnhofer | 2022-03-06 | 1 | -36/+9
|
* Remove unneeded #includes | Nick Wellnhofer | 2022-03-04 | 1 | -13/+0
|
* Don't check for standard C89 headers | Nick Wellnhofer | 2022-03-02 | 1 | -4/+1
| Don't check for
|
| - ctype.h
| - errno.h
| - float.h
| - limits.h
| - math.h
| - signal.h
| - stdarg.h
| - stdlib.h
| - string.h
| - time.h
|
| Stop including non-standard headers
|
| - malloc.h
| - strings.h
* Remove useless __CYGWIN__ checks | Nick Wellnhofer | 2022-02-28 | 1 | -1/+1
| From what I can tell, some really early Cygwin versions from around 1998-2000 used to erroneously define _WIN32. This was eventually fixed, but these days, the `defined(_WIN32) && !defined(__CYGWIN__)` idiom is unnecessary.
|
| Now, we only check for __CYGWIN__ in xmlexports.h when deciding whether to use __declspec.
* Remove elfgcchack.h | Nick Wellnhofer | 2022-02-20 | 1 | -2/+0
| The same optimization can be enabled with -fno-semantic-interposition since GCC 5. clang has always used this option by default.
* Rework validation context flags | Nick Wellnhofer | 2022-02-20 | 1 | -1/+1
| Use a bitmask instead of magic values to
|
| - keep track whether the validation context is part of a parser context
| - keep track whether xmlValidateDtdFinal was called
|
| This allows adding additional flags later.
|
| Note that this deliberately changes the name of a public struct member, assuming that this was always private data never to be used by client code.
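A minimal sketch of the bitmask approach, with hypothetical names that are not libxml2's actual flag or struct names. Each condition gets its own bit in one unsigned field, so the two states are tracked independently and further flags can be added later without changing the struct layout.

```c
/* Hypothetical flag bits; one bit per tracked condition. */
#define VFLAG_PART_OF_PARSER   (1u << 0) /* context is embedded in a parser context */
#define VFLAG_DTD_FINAL_CALLED (1u << 1) /* final DTD validation has run */

/* Hypothetical validation context holding the bitmask. */
struct validation_ctxt {
    unsigned int flags;
};

/* Set a flag without disturbing the others. */
static void vctxt_set(struct validation_ctxt *v, unsigned int flag) {
    v->flags |= flag;
}

/* Test a flag; returns 1 if set, 0 otherwise. */
static int vctxt_has(const struct validation_ctxt *v, unsigned int flag) {
    return (v->flags & flag) != 0;
}
```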
* Fix memory leak in xmlNewInputFromFile | David King | 2022-01-16 | 1 | -1/+3
| Found by Coverity.
|
| https://bugzilla.redhat.com/show_bug.cgi?id=1938806
* Fix slow parsing of HTML with encoding errors | Nick Wellnhofer | 2021-02-20 | 1 | -0/+5
| Under certain circumstances, the HTML parser would try to guess and switch input encodings multiple times, leading to slow processing of documents with encoding errors. The repeated scanning of the input buffer when guessing encodings could even lead to quadratic behavior.
|
| The code in htmlCurrentChar probably assumed that if there's an encoding handler, it is guaranteed to produce valid UTF-8. This holds true in general, but if the detected encoding was "UTF-8", the UTF8ToUTF8 encoding handler simply invoked memcpy without checking for invalid UTF-8. This still must be fixed, preferably by not using this handler at all.
|
| Also leave a note that switching encodings twice seems impossible to implement correctly.
|
| Add a check when handling UTF-8 encoding errors in htmlCurrentChar to avoid this situation, even if encoders produce invalid UTF-8.
|
| Found by OSS-Fuzz.
* Stop counting nbChars in parser context | Nick Wellnhofer | 2020-08-09 | 1 | -6/+0
| The value was inaccurate and never used.
* Fix typos | Nick Wellnhofer | 2020-03-08 | 1 | -3/+3
| Resolves #133.
* Large batch of typo fixes | Jared Yanovich | 2019-09-30 | 1 | -4/+4
| Closes #109.
* Fix memory leak in xmlSwitchInputEncodingInt error path | Nick Wellnhofer | 2018-11-22 | 1 | -0/+10
| Found by OSS-Fuzz.
* Revert "Change calls to xmlCharEncInput to set flush false" | Nick Wellnhofer | 2018-03-17 | 1 | -1/+1
| This reverts commit 6e6ae5daa6cd9640c9a83c1070896273e9b30d14 which broke decoding of larger documents with ICU.
|
| See https://bugs.chromium.org/p/chromium/issues/detail?id=820163