path: root/encoding.c
Commit message | Author | Age | Files | Lines
* encoding: Fix compiler warning in ICU build | Nick Wellnhofer | 2023-04-17 | 1 | -1/+1
* encoding: Fix error code in asciiToUTF8 | Nick Wellnhofer | 2023-03-26 | 1 | -1/+1
  Use correct error code when invalid ASCII bytes are encountered.

  Found by OSS-Fuzz.
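A minimal sketch of the idea, not the actual asciiToUTF8 code: a converter that copies 7-bit bytes and reports a distinct error code when it meets a byte outside the ASCII range. The function name, the `size_t` in/out parameters, and the use of -2 for "invalid input" are assumptions made for illustration (other entries in this log mention -2 errors from the chunk converters).

```c
#include <stddef.h>

/*
 * Hypothetical converter: copy ASCII bytes from in to out.
 * On return, *inlen and *outlen hold the number of bytes consumed
 * and produced. Returns the number of bytes written, or -2 if a
 * byte >= 0x80 (not valid ASCII) is encountered.
 */
static int ascii_to_utf8(unsigned char *out, size_t *outlen,
                         const unsigned char *in, size_t *inlen)
{
    size_t n = (*inlen < *outlen) ? *inlen : *outlen;
    size_t i;

    for (i = 0; i < n; i++) {
        if (in[i] >= 0x80) {
            /* Report how far we got, then signal invalid input. */
            *inlen = i;
            *outlen = i;
            return -2;
        }
        out[i] = in[i];
    }
    *inlen = i;
    *outlen = i;
    return (int) i;
}
```

A caller can tell "input was invalid" apart from other failures by checking for -2 and can still use the bytes converted before the bad one.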
* parser: Rework EBCDIC code page detection | Nick Wellnhofer | 2023-03-21 | 1 | -180/+3
  To detect EBCDIC code pages, we used to switch the encoding twice and had to be very careful not to decode data after the XML declaration before the second switch. This relied on a hard-coded expected size of the XML declaration and was complicated and unreliable.

  Now we convert the first 200 bytes to EBCDIC-US and parse the encoding declaration manually.
* malloc-fail: Check for malloc failure in xmlFindCharEncodingHandler | Nick Wellnhofer | 2023-02-17 | 1 | -0/+12
  Don't return encoding handlers with a NULL name.

  Found with libFuzzer, see #344.
* malloc-fail: Fix leak of xmlCharEncodingHandler | Nick Wellnhofer | 2023-02-17 | 1 | -1/+0
  Also free handler if its name is NULL.

  Found with libFuzzer, see #344.
* encoding: Cast toupper argument to unsigned char | Nick Wellnhofer | 2023-02-17 | 1 | -5/+5
  Fixes undefined behavior.

  Also cast return value explicitly to fix implicit-integer-sign-change checks.
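The two casts described above can be sketched as follows. The helper name and signature are hypothetical; only the cast pattern around `toupper` is the point.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: copy src into dst as upper case, truncating
 * if needed and always NUL-terminating when dstsize > 0.
 */
static void upper_copy(char *dst, size_t dstsize, const char *src)
{
    size_t i;

    if (dstsize == 0)
        return;
    for (i = 0; i + 1 < dstsize && src[i] != '\0'; i++) {
        /*
         * toupper() takes an int that must be representable as
         * unsigned char (or be EOF). A plain char may be negative,
         * so cast the argument to unsigned char first, and cast the
         * int result back explicitly so the narrowing is visible.
         */
        dst[i] = (char) toupper((unsigned char) src[i]);
    }
    dst[i] = '\0';
}
```

For example, `upper_copy(buf, sizeof(buf), "utf-8")` leaves `"UTF-8"` in `buf`, and bytes above 0x7F no longer reach `toupper` as negative values.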
* malloc-fail: Fix null deref if growing input buffer fails | Nick Wellnhofer | 2023-01-24 | 1 | -1/+2
  Also add some error checks.

  Found with libFuzzer, see #344.
* encoding.c: Fix for documentation generator | Nick Wellnhofer | 2022-12-08 | 1 | -0/+2
  Top-level macro invocations throw off the documentation parser.
* encoding: Make init function private | Nick Wellnhofer | 2022-11-27 | 1 | -7/+11
* encoding: Remove unused variable xmlDefaultCharEncodingHandler | Nick Wellnhofer | 2022-11-27 | 1 | -10/+2
* encoding: Allocate default handlers statically | Nick Wellnhofer | 2022-11-24 | 1 | -84/+116
* buf: Deprecate static/immutable buffers | Nick Wellnhofer | 2022-11-20 | 1 | -4/+2
* Remove explicit integer casts | Nick Wellnhofer | 2022-09-01 | 1 | -11/+11
  Remove explicit integer casts as final operation

  - in assignments
  - when passing arguments
  - when returning values

  Remove casts

  - to the same type
  - from certain range-bound values

  The main motivation is that these explicit casts don't change the result of operations and only render UBSan's implicit-conversion checks useless. Removing these casts allows UBSan to detect cases where truncation or sign-changes occur unexpectedly.

  Document some explicit casts as truncating and add a few missing ones.
* Consolidate private header files | Nick Wellnhofer | 2022-08-26 | 1 | -33/+8
  Private functions were previously declared

  - in header files in the root directory
  - in public headers guarded with IN_LIBXML
  - in libxml.h
  - redundantly in source files that used them.

  Consolidate all private header files in include/private.
* xmlBufAvail() should return length without including a byte for NUL terminator | David Kilzer | 2022-05-25 | 1 | -10/+4
  * buf.c:
  (xmlBufAvail):
  - Return the number of bytes available in the buffer, but do not include a byte for the NUL terminator so that it is reserved.

  * encoding.c:
  (xmlCharEncFirstLineInput):
  (xmlCharEncInput):
  (xmlCharEncOutput):
  * xmlIO.c:
  (xmlOutputBufferWriteEscape):
  - Remove code that subtracts 1 from the return value of xmlBufAvail(). It was implemented inconsistently anyway.
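A minimal sketch of the "reserve one byte for the terminator" rule, using a hypothetical two-field buffer rather than the library's real buffer type. The struct and function names are assumptions for illustration.

```c
#include <stddef.h>

/* Hypothetical buffer bookkeeping: size is the allocated capacity,
 * use is the number of bytes currently occupied. */
typedef struct {
    size_t size;
    size_t use;
} Buf;

/*
 * Return how many payload bytes can still be written while keeping
 * one byte free for the NUL terminator. Callers can then use the
 * result directly instead of each subtracting 1 on their own.
 */
static size_t buf_avail(const Buf *buf)
{
    if (buf == NULL || buf->size <= buf->use)
        return 0;
    return buf->size - buf->use - 1;
}
```

Centralizing the subtraction in one place is the design point: callers that forget it, or apply it twice, are where inconsistencies like the one described above tend to come from.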
* Mark more static data as `const` | David Kilzer | 2022-04-07 | 1 | -15/+15
  Similar to 8f5710379, mark more static data structures with `const` keyword.

  Also fix placement of `const` in encoding.c.

  Original patch by Sarah Wilkin.
* Deprecate module init and cleanup functions | Nick Wellnhofer | 2022-03-06 | 1 | -0/+8
  These functions shouldn't be part of the public API. Most init functions are only thread-safe when called from xmlInitParser. Global variables should only be cleaned up by calling xmlCleanupParser.
* Fix memory leak in xmlFindCharEncodingHandler | Nick Wellnhofer | 2022-03-05 | 1 | -0/+4
  Fix memory leak in an unlikely error condition.

  Thanks to Wentao Liang for the report.

  Fixes #342.
* Remove ICONV_CONST test | Nick Wellnhofer | 2022-03-04 | 1 | -1/+4
  We can simply cast the offending pointer to (void *).
* Don't check for standard C89 headers | Nick Wellnhofer | 2022-03-02 | 1 | -7/+2
  Don't check for

  - ctype.h
  - errno.h
  - float.h
  - limits.h
  - math.h
  - signal.h
  - stdarg.h
  - stdlib.h
  - string.h
  - time.h

  Stop including non-standard headers

  - malloc.h
  - strings.h
* Don't include ICU headers in public headers | Nick Wellnhofer | 2022-03-01 | 1 | -0/+14
  There's no need to make these implementation details public.
* Fix unused variable warnings with disabled features | Nick Wellnhofer | 2022-02-22 | 1 | -0/+3
* Remove elfgcchack.h | Nick Wellnhofer | 2022-02-20 | 1 | -2/+0
  The same optimization can be enabled with -fno-semantic-interposition since GCC 5. clang has always used this option by default.
* Fix integer conversion warning in xmlIconvWrapper | Nick Wellnhofer | 2022-01-25 | 1 | -2/+2
  Use size_t for return value of iconv(3) to avoid an UBSan integer conversion warning.
* Fix random dropping of characters on dumping ASCII encoded XML | Mohammad Razavi | 2022-01-16 | 1 | -1/+1
  Fix a bug in the xmlCharEncOutput return value which causes xmlNodeDumpOutput to drop characters randomly.

  xmlCharEncOutput returns zero if the length of the input buffer is zero, but ignores the fact that it may have already encoded the input buffer: the input's length can be zero because xmlEncOutputChunk returned a -2 error and the underlying code then fixed the error by encoding the input.

  xmlCharEncOutput collects the number of bytes written to the output buffer, but in this situation it returns zero instead of that total.

  This commit fixes the issue by returning the total number of bytes instead, so xmlNodeDumpOutput continues writing and does not stop because it mistakenly thinks the output buffer was not changed in that iteration.

  Fixes #314
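The return-value contract described above can be sketched with a toy encoder. This is an illustration under stated assumptions, not the xmlCharEncOutput implementation: the function name, the code-point input array, and the decimal `&#N;` fallback are all hypothetical. The point is that bytes produced by the fallback path are counted in the same running total as bytes produced by the normal path, and that total is what gets returned.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical encoder: write code points to out as ASCII, replacing
 * anything above 0x7F with a decimal character reference.
 * Returns the TOTAL number of bytes written (including bytes that
 * came from the fallback path), or -1 if out is too small.
 */
static int encode_ascii(const unsigned long *in, size_t inlen,
                        char *out, size_t cap)
{
    size_t written = 0;
    size_t i;

    for (i = 0; i < inlen; i++) {
        if (in[i] < 0x80) {
            if (written >= cap)
                return -1;
            out[written++] = (char) in[i];
        } else {
            /* Fallback: emit "&#N;" straight into the output. */
            int n = snprintf(out + written, cap - written,
                             "&#%lu;", in[i]);

            if (n < 0 || (size_t) n >= cap - written)
                return -1;
            written += (size_t) n;
        }
    }
    return (int) written;
}
```

With input `{ 'a', 0x20AC, 'b' }` the function writes `a&#8364;b` and returns 9. A caller that treats a zero return as "nothing changed" would only be safe if the fallback bytes are included in the count, as they are here.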
* Fix parse failure when 4-byte character in UTF-16 BE is split across a chunk | David Kilzer | 2022-01-16 | 1 | -11/+12
  This makes the logic in UTF16BEToUTF8() match UTF16LEToUTF8().

  * encoding.c:
  (UTF16LEToUTF8):
  - Fix comment to describe what the code does.
  (UTF16BEToUTF8):
  - Fix undefined behavior which was applied to UTF16LEToUTF8() in 2f9382033e.
  - Add bounds check to while() loop which was applied to UTF16LEToUTF8() in be803967db.
  - Do not return -2 when (in >= inend) to fix the bug. This was applied to UTF16LEToUTF8() in 496a1cf592.
  - Inline (<< 8) statements to match UTF16LEToUTF8().

  Add the following tests and results:

    test/text-4-byte-UTF-16-BE-offset.xml
    test/text-4-byte-UTF-16-BE.xml
    test/text-4-byte-UTF-16-LE-offset.xml
    test/text-4-byte-UTF-16-LE.xml
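A minimal sketch of the behavior this entry describes, written from scratch and not taken from UTF16BEToUTF8(): a UTF-16BE decoder that treats a surrogate pair cut off by the end of the chunk as "need more input" instead of an error. The function name, the code-point output, and the `consumed` parameter are assumptions for illustration; -2 is used for malformed input to mirror the error code mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical decoder: convert UTF-16BE bytes to code points.
 * Returns the number of code points written, or -2 for a malformed
 * sequence. *consumed receives the number of input bytes used.
 * A trailing odd byte or a high surrogate whose partner has not
 * arrived yet is left unconsumed rather than reported as an error,
 * so the caller can retry once the next chunk is appended.
 */
static int utf16be_decode(const unsigned char *in, size_t inlen,
                          uint32_t *out, size_t outcap, size_t *consumed)
{
    size_t i = 0;
    int n = 0;

    while (inlen - i >= 2 && (size_t) n < outcap) {
        uint32_t c = ((uint32_t) in[i] << 8) | in[i + 1];

        if ((c & 0xFC00) == 0xD800) {           /* high surrogate */
            uint32_t d;

            if (inlen - i < 4)
                break;                          /* pair split across chunks */
            d = ((uint32_t) in[i + 2] << 8) | in[i + 3];
            if ((d & 0xFC00) != 0xDC00) {
                *consumed = i;
                return -2;                      /* unpaired high surrogate */
            }
            c = 0x10000 + (((c & 0x3FF) << 10) | (d & 0x3FF));
            i += 4;
        } else if ((c & 0xFC00) == 0xDC00) {
            *consumed = i;
            return -2;                          /* stray low surrogate */
        } else {
            i += 2;
        }
        out[n++] = c;
    }
    *consumed = i;
    return n;
}
```

Feeding the bytes `00 41 D8 3D` decodes only `U+0041` and reports 2 bytes consumed; once `DE 00` arrives and the caller resubmits `D8 3D DE 00`, the decoder yields `U+1F600`.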
* Remove unused variable in xmlCharEncOutFunc | David King | 2021-05-23 | 1 | -3/+0
  Fixes a compiler warning:

    encoding.c: In function 'xmlCharEncOutFunc__internal_alias':
    encoding.c:2632:9: warning: unused variable 'output' [-Wunused-variable]
     2632 | int output = 0;

  https://gitlab.gnome.org/GNOME/libxml2/-/issues/254
* Fix slow parsing of HTML with encoding errors | Nick Wellnhofer | 2021-02-20 | 1 | -0/+5
  Under certain circumstances, the HTML parser would try to guess and switch input encodings multiple times, leading to slow processing of documents with encoding errors. The repeated scanning of the input buffer when guessing encodings could even lead to quadratic behavior.

  The code in htmlCurrentChar probably assumed that if there's an encoding handler, it is guaranteed to produce valid UTF-8. This holds true in general, but if the detected encoding was "UTF-8", the UTF8ToUTF8 encoding handler simply invoked memcpy without checking for invalid UTF-8. This still must be fixed, preferably by not using this handler at all.

  Also leave a note that switching encodings twice seems impossible to implement correctly.

  Add a check when handling UTF-8 encoding errors in htmlCurrentChar to avoid this situation, even if encoders produce invalid UTF-8.

  Found by OSS-Fuzz.
* encoding: fix memleak in xmlRegisterCharEncodingHandler() | Xiaoming Ni | 2020-12-07 | 1 | -2/+11
  The return type of xmlRegisterCharEncodingHandler() is void, so the invoker cannot determine whether xmlRegisterCharEncodingHandler() executed successfully. When nbCharEncodingHandler >= MAX_ENCODING_HANDLERS, the "handler" is not added to the array "handlers". As a result, the memory of "handler" cannot be managed and released, which is a memory leak.

  So add "xmlfree(handler)" to fix the memory leak on the failure branch of xmlRegisterCharEncodingHandler().

  Reported-by: wuqing <wuqing30@huawei.com>
  Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
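The ownership problem described above can be sketched with a small stand-in registry. Everything here is hypothetical (the `Handler` type, the table size of 4, and the function name); it only illustrates the rule that a void "register" function which takes ownership must release the object itself when it cannot store it.

```c
#include <stdlib.h>

#define MAX_HANDLERS 4              /* hypothetical, small for illustration */

typedef struct {
    const char *name;               /* not owned in this sketch */
} Handler;

static Handler *handlers[MAX_HANDLERS];
static int nbHandlers = 0;

/*
 * Takes ownership of a heap-allocated handler. Because the function
 * returns void, the caller cannot learn that registration failed,
 * so the failure branch must free the handler to avoid a leak.
 */
static void register_handler(Handler *handler)
{
    if (handler == NULL)
        return;
    if (nbHandlers >= MAX_HANDLERS) {
        free(handler);              /* table full: release, don't leak */
        return;
    }
    handlers[nbHandlers++] = handler;
}
```

An alternative design would return a status code and leave ownership with the caller on failure; freeing inside the function keeps the existing void signature unchanged.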
* Fix building with ICU 68. | Frederik Seiffert | 2020-11-19 | 1 | -1/+1
  ICU 68 no longer defines the TRUE macro.

  Closes #204.
* Fix return values and documentation in encoding.c | Nick Wellnhofer | 2020-07-06 | 1 | -12/+53
  Make xmlEncInputChunk and xmlEncOutputChunk return 0 on success and never a positive value.

  Make xmlCharEncFirstLineInt, xmlCharEncFirstLineInt and xmlCharEncOutFunc return the number of bytes written.
* Fix undefined behavior in UTF16LEToUTF8 | Nick Wellnhofer | 2020-06-15 | 1 | -1/+6
  Don't perform arithmetic on null pointer.

  Found with libFuzzer and UBSan.
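As a small illustration of the class of bug named above (not the actual change to UTF16LEToUTF8): in C, computing `ptr + len` is undefined when `ptr` is NULL, even if `len` is zero and the result is never dereferenced. A hypothetical helper can avoid this by returning before any end pointer is formed.

```c
#include <stddef.h>

/*
 * Hypothetical helper: sum the bytes in [in, in + len).
 * The early return ensures "in + len" is never evaluated for a
 * NULL pointer or an empty range, which UBSan would flag.
 */
static unsigned long sum_bytes(const unsigned char *in, size_t len)
{
    const unsigned char *end;
    unsigned long total = 0;

    if (in == NULL || len == 0)
        return 0;
    end = in + len;                 /* safe: in is non-NULL here */
    while (in < end)
        total += *in++;
    return total;
}
```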
* Fix return value of xmlCharEncOutput | Nick Wellnhofer | 2020-06-15 | 1 | -3/+3
  Commit 407b393d introduced a regression caused by xmlCharEncOutput returning 0 in case of success instead of the number of bytes written. Always use its return value for nbchars in xmlOutputBufferWrite.

  Fixes #166.
* Fix typos | Nick Wellnhofer | 2020-03-08 | 1 | -1/+1
  Resolves #133.
* Large batch of typo fixes | Jared Yanovich | 2019-09-30 | 1 | -9/+9
  Closes #109.
* Remove a misleading line from xmlCharEncOutput | Andrey Bienkowski | 2018-07-23 | 1 | -2/+0
  Closes: https://bugzilla.gnome.org/show_bug.cgi?id=793028

  It seems this line was accidentally copied over from xmlCharEncOutFunc. In xmlCharEncOutput, output is a pointer, so incrementing it by ret can point it where it wasn't supposed to be pointing. Luckily the current implementation doesn't dereference the pointer after advancing it.

  Signed-off-by: Daniel Veillard <veillard@redhat.com>
* Fix unused parameter warning without ICU | Nick Wellnhofer | 2017-11-09 | 1 | -0/+1
* Fixed ICU to set flush correctly and provide pivot buffer. | Joel Hockey | 2017-11-04 | 1 | -21/+25
  By always setting flush=TRUE when doing multiple reads, ICU will not correctly handle truncated utf8 chars across read boundaries. The fix is to set flush=TRUE only on final read, and to provide a pivot buffer which is maintained by libxml between calls to ucnv_convertEx.
* Fix pathological performance when outputting charrefs | Nick Wellnhofer | 2017-06-19 | 1 | -70/+59
  If a character can't be represented in the output encoding, it is converted to a character reference. This used to replace the character in the input stream by calling xmlBufAddHead or xmlBufferAddHead. These functions shifted the entire input array around, leading to quadratic performance when converting a run of non-representable characters. This is most pronounced when dumping to memory.

  Output the charref directly instead.

  Found with libFuzzer.
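A minimal sketch of "output the charref directly", assuming a plain byte buffer with a capacity and a running length. The function name and parameters are hypothetical, and the decimal `&#N;` form is an assumption. Appending at the end of the output costs time proportional to the reference itself, whereas prepending to the input requires shifting everything that is still pending.

```c
#include <stdio.h>
#include <string.h>
#include <stddef.h>

/*
 * Hypothetical helper: append the decimal character reference for
 * code point cp to out, which holds *len bytes out of cap.
 * Returns 0 on success, -1 if the reference does not fit. The
 * output is not NUL-terminated; *len tracks its length.
 */
static int append_charref(char *out, size_t cap, size_t *len,
                          unsigned long cp)
{
    char ref[32];
    int n = snprintf(ref, sizeof(ref), "&#%lu;", cp);

    if (n < 0 || (size_t) n >= sizeof(ref))
        return -1;
    if (*len > cap || (size_t) n > cap - *len)
        return -1;
    memcpy(out + *len, ref, (size_t) n);
    *len += (size_t) n;
    return 0;
}
```

Calling `append_charref(out, sizeof(out), &len, 8364)` on an empty buffer appends the seven bytes `&#8364;` and sets `len` to 7; no already-queued input is moved.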
* Deduplicate code in encoding.c | Nick Wellnhofer | 2017-06-19 | 1 | -312/+153
  Introduce static functions xmlEncInputChunk and xmlEncOutputChunk that handle the internal/iconv/ICU branching.
* Fix some format string warnings with possible format string vulnerability | David Kilzer | 2016-05-23 | 1 | -1/+1
  For https://bugzilla.gnome.org/show_bug.cgi?id=761029

  Decorate every method in libxml2 with the appropriate LIBXML_ATTR_FORMAT(fmt,args) macro and add some cleanups following the reports.
* Avoid a possibility of dangling encoding handler | Gaurav | 2013-11-29 | 1 | -2/+14
  For https://bugzilla.gnome.org/show_bug.cgi?id=711149

  In function: int xmlCharEncCloseFunc(xmlCharEncodingHandler *handler)

  If the freed handler is one of the entries in the handlers[i] list, then that handlers[i] entry is left dangling. This may lead to crashes at places where handlers is read.
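The hazard described above can be sketched as follows. The `Handler` type, the two-entry table, and the return codes are hypothetical and chosen for illustration; the table entries are static objects here only to keep the sketch short. The idea is that a "close" function should leave alone any handler that the registry still points to.

```c
#include <stdlib.h>

typedef struct {
    const char *name;
} Handler;

#define NB_HANDLERS 2

/* Hypothetical registry of handlers that must outlive any caller. */
static Handler utf8Handler = { "UTF-8" };
static Handler asciiHandler = { "ASCII" };
static Handler *handlers[NB_HANDLERS] = { &utf8Handler, &asciiHandler };

/*
 * Release a handler unless it is still referenced by the registry.
 * Returns 1 if the handler was freed, 0 if it was kept because it
 * is registered, and -1 for a NULL argument.
 */
static int close_handler(Handler *handler)
{
    int i;

    if (handler == NULL)
        return -1;
    for (i = 0; i < NB_HANDLERS; i++) {
        if (handlers[i] == handler)
            return 0;               /* registered: freeing would dangle */
    }
    free(handler);
    return 1;
}
```

Only handlers created on demand (for example, heap-allocated wrappers around an external converter) reach the `free` call; registry entries stay valid for later lookups.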
* #705267 - add additional defines checks for support "./configure --with-minimum" | Denis Pauk | 2013-08-03 | 1 | -0/+2
  https://bugzilla.gnome.org/show_bug.cgi?id=705267
* Fix the flushing out of raw buffers on encoding conversions | Daniel Veillard | 2013-02-13 | 1 | -3/+4
  https://bugzilla.gnome.org/show_bug.cgi?id=692915

  The new set of converting functions tried to limit the encoding conversion of the raw buffer to the consumption one to work in a more progressive fashion. Unfortunately this was bad for performance and led to errors on progressive parsing when a very large chunk was close to the end of the document. Fix the new internal function and switch back to the old way of converting. Fix another bug in the process.
* Try IBM-037 when looking for EBCDIC handlers | Petr Sumbera | 2012-12-12 | 1 | -0/+2
  http://en.wikipedia.org/wiki/EBCDIC_037 as it is another variant of EBCDIC
* Big space and tab cleanup | Daniel Veillard | 2012-09-11 | 1 | -3/+3
  Remove all space before tabs and space and tabs at end of lines.
* Regenerating docs and API files | Daniel Veillard | 2012-08-10 | 1 | -2/+2
  Various cleanups

  * configure.in: force regeneration of APIs in my environment
  * buf.c buf.h enc.h encoding.c include/libxml/tree.h include/libxml/xmlerror.h save.h tree.c: various comment cleanups pointed out by apibuild
  * doc/apibuild.py: added the 3 new internal headers in the excludes
  * doc/libxml2-api.xml doc/libxml2-refs.xml: regenerated the API
  * doc/symbols.xml: listing new entry points for 2.9.0
  * doc/devhelp/*: regenerated
* Adding new encoding function to deal with the new structures | Daniel Veillard | 2012-07-23 | 1 | -4/+479
  * encoding.c: adds xmlCharEncFirstLineInput, xmlCharEncInput and xmlCharEncOutput
  * enc.h: the functions are not made public but added to this new header
* Prevent an infinite loop when dumping a node with encoding problems | Timothy Elliott | 2012-05-08 | 1 | -2/+18
  When a node is dumped with a new encoding, we may encounter characters that are not supported in the new encoding. libxml2 handles this by replacing the character with character references, but in some encodings this can result in an infinite loop when the character references themselves contain unsupported characters.

  This fixes the infinite loop by undoing a character reference substitution when it cannot be inserted, and returning an encoder error.

  This bug was noticed when looking into an infinite loop bug report for the Ruby Nokogiri project. The original bug report, "nokogiri process hangs on call to inner_html" is here:

  https://github.com/tenderlove/nokogiri/issues/400
* Fix an off by one error in encoding | Daniel Veillard | 2011-08-19 | 1 | -2/+2
  This off-by-one error doesn't seem to reproduce on Linux, but the error is real.