path: root/libraries/base/GHC/IO/Encoding.hs
Commit message (Author, Age, Files, Lines)
* Add Javascript backend (Sylvain Henry, 2022-11-29, 1 file, -1/+5)
| | | | | | | | | | | | | | | Add JS backend adapted from the GHCJS project by Luite Stegeman. Some features haven't been ported or implemented yet. Tests for these features have been disabled with an associated gitlab ticket. Bump array submodule. Work funded by IOG. Co-authored-by: Jeffrey Young <jeffrey.young@iohk.io> Co-authored-by: Luite Stegeman <stegeman@gmail.com> Co-authored-by: Josh Meredith <joshmeredith2008@gmail.com>
* Documentation for setLocaleEncoding (Bodigrim, 2022-04-27, 1 file, -2/+27)
|
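For readers unfamiliar with the function documented by this commit, here is a minimal sketch of how it might be used. It assumes `setLocaleEncoding`, `getLocaleEncoding` and `utf8` as exported by `GHC.IO.Encoding`; the helper name `forceUtf8Locale` is hypothetical and not part of `base`.

```haskell
import GHC.IO.Encoding (TextEncoding, getLocaleEncoding, setLocaleEncoding, utf8)

-- Hypothetical helper (not part of base): override the process-wide
-- locale encoding and return what getLocaleEncoding reports afterwards.
forceUtf8Locale :: IO TextEncoding
forceUtf8Locale = do
  setLocaleEncoding utf8
  getLocaleEncoding

main :: IO ()
main = do
  enc <- forceUtf8Locale
  -- The Show instance of TextEncoding prints the encoding's name.
  print enc
```

As far as the documentation describes it, the setting affects Handles opened after the call; Handles that already exist, such as `stdout`, keep the encoding they were created with.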
* docs: Clarify documentation of `getFileSystemEncoding` (#20344) (Zubin Duggal, 2021-10-05, 1 file, -1/+3)
| | | | It may not always be a Unicode encoding
* base: Ensure that encoding global variables aren't inlined (Ben Gamari, 2020-03-31, 1 file, -0/+10)
| | | | | | | | | | | | As noted in #17970, these (e.g. `getFileSystemEncoding` and `setFileSystemEncoding`) previously had unfoldings, which would break their global-ness. While not strictly necessary, I also add a NOINLINE on `initLocaleEncoding` since it is used in `System.IO`, ensuring that we only query the system's locale encoding once. Fixes #17970.
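The problem class this commit addresses can be illustrated with the usual "global mutable variable" idiom. This is a sketch of the general pattern only, not the code in `GHC.IO.Encoding`; the names `globalSetting`, `getSetting` and `setSetting` are hypothetical.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import System.IO.Unsafe (unsafePerformIO)

-- Hypothetical global variable standing in for the encoding globals.
-- Without NOINLINE the optimiser may inline the right-hand side at each
-- use site, so every use could allocate its own IORef and the getter
-- and setter would no longer share state.
globalSetting :: IORef String
globalSetting = unsafePerformIO (newIORef "UTF-8")
{-# NOINLINE globalSetting #-}

getSetting :: IO String
getSetting = readIORef globalSetting

setSetting :: String -> IO ()
setSetting = writeIORef globalSetting

main :: IO ()
main = do
  setSetting "ISO-8859-1"
  getSetting >>= putStrLn
```

With the pragma in place, `getSetting` observes the value written by `setSetting`, which is the "global-ness" the commit message refers to.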
* Fix typos, via a Levenshtein-style corrector (Brian Wignall, 2020-01-04, 1 file, -2/+2)
|
* Properly escape character literals in Haddocks (Alec Theriault, 2019-02-15, 1 file, -1/+1)
| | | | | | | | Character literals in Haddock should not be written as plain `'\n'` since single quotes are for linking identifiers. Besides, since we want the character literal to be monospaced, we really should use `@\'\\n\'@`. [skip ci]
* Fix ambiguous/out-of-scope Haddock identifiers (Alec Theriault, 2018-08-21, 1 file, -3/+4)
| | | | | | | | | | | | | | | | | This drastically cuts down on the number of Haddock warnings when making docs for `base`. Plus this means more actual links end up in the docs! Also fixed other small mostly markup issues in the documentation along the way. This is a docs-only change. Reviewers: hvr, bgamari, thomie Reviewed By: thomie Subscribers: thomie, rwbarton, carter Differential Revision: https://phabricator.haskell.org/D5055
* Initialize hs_init with UTF8 encoded arguments on Windows. (Andreas Klebinger, 2017-07-27, 1 file, -0/+12)
| | | | | | | | | | | | | | | | | | | | | | | Summary: Get utf8 encoded arguments before we call hs_init and use them instead of ignoring hs_init arguments. This reduces differing behaviour of the RTS between windows and linux and simplifies the code involved. A few testcases were changed to expect the same result on windows as on linux after the changes. This fixes #13940. Test Plan: ./validate Reviewers: austin, hvr, bgamari, erikd, simonmar, Phyx Subscribers: Phyx, rwbarton, thomie GHC Trac Issues: #13940 Differential Revision: https://phabricator.haskell.org/D3739
* Always use native-Haskell de/encoders for ASCII and latin1 (Thomas Miedema, 2016-05-24, 1 file, -15/+19)
| | | | | | | | This fixes test encoding005 on Windows (#10623). Reviewed by: austin, bgamari Differential Revision: https://phabricator.haskell.org/D2262
* Use builtin ISO 8859-1 decoder in mkTextEncoding (Herbert Valerio Riedel, 2015-12-04, 1 file, -0/+2)
| | | | | | | | | We already do this for UTF8/16/32, so it seems obvious to do the same for the closely related popular ISO 8859-1 encoding, and avoid iconv issues on some platforms (such as AIX, which bundles a broken `libiconv` by default). This fixes #11096
* When iconv is unavailable, use an ASCII encoding to encode ASCII (Reid Barton, 2015-07-21, 1 file, -2/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | D898 and D1059 implemented a fallback behavior to handle the case that the end user's iconv installation is broken (typically due to running inside a chroot in which the necessary locale files and/or gconv modules have not been installed). In this case, if the program requests an ASCII locale, GHC's char8 encoding is used rather than the program failing. However, silently mangling data like char8 does when the programmer did not ask for it is poor behavior, for reasons described in D1059. This commit implements an ASCII encoding and uses it in the fallback case when iconv is unavailable and the user has requested ASCII. Test Plan: Added tests for the encodings defined in Latin1. Also, manually ran a statically-linked executable of that test in a chroot and the tests passed (up to the ones that call mkTextEncoding "LATIN1", since there is no fallback from iconv for that case yet). Reviewers: austin, hvr, hsyl20, bgamari Reviewed By: hsyl20, bgamari Subscribers: thomie Differential Revision: https://phabricator.haskell.org/D1085 GHC Trac Issues: #7695, #10623
* Fix self-contained handling of ASCII encoding (Ben Gamari, 2015-07-10, 1 file, -10/+23)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | D898 was primarily intended to fix hangs in the event that iconv was unavailable (namely #10298 and #7695). In addition to this fix, it also introduced self-contained handling of ANSI terminals to allow compiled executables to run in minimal environments lacking iconv. However, the behavior that the patch introduced is highly suspicious. Specifically, it gives the user a UTF-8 encoding even if they requested ASCII. This has the potential to break quite a lot of code. At very least it breaks GHC's Unicode terminal detection logic, which attempts to catch an invalid character when encoding a pair of smart-quotes. Of course, this exception will never be thrown if a UTF-8 encoder is used. Here we use the `char8` encoding to handle requests for ASCII encodings in the event that we find iconv to be non-functional. Fixes #10623. Test Plan: Validate with T8959a Reviewers: rwbarton, hvr, austin, hsyl20 Subscribers: thomie Differential Revision: https://phabricator.haskell.org/D1059 GHC Trac Issues: #10623
* base: fix #10298 & #7695 (Austin Seipp, 2015-05-28, 1 file, -1/+13)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Summary: This applies a patch from Reid Barton and Sylvain Henry, which fixes a disastrous infinite loop when iconv fails to load locale files, as specified in #10298. The fix is a bit of a hack but should be fine; for the actual reasoning behind it, see `Note [Disaster and iconv]`. In addition to this fix, we also patch up the IO Encoding utilities to recognize several variations of the 'ASCII' encoding (including its aliases) directly so that GHC can do conversions without iconv. This allows a static binary to sit in an initramfs. Authored-by: Reid Barton <rwbarton@gmail.com> Authored-by: Sylvain Henry <hsyl20@gmail.com> Signed-off-by: Austin Seipp <austin@well-typed.com> Test Plan: Eyeballed it. Reviewers: rwbarton, hvr Subscribers: bgamari, thomie Differential Revision: https://phabricator.haskell.org/D898 GHC Trac Issues: #10298, #7695
* Convert `/Since: .../` to new `@since ...` syntax (Herbert Valerio Riedel, 2014-12-16, 1 file, -6/+6)
| | | | | | Starting with Haddock 2.16 there's a new built-in support for since-annotations Note: This exposes a bug in the `@since` implementation (see e.g. `Data.Bits`)
* `M-x delete-trailing-whitespace` & `M-x untabify` (Herbert Valerio Riedel, 2014-09-24, 1 file, -5/+5)
| | | | ...several modules in `base` recently touched by me
* Simplify import-graph a bit more (Herbert Valerio Riedel, 2014-09-21, 1 file, -1/+1)
| | | | | | This is preparatory refactoring for avoiding import cycles when `Data.Traversable` will be imported by `Control.Monad` and `Data.List` for implementing #9586
* Move `Maybe`-typedef into GHC.Base (Herbert Valerio Riedel, 2014-09-16, 1 file, -1/+0)
| | | | | | | This is preparatory work for reintroducing SPECIALISEs that were lost in d94de87252d0fe2ae97341d186b03a2fbe136b04 Differential Revision: https://phabricator.haskell.org/D214
* Drop redundant `{-# LANGUAGE #-}` pragmas (Herbert Valerio Riedel, 2013-09-28, 1 file, -1/+1)
| | | | | | | | | | | | | This removes language pragmas from Haskell modules which are implicitly active with `default-language: Haskell2010`. Specifically, the following language extension pragmas are removed by this commit: - PatternGuards - ForeignFunctionInterface - EmptyDataDecls - NoBangPatterns Signed-off-by: Herbert Valerio Riedel <hvr@gnu.org>
* Add Haddock `/Since: 4.4.0.0/` comments to symbols (Herbert Valerio Riedel, 2013-09-22, 1 file, -0/+2)
| | | | | | | | | | This commit retroactively adds `/Since: 4.4.0.0/` annotations to symbols newly added/exposed in `base-4.4.0.0` (as shipped with GHC 7.2.1). See also 6368362f which adds the respective annotation for symbols newly added in `base-4.7.0.0` (that goes together with GHC 7.8.1). Signed-off-by: Herbert Valerio Riedel <hvr@gnu.org>
* Add Haddock `/Since: 4.5.[01].0/` comments to symbols (Herbert Valerio Riedel, 2013-09-22, 1 file, -0/+9)
| | | | | | | | | | This commit retroactively adds `/Since: 4.5.[01].0/` annotations to symbols newly added/exposed in `base-4.5.[01].0` (as shipped with GHC 7.4.[12]). See also 6368362f which adds the respective annotation for symbols newly added in `base-4.7.0.0` (that goes together with GHC 7.8.1). Signed-off-by: Herbert Valerio Riedel <hvr@gnu.org>
* Improve documentation for mkTextEncoding (Max Bolingbroke, 2013-04-23, 1 file, -4/+24)
|
* Fix compilation error on windows. (David Terei, 2011-11-22, 1 file, -3/+3)
|
* Make the fileSystemEncoding/localeEncoding/foreignEncoding mutable (Max Bolingbroke, 2011-11-18, 1 file, -10/+25)
|
* Fix build on Windows (Max Bolingbroke, 2011-11-02, 1 file, -5/+5)
|
* Be more forgiving about encoding name capitalization/hyphenization (Max Bolingbroke, 2011-11-02, 1 file, -8/+9)
|
* Avoid using iconv for the locale TextEncoding if we can help it (Max Bolingbroke, 2011-11-02, 1 file, -20/+34)
|
* Update base for latest Safe Haskell. (David Terei, 2011-10-25, 1 file, -0/+1)
|
* Fix a typo (Ian Lynagh, 2011-07-07, 1 file, -1/+1)
|
* SafeHaskell: Added SafeHaskell to base (David Terei, 2011-06-18, 1 file, -10/+10)
|
* Add System.IO.char8, the encoding used by openBinaryFile, (Simon Marlow, 2011-05-24, 1 file, -1/+12)
| | | | | | and correct the documentation for hSetBinaryMode which claimed that it was using the latin1 encoding when in fact it was using an unchecked modulo-256 version of it.
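The "unchecked modulo-256" behaviour mentioned here can be observed with a small experiment. This is a sketch under stated assumptions: `bytesWith` and the scratch file name are hypothetical, while `char8`, `hSetEncoding`, `hSetNewlineMode`, `noNewlineTranslation` and `withBinaryFile` are taken from `System.IO`.

```haskell
import System.IO

-- Hypothetical helper: write a string through the given encoding, then
-- read back the raw bytes that were stored in the file.
bytesWith :: TextEncoding -> String -> IO [Int]
bytesWith enc str = do
  let path = "char8-demo.bin"  -- scratch file name chosen for this sketch
  withFile path WriteMode $ \h -> do
    hSetEncoding h enc
    hSetNewlineMode h noNewlineTranslation
    hPutStr h str
  withBinaryFile path ReadMode $ \h -> do
    raw <- hGetContents h
    let bytes = map fromEnum raw
    -- Force the whole list before the Handle is closed.
    length bytes `seq` return bytes

main :: IO ()
main =
  -- U+0141 does not fit in one byte; char8 keeps only the low 8 bits,
  -- so 0x141 is stored as 0x41.
  bytesWith char8 "\x141" >>= print  -- prints [65]
```

The checked `latin1` encoding is documented to report an error for characters above `'\255'` rather than truncating them, which is the distinction the commit message draws.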
* Use Unicode private-use characters for roundtripping (Max Bolingbroke, 2011-05-18, 1 file, -3/+3)
| | | | | | | | | This replaces the previous scheme (which used lone surrogates). The reason is that there is Haskell software in the wild (i.e. the text package) that chokes on Char values that do not represent Unicode characters. This new approach will not work correctly if the reserved private-use characters are actually encountered in the input, but we expect this to be rare.
* Big patch to improve Unicode support in GHC. (Max Bolingbroke, 2011-05-14, 1 file, -22/+56)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Validated on OS X and Windows, this patch series fixes #5061, #1414, #3309, #3308, #3307, #4006 and #4855. The major changes are: 1) Make Foreign.C.String.*CString use the locale encoding. This change follows the FFI specification in Haskell 98, which has never actually been implemented before. The functions exported from Foreign.C.String are partially-applied versions of those from GHC.Foreign, which allows the user to supply their own TextEncoding. We also introduce foreignEncoding as the name of the text encoding that follows the FFI appendix in that it transliterates encoding errors. 2) I also changed the code so that mkTextEncoding always tries the native-Haskell decoders in preference to those from iconv, even on non-Windows. The motivation here is simply that it is better for compatibility if we do this, and those are the ones you get for the utf* and latin1* predefined TextEncodings anyway. 3) Implement surrogate-byte error handling mode for TextEncoding. This implements PEP383-like behaviour so that we are able to roundtrip byte strings through Strings without loss of information. The withFilePath function now uses this encoding to get to/from CStrings, so any code that uses that will get the right PEP383 behaviour automatically. 4) Implement three other coding failure modes: ignore, throw error, transliterate. These mimic the behaviour of the GNU Iconv extensions.
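A rough sketch of the round-tripping idea from point 3, using the `GHC.Foreign` functions mentioned in point 1. It assumes that `mkTextEncoding` accepts a `//ROUNDTRIP` suffix in current versions of `base`; the function name `roundTripBytes` is hypothetical.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Array (peekArray, withArrayLen)
import Foreign.Ptr (castPtr)
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (mkTextEncoding)

-- Hypothetical helper: decode raw bytes to a String and encode them
-- again, using UTF-8 with the round-trip failure mode.
roundTripBytes :: [Word8] -> IO [Word8]
roundTripBytes bytes = do
  enc <- mkTextEncoding "UTF-8//ROUNDTRIP"
  -- Decode: bytes that are not valid UTF-8 become escape characters.
  str <- withArrayLen bytes $ \n p ->
    GHC.peekCStringLen enc (castPtr p, n)
  -- Encode: the escape characters are turned back into the original bytes.
  GHC.withCStringLen enc str $ \(p, n) ->
    peekArray n (castPtr p)

main :: IO ()
main = do
  let input = [104, 105, 255]  -- "hi" followed by a byte that is invalid in UTF-8
  output <- roundTripBytes input
  print (output == input)
```

Under those assumptions the bytes should come back unchanged, whereas the default failure mode would be expected to raise an error on the invalid byte.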
* add missing extensions for Windows (Simon Marlow, 2011-01-31, 1 file, -1/+1)
|
* Use explicit language extensions & remove extension fields from base.cabal (simonpj@microsoft.com, 2011-01-28, 1 file, -1/+3)
| | | | | | | | | | Add explicit {-# LANGUAGE xxx #-} pragmas to each module, that say what extensions that module uses. This makes it clearer where different extensions are used in the (large, variegated) base package. Now base.cabal doesn't need any extensions field. Thanks to Bas van Dijk for doing all the work.
* On Windows, use the console code page for text file encoding/decoding. (Judah Jacobson, 2009-09-13, 1 file, -1/+9)
| | | | | | | | | | | | | | We keep all of the code page tables in the module GHC.IO.Encoding.CodePage.Table. That file was generated automatically by running codepages/MakeTable.hs; more details are in the comments at the start of that script. Storing the lookup tables adds about 40KB to each statically linked executable; this only increases the size of a "hello world" program by about 7%. Currently we do not support double-byte encodings (Chinese/Japanese/Korean), since including those codepages would increase the table size to 400KB. It will be straightforward to implement them once the work on library DLLs is finished.
* warning fix: -fno-implicit-prelude -> -XNoImplicitPrelude (Simon Marlow, 2009-07-15, 1 file, -1/+1)
|
* Add more documentation to mkTextEncoding (Simon Marlow, 2009-07-15, 1 file, -1/+18)
| | | | | noting that "//IGNORE" and "//TRANSLIT" suffixes can be used with GNU iconv.
* Add the utf8_bom codec (Simon Marlow, 2009-07-15, 1 file, -1/+12)
| | | | as suggested during the discussion on the libraries list.
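To show what the `utf8_bom` codec changes on output, the following sketch compares the bytes written by `utf8` and `utf8_bom`. The helper `encodedBytes` and the scratch file name are hypothetical; the encodings and Handle functions are those exported by `System.IO`.

```haskell
import System.IO

-- Hypothetical helper: encode a string to a scratch file with the given
-- encoding and return the bytes that ended up on disk.
encodedBytes :: TextEncoding -> String -> IO [Int]
encodedBytes enc str = do
  let path = "bom-demo.txt"  -- scratch file name chosen for this sketch
  withFile path WriteMode $ \h -> do
    hSetEncoding h enc
    hPutStr h str
  withBinaryFile path ReadMode $ \h -> do
    raw <- hGetContents h
    let bytes = map fromEnum raw
    -- Force the whole list before the Handle is closed.
    length bytes `seq` return bytes

main :: IO ()
main = do
  encodedBytes utf8 "hi" >>= print      -- prints [104,105]
  encodedBytes utf8_bom "hi" >>= print  -- prints [239,187,191,104,105]
```

The three extra leading bytes are the UTF-8 encoding of the byte-order mark U+FEFF.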
* Export Unicode and newline functionality from System.IO; update Haddock docs (Simon Marlow, 2009-07-13, 1 file, -14/+26)
|
* Remove unused imports from base (simonpj@microsoft.com, 2009-07-06, 1 file, -1/+1)
| | | | | These unused imports are detected by the new unused-import code
* Rewrite of the IO library, including Unicode support (Simon Marlow, 2009-06-12, 1 file, -0/+107)

    Highlights:

    * Unicode support for Handle I/O:

      ** Automatic encoding and decoding using a per-Handle encoding.

      ** The encoding defaults to the locale encoding (only on Unix
         so far, perhaps Windows later).

      ** Built-in UTF-8, UTF-16 (BE/LE), and UTF-32 (BE/LE) codecs.

      ** iconv-based codec for other encodings on Unix

    * Modularity: the low-level IO interface is exposed as a type class
      (GHC.IO.IODevice) so you can build your own low-level IO providers
      and make Handles from them.

    * Newline translation: instead of being Windows-specific wired-in
      magic, the translation from \r\n -> \n and back again is available
      on all platforms and is configurable for reading/writing
      independently.

    Unicode-aware Handles
    ~~~~~~~~~~~~~~~~~~~~~

    This is a significant restructuring of the Handle implementation with
    the primary goal of supporting Unicode character encodings.

    The only change to the existing behaviour is that by default, text IO
    is done in the prevailing locale encoding of the system (except on
    Windows [1]).

    Handles created by openBinaryFile use the Latin-1 encoding, as do
    Handles placed in binary mode using hSetBinaryMode.

    We provide a way to change the encoding for an existing Handle:

        GHC.IO.Handle.hSetEncoding :: Handle -> TextEncoding -> IO ()

    and various encodings (from GHC.IO.Encoding):

        latin1,
        utf8,
        utf16, utf16le, utf16be,
        utf32, utf32le, utf32be,
        localeEncoding,

    and a way to look up other encodings:

        GHC.IO.Encoding.mkTextEncoding :: String -> IO TextEncoding

    (it's system-dependent whether the requested encoding will be
    available).

    We may want to export these from somewhere more permanent; that's a
    topic for a future library proposal.

    Thanks to suggestions from Duncan Coutts, it's possible to call
    hSetEncoding even on buffered read Handles, and the right thing
    happens. So we can read from text streams that include multiple
    encodings, such as an HTTP response or email message, without having
    to turn buffering off (though there is a penalty for switching
    encodings on a buffered Handle, as the IO system has to do some
    re-decoding to figure out where it should start reading from again).

    If there is a decoding error, it is reported when an attempt is made
    to read the offending character from the Handle, as you would expect.

    Performance varies. For "hGetContents >>= putStr" I found the new
    library was faster on my x86_64 machine, but slower on an x86. On
    the whole I'd expect things to be a bit slower due to the extra
    decoding/encoding, but probably not noticeably. If performance is
    critical for your app, then you should be using bytestring and text
    anyway.

    [1] Note: locale encoding is not currently implemented on Windows due
    to the built-in Win32 APIs for encoding/decoding not being sufficient
    for our purposes. Ask me for details. Offers of help gratefully
    accepted.

    Newline Translation
    ~~~~~~~~~~~~~~~~~~~

    In the old IO library, text-mode Handles on Windows had automatic
    translation from \r\n -> \n on input, and the opposite on output. It
    was implemented using the underlying CRT functions, which meant that
    there were certain odd restrictions, such as read/write text handles
    needing to be unbuffered, and seeking not working at all on text
    Handles.

    In the rewrite, newline translation is now implemented in the upper
    layers, as it needs to be since we have to perform Unicode decoding
    before newline translation. This means that it is now available on
    all platforms, which can be quite handy for writing portable code.

    For now, I have left the behaviour as it was, namely \r\n -> \n on
    Windows, and no translation on Unix. However, another reasonable
    default (similar to what Python does) would be to do \r\n -> \n on
    input, and convert to the platform-native representation (either
    \r\n or \n) on output. This is called universalNewlineMode (below).

    The API is as follows. (available from GHC.IO.Handle for now, again
    this is something we will probably want to try to get into System.IO
    at some point):

        -- | The representation of a newline in the external file or stream.
        data Newline = LF    -- ^ "\n"
                     | CRLF  -- ^ "\r\n"
                     deriving Eq

        -- | Specifies the translation, if any, of newline characters between
        -- internal Strings and the external file or stream. Haskell Strings
        -- are assumed to represent newlines with the '\n' character; the
        -- newline mode specifies how to translate '\n' on output, and what to
        -- translate into '\n' on input.
        data NewlineMode
          = NewlineMode { inputNL :: Newline,
                            -- ^ the representation of newlines on input
                          outputNL :: Newline
                            -- ^ the representation of newlines on output
                        }
          deriving Eq

        -- | The native newline representation for the current platform
        nativeNewline :: Newline

        -- | Map "\r\n" into "\n" on input, and "\n" to the native newline
        -- representation on output. This mode can be used on any platform, and
        -- works with text files using any newline convention. The downside is
        -- that @readFile a >>= writeFile b@ might yield a different file.
        universalNewlineMode :: NewlineMode
        universalNewlineMode = NewlineMode { inputNL = CRLF,
                                             outputNL = nativeNewline }

        -- | Use the native newline representation on both input and output
        nativeNewlineMode :: NewlineMode
        nativeNewlineMode = NewlineMode { inputNL = nativeNewline,
                                          outputNL = nativeNewline }

        -- | Do no newline translation at all.
        noNewlineTranslation :: NewlineMode
        noNewlineTranslation = NewlineMode { inputNL = LF, outputNL = LF }

        -- | Change the newline translation mode on the Handle.
        hSetNewlineMode :: Handle -> NewlineMode -> IO ()

    IO Devices
    ~~~~~~~~~~

    The major change here is that the implementation of the Handle
    operations is separated from the underlying IO device, using type
    classes. File descriptors are just one IO provider; I have also
    implemented memory-mapped files (good for random-access read/write)
    and a Handle that pipes output to a Chan (useful for testing code
    that writes to a Handle). New kinds of Handle can be implemented
    outside the base package, for instance someone could write
    bytestringToHandle. A Handle is made using mkFileHandle:

        -- | makes a new 'Handle'
        mkFileHandle :: (IODevice dev, BufferedIO dev, Typeable dev)
                     => dev
                        -- ^ the underlying IO device, which must support
                        -- 'IODevice', 'BufferedIO' and 'Typeable'
                     -> FilePath
                        -- ^ a string describing the 'Handle', e.g. the file
                        -- path for a file. Used in error messages.
                     -> IOMode
                        -- ^ The mode in which the 'Handle' is to be used
                     -> Maybe TextEncoding
                        -- ^ text encoding to use, if any
                     -> NewlineMode
                        -- ^ newline translation mode
                     -> IO Handle

    This also means that someone can write a completely new IO
    implementation on Windows based on native Win32 HANDLEs, and
    distribute it as a separate package (I really hope somebody does
    this!).

    This restructuring isn't as radical as previous designs. I haven't
    made any attempt to make a separate binary I/O layer, for example
    (although hGetBuf/hPutBuf do bypass the text encoding and newline
    translation). The main goal here was to get Unicode support in, and
    to allow others to experiment with making new kinds of Handle. We
    could split up the layers further later.

    API changes and Module structure
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    NB. GHC.IOBase and GHC.Handle are now DEPRECATED (they are still
    present, but are just re-exporting things from other modules now).
    For 6.12 we'll want to bump base to version 5 and add a base4-compat.
    For now I'm using #if __GLASGOW_HASKELL__ >= 611 to avoid deprecated
    warnings.

    I split modules into smaller parts in many places. For example, we
    now have GHC.IORef, GHC.MVar and GHC.IOArray containing the
    implementations of IORef, MVar and IOArray respectively. This was
    necessary for untangling dependencies, but it also makes things
    easier to follow.

    The new module structure for the IO-related parts of the base
    package is:

    GHC.IO
       Implementation of the IO monad; unsafe*; throw/catch

    GHC.IO.IOMode
       The IOMode type

    GHC.IO.Buffer
       Buffers and operations on them

    GHC.IO.Device
       The IODevice and RawIO classes.

    GHC.IO.BufferedIO
       The BufferedIO class.

    GHC.IO.FD
       The FD type, with instances of IODevice, RawIO and BufferedIO.

    GHC.IO.Exception
       IO-related Exceptions

    GHC.IO.Encoding
       The TextEncoding type; built-in TextEncodings; mkTextEncoding

    GHC.IO.Encoding.Types
    GHC.IO.Encoding.Iconv
    GHC.IO.Encoding.Latin1
    GHC.IO.Encoding.UTF8
    GHC.IO.Encoding.UTF16
    GHC.IO.Encoding.UTF32
       Implementation internals for GHC.IO.Encoding

    GHC.IO.Handle
       The main API for GHC's Handle implementation, provides all the Handle
       operations + mkFileHandle + hSetEncoding.

    GHC.IO.Handle.Types
    GHC.IO.Handle.Internals
    GHC.IO.Handle.Text
       Implementation of Handles and operations.

    GHC.IO.Handle.FD
       Parts of the Handle API implemented by file-descriptors: openFile,
       stdin, stdout, stderr, fdToHandle etc.
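The per-Handle encoding and newline translation described in this commit message can be tried out with a short program. This is a sketch only: it uses the names as they are exported from `System.IO` in later versions of `base` (the message itself places them in `GHC.IO.Handle` at the time), and `crlfDemo` and the scratch file name are hypothetical.

```haskell
import System.IO

-- Hypothetical helper: write one line with UTF-8 and CRLF output
-- translation, then return the raw bytes on disk together with the text
-- obtained by reading the file back in universalNewlineMode.
crlfDemo :: FilePath -> IO ([Int], String)
crlfDemo path = do
  withFile path WriteMode $ \h -> do
    hSetEncoding h utf8
    hSetNewlineMode h NewlineMode { inputNL = LF, outputNL = CRLF }
    hPutStr h "\233\n"  -- U+00E9 followed by a newline
  raw <- withBinaryFile path ReadMode $ \h -> do
    s <- hGetContents h
    let bytes = map fromEnum s
    -- Force the whole list before the Handle is closed.
    length bytes `seq` return bytes
  txt <- withFile path ReadMode $ \h -> do
    hSetEncoding h utf8
    hSetNewlineMode h universalNewlineMode
    s <- hGetContents h
    length s `seq` return s
  return (raw, txt)

main :: IO ()
main = crlfDemo "newline-demo.txt" >>= print
-- prints ([195,169,13,10],"\233\n")
```

The raw bytes show both layers at work: U+00E9 is encoded as two UTF-8 bytes and `'\n'` is written as `\r\n`, while reading with `universalNewlineMode` maps `\r\n` back to `'\n'`.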