Here we copy a subset of the UTF-8 implementation living in `ghc-boot`
into `base`, with the intent of dropping the former in the future. For
this reason, the `ghc-boot` copy is now CPP-guarded on
`MIN_VERSION_base(4,18,0)`.

Naturally, we can't copy *all* of the functions defined by `ghc-boot`, as
some depend upon `bytestring`; instead, we copy only those that depend
solely on `base` and `ghc-prim`.
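
As a rough illustration of the kind of CPP guard described above, here is a
minimal sketch of a hypothetical downstream module that uses the `base`
export when it is available and falls back to a local definition otherwise.
It assumes the `MIN_VERSION_base` macro is in scope (Cabal defines it, and
recent GHCs do so as well when CPP is enabled) and that `base >= 4.18`
exports `utf8EncodedLength` from `GHC.Encoding.UTF8`. The fallback is an
illustrative stand-in, not the actual `ghc-boot` code.

```haskell
{-# LANGUAGE CPP #-}

-- Hypothetical downstream module: prefer the copy in base when it exists,
-- otherwise define a local fallback with the same name and type.

#if MIN_VERSION_base(4,18,0)
import GHC.Encoding.UTF8 (utf8EncodedLength)
#else
import Data.Char (ord)

-- Illustrative fallback: number of bytes needed to encode a String.
-- Assumption: NUL is counted as two bytes, which is believed to match
-- the modified UTF-8 convention of the ghc-boot implementation.
utf8EncodedLength :: String -> Int
utf8EncodedLength = sum . map width
  where
    width c
      | n > 0 && n <= 0x7f = 1
      | n <= 0x7ff         = 2
      | n <= 0xffff        = 3
      | otherwise          = 4
      where
        n = ord c
#endif
```

Either branch yields a `utf8EncodedLength :: String -> Int`, so callers need
not care which definition they got.
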

Further consolidation?
----------------------

Currently GHC ships with at least five UTF-8 implementations:

* the implementation used by GHC in `ghc-boot:GHC.Utils.Encoding`; this
  can be used at a number of types including `Addr#`, `ByteArray#`,
  `ForeignPtr`, `Ptr`, `ShortByteString`, and `ByteString`. Most of this
  can be removed in GHC 9.6+2 (that is, two major releases after 9.6),
  when the copies in `base` will become available to `ghc-boot`.
* the copy of the `ghc-boot` definition now exported by
  `base:GHC.Encoding.UTF8`. This can be used at `Addr#`, `Ptr`,
  `ByteArray#`, and `ForeignPtr`.
* the decoder used by `unpackCStringUtf8#` in `ghc-prim:GHC.CString`;
  this is specialised at `Addr#`.
* the codec used by the IO subsystem in `base:GHC.IO.Encoding.UTF8`;
  this is specialised at `Addr#` but, unlike the above, supports
  recovery in the presence of partial codepoints (since in IO contexts
  codepoints may be broken across buffers).
* the implementation provided by the `text` library.

This does seem a tad silly. On the other hand, these implementations
*do* materially differ from one another (e.g. in the types they support,
the detail in errors they can report, and the ability to recover from
partial codepoints). Consequently, it's quite unclear that further
consolidation would be worthwhile.
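
To make the "partial codepoints" distinction above concrete, here is a small
sketch of a decoder that carries a truncated trailing sequence over to the
next chunk. All names (`Step`, `decodeStep`, `decodeChunk`) are hypothetical,
and this is not how `GHC.IO.Encoding.UTF8` is implemented; it works on plain
lists and omits overlong-encoding and surrogate checks.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- | Outcome of decoding one codepoint from the front of the input.
data Step
  = Decoded Char [Word8] -- ^ a character plus the unconsumed bytes
  | NeedMore [Word8]     -- ^ truncated sequence; carry these bytes over
  | Invalid              -- ^ malformed input
  | Done                 -- ^ no input left
  deriving (Eq, Show)

-- | Expected sequence length, judged from the lead byte alone.
seqLength :: Word8 -> Maybe Int
seqLength b
  | b < 0x80           = Just 1
  | b .&. 0xE0 == 0xC0 = Just 2
  | b .&. 0xF0 == 0xE0 = Just 3
  | b .&. 0xF8 == 0xF0 = Just 4
  | otherwise          = Nothing

decodeStep :: [Word8] -> Step
decodeStep [] = Done
decodeStep bs@(b : rest) =
  case seqLength b of
    Nothing -> Invalid
    Just n
      | length bs < n -> NeedMore bs
      | otherwise ->
          let (conts, rest') = splitAt (n - 1) rest
              v = foldl addCont (leadBits n b) conts
          in if all isCont conts && v <= 0x10FFFF
               then Decoded (chr v) rest'
               else Invalid
  where
    isCont c = c .&. 0xC0 == 0x80
    addCont :: Int -> Word8 -> Int
    addCont acc c = (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F)
    -- Payload bits of the lead byte, by sequence length.
    leadBits :: Int -> Word8 -> Int
    leadBits 1 x = fromIntegral x
    leadBits 2 x = fromIntegral (x .&. 0x1F)
    leadBits 3 x = fromIntegral (x .&. 0x0F)
    leadBits _ x = fromIntegral (x .&. 0x07)

-- | Decode one chunk, prefixed by the bytes carried over from the previous
-- chunk. Returns the decoded text and the bytes to carry forward, or
-- Nothing on malformed input.
decodeChunk :: [Word8] -> [Word8] -> Maybe (String, [Word8])
decodeChunk carry chunk = go (carry ++ chunk)
  where
    go bs = case decodeStep bs of
      Done          -> Just ("", [])
      NeedMore left -> Just ("", left)
      Invalid       -> Nothing
      Decoded c bs' -> do
        (cs, left) <- go bs'
        Just (c : cs, left)
```

For instance, if the two bytes of `é` (`0xC3 0xA9`) are split across chunks,
`decodeChunk [] [0x61, 0xC3]` decodes `"a"` and hands back `[0xC3]` as carry,
which the next call can prepend. A non-streaming decoder has no such carry
and would have to reject or replace the truncated sequence.
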