summaryrefslogtreecommitdiff
path: root/ghc/compiler/utils/UnicodeUtil.lhs
Commit message (Collapse)AuthorAgeFilesLines
* [project @ 2006-01-06 16:30:17 by simonmar]simonmar2006-01-061-36/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for UTF-8 source files GHC finally has support for full Unicode in source files. Source files are now assumed to be UTF-8 encoded, and the full range of Unicode characters can be used, with classifications recognised using the implementation from Data.Char. This incedentally means that only the stage2 compiler will recognise Unicode in source files, because I was too lazy to port the unicode classifier code into libcompat. Additionally, the following synonyms for keywords are now recognised: forall symbol (U+2200) forall right arrow (U+2192) -> left arrow (U+2190) <- horizontal ellipsis (U+22EF) .. there are probably more things we could add here. This will break some source files if Latin-1 characters are being used. In most cases this should result in a UTF-8 decoding error. Later on if we want to support more encodings (perhaps with a pragma to specify the encoding), I plan to do it by recoding into UTF-8 before parsing. Internally, there were some pretty big changes: - FastStrings are now stored in UTF-8 - Z-encoding has been moved right to the back end. Previously we used to Z-encode every identifier on the way in for simplicity, and only decode when we needed to show something to the user. Instead, we now keep every string in its UTF-8 encoding, and Z-encode right before printing it out. To avoid Z-encoding the same string multiple times, the Z-encoding is cached inside the FastString the first time it is requested. This speeds up the compiler - I've measured some definite improvement in parsing at least, and I expect compilations overall to be faster too. It also cleans up a lot of cruft from the OccName interface. Z-encoding is nicely hidden inside the Outputable instance for Names & OccNames now. - StringBuffers are UTF-8 too, and are now represented as ForeignPtrs. - I've put together some test cases, not by any means exhaustive, but there are some interesting UTF-8 decoding error cases that aren't obvious. Also, take a look at unicode001.hs for a demo.
* [project @ 2002-04-11 12:03:29 by simonpj]simonpj2002-04-111-9/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ------------------- Mainly derived Read ------------------- This commit is a tangle of several things that somehow got wound up together, I'm afraid. The main course ~~~~~~~~~~~~~~~ Replace the derived-Read machinery with Koen's cunning new parser combinator library. The result should be * much smaller code sizes from derived Read * faster execution of derived Read WARNING: I have not thoroughly tested this stuff; I'd be glad if you did! All the hard work is done, but there may be a few nits. The Read class gets two new methods, not exposed in the H98 inteface of course: class Read a where readsPrec :: Int -> ReadS a readList :: ReadS [a] readPrec :: ReadPrec a -- NEW readListPrec :: ReadPrec [a] -- NEW There are the following new libraries: Text.ParserCombinators.ReadP Koens combinator parser Text.ParserCombinators.ReadPrec Ditto, but with precedences Text.Read.Lex An emasculated lexical analyser that provides the functionality of H98 'lex' TcGenDeriv is changed to generate code that uses the new libraries. The built-in instances of Read (List, Maybe, tuples, etc) use the new libraries. Other stuff ~~~~~~~~~~~ 1. Some fixes the the plumbing of external-core generation. Sigbjorn did most of the work earlier, but this commit completes the renaming and typechecking plumbing. 2. Runtime error-generation functions, such as GHC.Err.recSelErr, GHC.Err.recUpdErr, etc, now take an Addr#, pointing to a UTF8-encoded C string, instead of a Haskell string. This makes the *calls* to these functions easier to generate, and smaller too, which is a good thing. In particular, it means that MkId.mkRecordSelectorId doesn't need to be passed "unpackCStringId", which was GRUESOME; and that in turn means that tcTypeAndClassDecls doesn't need to be passed unf_env, which is a very worthwhile cleanup. Win/win situation. 3. GHC now faithfully translates do-notation using ">>" for statements with no binding, just as the report says. While I was there I tidied up HsDo to take a list of Ids instead of 3 (but now 4) separate Ids. Saves a bit of code here and there. Also introduced Inst.newMethodFromName to package a common idiom.
* [project @ 2001-02-18 14:45:15 by qrczak]qrczak2001-02-181-14/+1
| | | | | | Recent Unicode and future ISO-10646 finally decided that the character code space ends at U+10FFFF. Let ghc follow the rules: maxBound::Char is now '\x10FFFF', etc.
* [project @ 2000-11-07 13:12:21 by simonpj]simonpj2000-11-071-2/+2
| | | | More small changes
* [project @ 2000-08-07 23:37:19 by qrczak]qrczak2000-08-071-0/+46
Now Char, Char#, StgChar have 31 bits (physically 32). "foo"# is still an array of bytes. CharRep represents 32 bits (on a 64-bit arch too). There is also Int8Rep, used in those places where bytes were originally meant. readCharArray, indexCharOffAddr etc. still use bytes. Storable and {I,M}Array use wide Chars. In future perhaps all sized integers should be primitive types. Then some usages of indexing primops scattered through the code could be changed to then-available Int8 ones, and then Char variants of primops could be made wide (other usages that handle text should use conversion that will be provided later). I/O and _ccall_ arguments assume ISO-8859-1. UTF-8 is internally used for string literals (only). Z-encoding is ready for Unicode identifiers. Ranges of intlike and charlike closures are more easily configurable. I've probably broken nativeGen/MachCode.lhs:chrCode for Alpha but I don't know the Alpha assembler to fix it (what is zapnot?). Generally I'm not sure if I've done the NCG changes right. This commit breaks the binary compatibility (of course). TODO: * is* and to{Lower,Upper} in Char (in progress). * Libraries for text conversion (in design / experiments), to be plugged to I/O and a higher level foreign library. * PackedString. * StringBuffer and accepting source in encodings other than ISO-8859-1.