path: root/ghc/compiler/utils/Encoding.hs
Commit message | Author | Age | Files | Lines
* Reorganisation of the source tree | Simon Marlow | 2006-04-07 | 1 | -373/+0

Most of the other users of the fptools build system have migrated to Cabal, and with the move to darcs we can now flatten the source tree without losing history, so here goes.

The main change is that the ghc/ subdir is gone, and most of what it contained is now at the top level. The build system now makes no pretense at being multi-project, it is just the GHC build system.

No doubt this will break many things, and there will be a period of instability while we fix the dependencies. A straightforward build should work, but I haven't yet fixed binary/source distributions. Changes to the Building Guide will follow, too.

* [project @ 2006-01-10 14:39:38 by simonmar] | simonmar | 2006-01-10 | 1 | -1/+1

prevChar: don't back up over decoding errors

* [project @ 2006-01-09 13:29:02 by simonmar] | simonmar | 2006-01-09 | 1 | -2/+2

Avoid desugaring bug in HEAD (see test ds057).

* [project @ 2006-01-09 13:25:50 by simonmar] | simonmar | 2006-01-09 | 1 | -19/+6

Fix up to compile with GHC 5.04.x again. Also includes a fix for a memory error I discovered along the way: should fix the "scavenge_one" crash in the stage2 build of recent HEADs.

* [project @ 2006-01-06 16:30:17 by simonmar] | simonmar | 2006-01-06 | 1 | -0/+386

Add support for UTF-8 source files

GHC finally has support for full Unicode in source files. Source files are now assumed to be UTF-8 encoded, and the full range of Unicode characters can be used, with classifications recognised using the implementation from Data.Char. This incidentally means that only the stage2 compiler will recognise Unicode in source files, because I was too lazy to port the unicode classifier code into libcompat.

Additionally, the following synonyms for keywords are now recognised:

  forall symbol (U+2200)         forall
  right arrow (U+2192)           ->
  left arrow (U+2190)            <-
  horizontal ellipsis (U+22EF)   ..

There are probably more things we could add here.

This will break some source files if Latin-1 characters are being used. In most cases this should result in a UTF-8 decoding error. Later on if we want to support more encodings (perhaps with a pragma to specify the encoding), I plan to do it by recoding into UTF-8 before parsing.

Internally, there were some pretty big changes:

  - FastStrings are now stored in UTF-8

  - Z-encoding has been moved right to the back end. Previously we used to Z-encode every identifier on the way in for simplicity, and only decode when we needed to show something to the user. Instead, we now keep every string in its UTF-8 encoding, and Z-encode right before printing it out. To avoid Z-encoding the same string multiple times, the Z-encoding is cached inside the FastString the first time it is requested.

    This speeds up the compiler - I've measured some definite improvement in parsing at least, and I expect compilations overall to be faster too. It also cleans up a lot of cruft from the OccName interface. Z-encoding is nicely hidden inside the Outputable instance for Names & OccNames now.

  - StringBuffers are UTF-8 too, and are now represented as ForeignPtrs.

  - I've put together some test cases, not by any means exhaustive, but there are some interesting UTF-8 decoding error cases that aren't obvious.
Also, take a look at unicode001.hs for a demo.
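
The decoding error cases mentioned above can be illustrated with a small strict decoder. This is only a sketch and is not the code from Encoding.hs: the names decodeChar and decodeUtf8 are hypothetical, the input is a plain list of bytes rather than a StringBuffer, and reporting the byte offset of the first error is an assumption about what a caller might want.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- | Decode one character from the front of a byte list (hypothetical helper).
-- Returns the character and the remaining bytes, or Nothing on a decoding error.
decodeChar :: [Word8] -> Maybe (Char, [Word8])
decodeChar [] = Nothing
decodeChar (b0 : rest)
  | b0 < 0x80                = Just (chr (fromIntegral b0), rest)
  | b0 >= 0xC2 && b0 <= 0xDF = multi 1 (fromIntegral b0 .&. 0x1F) 0x80
  | b0 >= 0xE0 && b0 <= 0xEF = multi 2 (fromIntegral b0 .&. 0x0F) 0x800
  | b0 >= 0xF0 && b0 <= 0xF4 = multi 3 (fromIntegral b0 .&. 0x07) 0x10000
  | otherwise                = Nothing -- stray continuation byte, 0xC0/0xC1, or 0xF5..0xFF
  where
    -- n continuation bytes, payload bits of the lead byte, smallest code point
    -- that legitimately needs this sequence length.
    multi :: Int -> Int -> Int -> Maybe (Char, [Word8])
    multi n lead minCode =
      let (conts, rest') = splitAt n rest
          code = foldl (\acc b -> (acc `shiftL` 6) .|. (fromIntegral b .&. 0x3F)) lead conts
      in if length conts == n                            -- not truncated
            && all isCont conts                          -- every byte is 10xxxxxx
            && code >= minCode                           -- not an overlong encoding
            && code <= 0x10FFFF                          -- inside the Unicode range
            && not (code >= 0xD800 && code <= 0xDFFF)    -- not a surrogate
           then Just (chr code, rest')
           else Nothing

    isCont :: Word8 -> Bool
    isCont b = b .&. 0xC0 == 0x80

-- | Decode a whole byte sequence, or report the byte offset of the first error.
decodeUtf8 :: [Word8] -> Either Int String
decodeUtf8 = go 0
  where
    go _ [] = Right []
    go off bs = case decodeChar bs of
      Nothing        -> Left off
      Just (c, more) -> (c :) <$> go (off + length bs - length more) more
```

With these definitions, the bytes 0xE2 0x88 0x80 decode to the forall symbol (U+2200), while a Latin-1 file containing "caf" followed by the single byte 0xE9 is rejected at offset 3, which is the kind of breakage the commit message warns about. The less obvious rejections are overlong forms such as 0xC0 0x80 and encoded surrogates such as 0xED 0xA0 0x80.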