| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add support for UTF-8 source files
GHC finally has support for full Unicode in source files. Source
files are now assumed to be UTF-8 encoded, and the full range of
Unicode characters can be used, with classifications recognised using
the implementation from Data.Char. This incedentally means that only
the stage2 compiler will recognise Unicode in source files, because I
was too lazy to port the unicode classifier code into libcompat.
Additionally, the following synonyms for keywords are now recognised:
forall symbol (U+2200) forall
right arrow (U+2192) ->
left arrow (U+2190) <-
horizontal ellipsis (U+22EF) ..
there are probably more things we could add here.
This will break some source files if Latin-1 characters are being used.
In most cases this should result in a UTF-8 decoding error. Later on
if we want to support more encodings (perhaps with a pragma to specify
the encoding), I plan to do it by recoding into UTF-8 before parsing.
Internally, there were some pretty big changes:
- FastStrings are now stored in UTF-8
- Z-encoding has been moved right to the back end. Previously we
used to Z-encode every identifier on the way in for simplicity,
and only decode when we needed to show something to the user.
Instead, we now keep every string in its UTF-8 encoding, and
Z-encode right before printing it out. To avoid Z-encoding the
same string multiple times, the Z-encoding is cached inside the
FastString the first time it is requested.
This speeds up the compiler - I've measured some definite
improvement in parsing at least, and I expect compilations overall
to be faster too. It also cleans up a lot of cruft from the
OccName interface. Z-encoding is nicely hidden inside the
Outputable instance for Names & OccNames now.
- StringBuffers are UTF-8 too, and are now represented as
ForeignPtrs.
- I've put together some test cases, not by any means exhaustive,
but there are some interesting UTF-8 decoding error cases that
aren't obvious. Also, take a look at unicode001.hs for a demo.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
-------------------
Mainly derived Read
-------------------
This commit is a tangle of several things that somehow got wound up
together, I'm afraid.
The main course
~~~~~~~~~~~~~~~
Replace the derived-Read machinery with Koen's cunning new parser
combinator library. The result should be
* much smaller code sizes from derived Read
* faster execution of derived Read
WARNING: I have not thoroughly tested this stuff; I'd be glad if you did!
All the hard work is done, but there may be a few nits.
The Read class gets two new methods, not exposed
in the H98 inteface of course:
class Read a where
readsPrec :: Int -> ReadS a
readList :: ReadS [a]
readPrec :: ReadPrec a -- NEW
readListPrec :: ReadPrec [a] -- NEW
There are the following new libraries:
Text.ParserCombinators.ReadP Koens combinator parser
Text.ParserCombinators.ReadPrec Ditto, but with precedences
Text.Read.Lex An emasculated lexical analyser
that provides the functionality
of H98 'lex'
TcGenDeriv is changed to generate code that uses the new libraries.
The built-in instances of Read (List, Maybe, tuples, etc) use the new
libraries.
Other stuff
~~~~~~~~~~~
1. Some fixes the the plumbing of external-core generation. Sigbjorn
did most of the work earlier, but this commit completes the renaming and
typechecking plumbing.
2. Runtime error-generation functions, such as GHC.Err.recSelErr,
GHC.Err.recUpdErr, etc, now take an Addr#, pointing to a UTF8-encoded
C string, instead of a Haskell string. This makes the *calls* to these
functions easier to generate, and smaller too, which is a good thing.
In particular, it means that MkId.mkRecordSelectorId doesn't need to
be passed "unpackCStringId", which was GRUESOME; and that in turn means
that tcTypeAndClassDecls doesn't need to be passed unf_env, which is
a very worthwhile cleanup. Win/win situation.
3. GHC now faithfully translates do-notation using ">>" for statements
with no binding, just as the report says. While I was there I tidied
up HsDo to take a list of Ids instead of 3 (but now 4) separate Ids.
Saves a bit of code here and there. Also introduced Inst.newMethodFromName
to package a common idiom.
|
|
|
|
|
|
| |
Recent Unicode and future ISO-10646 finally decided that the character
code space ends at U+10FFFF. Let ghc follow the rules: maxBound::Char
is now '\x10FFFF', etc.
|
|
|
|
| |
More small changes
|
|
Now Char, Char#, StgChar have 31 bits (physically 32).
"foo"# is still an array of bytes.
CharRep represents 32 bits (on a 64-bit arch too). There is also
Int8Rep, used in those places where bytes were originally meant.
readCharArray, indexCharOffAddr etc. still use bytes. Storable and
{I,M}Array use wide Chars.
In future perhaps all sized integers should be primitive types. Then
some usages of indexing primops scattered through the code could
be changed to then-available Int8 ones, and then Char variants of
primops could be made wide (other usages that handle text should use
conversion that will be provided later).
I/O and _ccall_ arguments assume ISO-8859-1. UTF-8 is internally used
for string literals (only).
Z-encoding is ready for Unicode identifiers.
Ranges of intlike and charlike closures are more easily configurable.
I've probably broken nativeGen/MachCode.lhs:chrCode for Alpha but I
don't know the Alpha assembler to fix it (what is zapnot?). Generally
I'm not sure if I've done the NCG changes right.
This commit breaks the binary compatibility (of course).
TODO:
* is* and to{Lower,Upper} in Char (in progress).
* Libraries for text conversion (in design / experiments),
to be plugged to I/O and a higher level foreign library.
* PackedString.
* StringBuffer and accepting source in encodings other than ISO-8859-1.
|