The first phase of this tidyup is focussed on the header files, and in
particular making sure we are exposing publicly exactly what we need
to, and no more.
- Rts.h now includes everything that the RTS exposes publicly,
rather than a random subset of it.
- Most of the public header files have moved into subdirectories, and
many of them have been renamed. But clients should not need to
include any of the other headers directly, just #include the main
public headers: Rts.h, HsFFI.h, RtsAPI.h.
- All the headers needed for via-C compilation have moved into the
stg subdirectory, which is self-contained. Most of the headers for
the rest of the RTS APIs have moved into the rts subdirectory.
- I left MachDeps.h where it is, because it is so widely used in
Haskell code.
- I left a deprecated stub for RtsFlags.h in place. The flag
structures are now exposed by Rts.h.
- Various internal APIs are no longer exposed by public header files.
- Various bits of dead code and declarations have been removed
- More gcc warnings are turned on, and the RTS code is more
warning-clean.
- More source files #include "PosixSource.h", and hence only use
standard POSIX (1003.1c-1995) interfaces.
There is a lot more tidying up still to do; this is just the first
pass. I also intend to standardise the names for external RTS APIs
(e.g. use the rts_ prefix consistently), and declare the internal APIs
as hidden for shared libraries.
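
As a minimal sketch of what a client looks like after this change (illustrative only, not part of the patch): a C program embedding the RTS should need nothing beyond the public headers named above.

    /* Sketch: only the public headers are required; build flags depend
     * on your GHC installation (e.g. an -I pointing at its include dir). */
    #include "HsFFI.h"            /* hs_init / hs_exit        */
    #include "Rts.h"              /* the public RTS interface */

    int main(int argc, char *argv[])
    {
        hs_init(&argc, &argv);    /* start the RTS */

        /* ... call foreign-exported Haskell functions here ... */

        hs_exit();                /* shut the RTS down again */
        return 0;
    }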
New flag: "+RTS -qb" disables load-balancing in the parallel GC
(though this is subject to change, I think we will probably want to do
something more automatic before releasing this).
To get the "PARGC3" configuration described in the "Runtime support
for Multicore Haskell" paper, use "+RTS -qg0 -qb -RTS".
The main advantage of this is that it allows us to easily disable
load-balancing altogether, which turns out to be important in parallel
programs. Maintaining locality is sometimes more important than
spreading the work out in parallel GC. There is a side benefit in
that the parallel GC should have improved locality even when
load-balancing, because each processor prefers to take work from its
own queue before stealing from others.
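
A sketch of the queue discipline described here (hypothetical names and types, not the actual RTS code): each GC thread drains its own work queue first and only steals when load-balancing is enabled, i.e. when -qb is not given.

    #include <stddef.h>

    #define N_GC_THREADS 4

    typedef struct { void *items[64]; int top; } work_queue;

    static work_queue queues[N_GC_THREADS];
    static int load_balancing = 1;             /* +RTS -qb sets this to 0 */

    static void *pop(work_queue *q)
    {
        return q->top > 0 ? q->items[--q->top] : NULL;
    }

    static void *grab_work(int me)
    {
        void *b = pop(&queues[me]);            /* own queue first: likely  */
        if (b || !load_balancing)              /* still in this CPU's cache; */
            return b;                          /* with -qb we never steal  */
        for (int i = 0; i < N_GC_THREADS; i++)
            if (i != me && (b = pop(&queues[i])) != NULL)
                return b;                      /* otherwise steal          */
        return NULL;
    }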
This turns out to be quite vital for parallel programs:
- The way we discover which threads to traverse is by finding
dirty threads via the remembered sets (aka mutable lists).
- A dirty thread will be on the remembered set of the capability
that was running it, and we really want to traverse that thread's
stack using the GC thread for the capability, because it is in
that CPU's cache. If we get this wrong, we get penalised badly by
the memory system.
Previously we had per-capability mutable lists but they were
aggregated before GC and traversed by just one of the GC threads.
This resulted in very poor performance, particularly for parallel
programs with deep stacks.
Now we keep per-capability remembered sets throughout GC, which also
removes a lock (recordMutableGen_sync).
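
A sketch of the arrangement (simplified, hypothetical field names, not the real RTS structures): each capability records dirty objects on its own remembered set, and at GC time the GC thread for that capability walks that set itself instead of one thread walking an aggregated list.

    typedef struct rs_cell_ {
        void            *obj;                 /* the dirty object */
        struct rs_cell_ *next;
    } rs_cell;

    typedef struct {
        rs_cell *remembered_set;              /* per-capability mutable list */
        /* ... run queue, spark pool, ... */
    } capability;

    /* Only the owning capability pushes here, so no global lock
     * (the old recordMutableGen_sync) is needed. */
    static void record_dirty(capability *cap, void *obj, rs_cell *cell)
    {
        cell->obj  = obj;
        cell->next = cap->remembered_set;
        cap->remembered_set = cell;
    }

    /* At GC time, GC thread i traverses capability i's remembered_set,
     * keeping the work on the CPU whose cache already holds it. */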
Previously, the GC had its own pool of threads to use as workers when
doing parallel GC. There was a "leader", which was the mutator thread
that initiated the GC, and the other threads were taken from the pool.
This was simple and worked fine for sequential programs, where we did
most of the benchmarking for the parallel GC, but falls down for
parallel programs. When we have N mutator threads and N cores, at GC
time we would have to stop N-1 mutator threads and start up N-1 GC
threads, and hope that the OS schedules them all onto separate cores.
In practice it doesn't, as you might expect.
Now we use the mutator threads to do GC. This works quite nicely,
particularly for parallel programs, where each mutator thread scans
its own spark pool, which is probably in its cache anyway.
There are some flag changes:
-g<n> is removed (-g1 is still accepted for backwards compat).
There's no way to have a different number of GC threads than mutator
threads now.
-q1 Use one OS thread for GC (turns off parallel GC)
-qg<n> Use parallel GC for generations >= <n> (default: 1)
Using parallel GC only for generations >=1 works well for sequential
programs. Compiling an ordinary sequential program with -threaded and
running it with -N2 or more should help if you do a lot of GC. I've
found that adding -qg0 (do parallel GC for generation 0 too) speeds up
some parallel programs, but slows down some sequential programs.
Being conservative, I left the threshold at 1.
ToDo: document the new options.
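
For an embedded use of the RTS, a sketch of how these options can be supplied (hs_init() picks up +RTS ... -RTS options from argv; an ordinary Haskell executable just takes the same flags on its command line):

    #include "HsFFI.h"

    int main(void)
    {
        /* Equivalent in spirit to running a -threaded program with
         *   ./prog +RTS -N2 -qg0 -RTS                                 */
        char *args[] = { "prog", "+RTS", "-N2", "-qg0", "-RTS", NULL };
        int   argc   = 5;
        char **argv  = args;

        hs_init(&argc, &argv);
        /* ... run Haskell code ... */
        hs_exit();
        return 0;
    }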
The macros were duplicating their arguments, which was normally
harmless, but in the parallel GC was actually wrong and caused
spurious assertion failures.
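
An illustrative example of the general problem (not the specific macros touched here): a macro that mentions its argument more than once evaluates it more than once.

    #define BAD_MIN(a,b)  ((a) < (b) ? (a) : (b))

    /* BAD_MIN(*p++, lim) expands to ((*p++) < (lim) ? (*p++) : (lim)),
     * so p may be advanced twice.  That is merely wasteful when the
     * argument has no side effects, but wrong when it does -- hence the
     * spurious assertion failures in the parallel GC.                  */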
Sometimes better than the default copying, enabled by +RTS -w
Instead of keeping a single list of all threads, keep one per step
and only look at the threads belonging to steps that we are
collecting.
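
A sketch of the resulting shape (simplified types, not the real declarations): each step carries its own chain of threads, so a collection only touches the threads that live in the steps being collected.

    typedef struct tso_ {
        struct tso_ *step_link;               /* next thread in this step */
        /* ... stack, state, ... */
    } tso;

    typedef struct step_ {
        tso *threads;                         /* threads belonging to this step */
        /* ... to-space, block lists, ... */
    } step;

    static void scavenge_thread(tso *t) { (void)t; /* evacuate its stack */ }

    static void scavenge_collected_threads(step *collected[], int n)
    {
        for (int i = 0; i < n; i++)
            for (tso *t = collected[i]->threads; t != NULL; t = t->step_link)
                scavenge_thread(t);
    }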
- GCAux.c contains code not compiled with the gct register enabled;
it is callable from outside the GC
- marking functions are moved to their relevant subsystems, outside
the GC
- mark_root needs to save the gct register, as it is called from
outside the GC
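
A sketch of the save/restore discipline the last point refers to (simplified: a plain global stands in for gct, which in the threaded RTS lives in a reserved machine register):

    typedef struct gc_thread_ { int index; /* ... workspaces, stats ... */ } gc_thread;

    static gc_thread *gct;            /* stand-in for the register variable */

    static void evacuate_root(void **root) { (void)root; /* uses gct */ }

    /* Called from outside the GC proper (while other subsystems mark
     * their roots), so it must install gct itself and restore the
     * previous value afterwards. */
    static void mark_root(void *user, void **root)
    {
        gc_thread *saved = gct;
        gct = (gc_thread *)user;
        evacuate_root(root);
        gct = saved;
    }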
- count and report number of parallel collections
- calculate bytes scanned in addition to bytes copied per thread
- calculate "work balance factor"
- tidy up the formatting a bit
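
One plausible reading of the "work balance factor", sketched below (an illustration of the idea, not necessarily the exact formula used): compare the total work done in a parallel collection against the busiest thread, so 1 means effectively serial and n_threads means perfectly balanced.

    #include <stdint.h>

    static double work_balance(const uint64_t copied[], int n_threads)
    {
        uint64_t total = 0, max = 0;
        for (int i = 0; i < n_threads; i++) {
            total += copied[i];
            if (copied[i] > max) max = copied[i];
        }
        return max == 0 ? 1.0 : (double)total / (double)max;
    }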
This means we can calculate slop easily, and also improve
predictability of GC.
When a stack is occupying less than 1/4 of the memory it owns, and is
larger than a megablock, we release half of it. Shrinking is O(1); it
doesn't need to copy the stack.
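
A sketch of that policy (simplified, hypothetical fields; sizes in words):

    #include <stddef.h>

    #define MBLOCK_WORDS  ((1024 * 1024) / sizeof(void *))   /* illustrative */

    typedef struct {
        size_t stack_size;            /* words this stack currently owns */
        size_t stack_used;            /* words actually in use           */
    } stack_info;

    /* O(1): we only adjust the size we keep; the contents are not copied. */
    static size_t shrunk_stack_size(const stack_info *s)
    {
        if (s->stack_size > MBLOCK_WORDS && s->stack_used < s->stack_size / 4)
            return s->stack_size / 2;         /* release the unused half */
        return s->stack_size;
    }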
in addition to checking for leaks
This patch localises the state of the GC into a gc_thread structure,
and reorganises the inner loop of the GC to scavenge one block at a
time from global work lists in each "step". The gc_thread structure
has a "workspace" for each step, in which it collects evacuated
objects until it has a full block to push out to the step's global
list. Details of the algorithm will be on the wiki in due course.
At the moment, THREADED_RTS does not compile, but the single-threaded
GC works (and is 10-20% slower than before).
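
A sketch of the shape of these structures (simplified; the details will be on the wiki as noted):

    typedef struct bdescr_ bdescr;            /* block descriptor, opaque here */

    typedef struct step_workspace_ {
        bdescr *todo_bd;              /* private block being filled by evacuate */
        bdescr *done_list;            /* full blocks, pushed to the step's      */
        /* ... */                     /* global list when convenient            */
    } step_workspace;

    typedef struct gc_thread_ {
        int             index;        /* which GC thread this is       */
        step_workspace *workspaces;   /* one workspace per step        */
        /* ... per-thread stats (copied, scanned), scan pointers, ...  */
    } gc_thread;

    /* Inner loop, roughly: take one block at a time from a step's global
     * work list, scavenge it, and push any blocks we filled back out.   */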
For some reason this causes build failures for me in my 32-bit chroot,
Now allocate() is a synonym for allocateInGen().
I also made various cleanups: there is now less special-case code for
supporting -G1 (two-space collection), and -G1 now works with
-threaded.
This patch implements pointer tagging as per our ICFP'07 paper "Faster
laziness using dynamic pointer tagging". It improves performance by
10-15% for most workloads, including GHC itself.
The original patches were by Alexey Rodriguez Yakushev
<mrchebas@gmail.com>, with additions and improvements by me. I've
re-recorded the development as a single patch.
The basic idea is this: we use the low 2 bits of a pointer to a heap
object (3 bits on a 64-bit architecture) to encode some information
about the object pointed to. For a constructor, we encode the "tag"
of the constructor (e.g. True vs. False); for a function closure, its
arity. This enables some decisions to be made without dereferencing
the pointer, which speeds up some common operations. In particular it
enables us to avoid costly indirect jumps in many cases.
More information in the commentary:
http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/HaskellExecution/PointerTagging
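
A sketch of the encoding (illustrative helpers, not the RTS's actual macros):

    #include <stdint.h>

    #define TAG_BITS  2                        /* 3 on 64-bit platforms    */
    #define TAG_MASK  (((uintptr_t)1 << TAG_BITS) - 1)

    static inline uintptr_t get_tag(const void *p)
    {
        return (uintptr_t)p & TAG_MASK;        /* 0 means "no information" */
    }

    static inline void *untag(const void *p)
    {
        return (void *)((uintptr_t)p & ~TAG_MASK);
    }

    static inline void *tag(const void *p, uintptr_t t)
    {
        return (void *)((uintptr_t)p | t);
    }

    /* For Bool, say, a pointer to False can carry tag 1 and a pointer to
     * True tag 2, so a case expression can branch on the tag without
     * entering (or even dereferencing) the closure.                     */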
We recently discovered that they aren't a win any more, and just cost
code size.
Since thunks grew an extra padding word in GHC 6.6, closure_sizeW()
has been wrong for AP closures because it assumed compatible layout
between PAPs and APs. One symptom is that the compacting GC would
crash if it encountered an AP. APs only crop up in GHCi or
when using asynchronous exceptions.
Fixes #1010
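
A sketch of why a shared size formula goes wrong here (simplified layouts, not the real declarations): APs are thunks, so their header is a word larger than a PAP's, and each needs its own calculation.

    typedef struct { void *info; }            std_header;    /* simplified */
    typedef struct { void *info; void *pad; } thunk_header;  /* extra word */

    typedef struct { std_header   h; unsigned n_args; void *fun; void *payload[]; } pap;
    typedef struct { thunk_header h; unsigned n_args; void *fun; void *payload[]; } ap;

    #define sizeofW(t)    (sizeof(t) / sizeof(void *))

    #define pap_sizeW(p)  (sizeofW(pap) + (p)->n_args)
    #define ap_sizeW(a)   (sizeofW(ap)  + (a)->n_args)   /* not pap_sizeW */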
In preparation for parallel GC, split up the monolithic GC.c file into
smaller parts. Also in this patch (and difficult to separate,
unfortunately):
- Don't include Stable.h in Rts.h, instead just include it where
necessary.
- Consistently use STATIC_INLINE in source files, and INLINE_HEADER
in header files. STATIC_INLINE is now turned off when DEBUG is on,
to make debugging easier.
- The GC no longer takes the get_roots function as an argument.
We weren't making use of this generalisation.
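
A sketch of the convention (illustrative definitions): STATIC_INLINE is for helpers private to one .c file, INLINE_HEADER for functions defined in header files, and under DEBUG the former degrades to a plain static function so it is easier to breakpoint.

    #if defined(DEBUG)
    #define STATIC_INLINE static               /* visible to the debugger */
    #else
    #define STATIC_INLINE static inline
    #endif

    #define INLINE_HEADER static inline        /* for use in header files */

    /* In a source file: */
    STATIC_INLINE int is_marked(unsigned bits, int i) { return (bits >> i) & 1u; }

    /* The same helper defined in a header would use INLINE_HEADER instead,
     * so it can be included from many source files. */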
See bug #738
Allocating executable memory is getting more difficult these days. In
particular, the default SELinux policy on Fedora Core 5 disallows
making the heap (i.e. malloc()'d memory) executable, although it does
apparently allow mmap()'ing anonymous executable memory by default.
Previously, stgMallocBytesRWX() used malloc() underneath, and then
tried to make the page holding the memory executable. This was rather
hacky and fails with Fedora Core 5.
This patch adds a mini-allocator for executable memory, based on the
block allocator. We grab page-sized blocks and make them executable,
then allocate small objects from the page. There's a simple free
function that will free whole pages back to the system when they are
empty.
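
A sketch of the underlying approach (not the RTS's actual allocator, which sits on top of the block allocator): obtain anonymous executable pages from mmap() and carve small objects out of them, instead of trying to change the protection on malloc()'d memory.

    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char  *exec_page;     /* page we are currently carving up */
    static size_t exec_used;     /* bytes already handed out from it */

    /* Assumes n is small relative to a page; the real thing also tracks
     * per-page usage so that empty pages can be given back to the OS.  */
    static void *alloc_exec(size_t n)
    {
        size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);

        if (exec_page == NULL || exec_used + n > pagesize) {
            void *p = mmap(NULL, pagesize,
                           PROT_READ | PROT_WRITE | PROT_EXEC,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                return NULL;
            exec_page = p;
            exec_used = 0;
        }
        void *ret = exec_page + exec_used;
        exec_used += n;
        return ret;
    }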
Most of the other users of the fptools build system have migrated to
Cabal, and with the move to darcs we can now flatten the source tree
without losing history, so here goes.
The main change is that the ghc/ subdir is gone, and most of what it
contained is now at the top level. The build system now makes no
pretense of being multi-project; it is just the GHC build system.
No doubt this will break many things, and there will be a period of
instability while we fix the dependencies. A straightforward build
should work, but I haven't yet fixed binary/source distributions.
Changes to the Building Guide will follow, too.