<HTML>
<HEAD>
    <TITLE> Conservative GC Algorithmic Overview </TITLE>
    <AUTHOR> Hans-J. Boehm, HP Labs (Some of this was written at SGI)</author>
</HEAD>
<BODY>
<H1> <I>This is under construction, and may always be.</i> </h1>
<H1> Conservative GC Algorithmic Overview </h1>
<P>
This is a description of the algorithms and data structures used in our
conservative garbage collector.  I expect the level of detail to increase
with time.  For a survey of GC algorithms, see for example
<A HREF="ftp://ftp.cs.utexas.edu/pub/garbage/gcsurvey.ps"> Paul Wilson's
excellent paper</a>.  For an overview of the collector interface,
see <A HREF="gcinterface.html">here</a>.
<P>
This description is targeted primarily at someone trying to understand the
source code.  It specifically refers to variable and function names.
It may also be useful for understanding the algorithms at a higher level.
<P>
The description here assumes that the collector is used in default mode.
In particular, we assume that it is used as a garbage collector, and not just
a leak detector.  We initially assume that it is used in stop-the-world,
non-incremental mode, though the presence of the incremental collector
will be apparent in the design.
We assume the default finalization model, but the code affected by that
is very localized.
<H2> Introduction </h2>
The garbage collector uses a modified mark-sweep algorithm.  Conceptually
it operates roughly in four phases, which are performed occasionally
as part of a memory allocation:

<OL>

<LI>
<I>Preparation</i> Each object has an associated mark bit.
Clear all mark bits, indicating that all objects
are potentially unreachable.

<LI>
<I>Mark phase</i> Marks all objects that may be reachable via chains of
pointers from variables.  Often the collector has no real information
about the location of pointer variables in the heap, so it
views all static data areas, stacks and registers as potentially containing
pointers.  Any bit patterns that represent addresses inside
heap objects managed by the collector are viewed as pointers.
Unless the client program has made heap object layout information
available to the collector, any heap objects found to be reachable from
variables are again scanned similarly.

<LI>
<I>Sweep phase</i> Scans the heap for inaccessible, and hence unmarked,
objects, and returns them to an appropriate free list for reuse.  This is
not really a separate phase; even in non-incremental mode this operation
is usually performed on demand during an allocation that discovers an empty
free list.  Thus the sweep phase is very unlikely to touch a page that
would not have been touched shortly thereafter anyway.

<LI>
<I>Finalization phase</i> Unreachable objects which had been registered
for finalization are enqueued for finalization outside the collector.

</ol>

<P>
The remaining sections describe the memory allocation data structures,
and then the last 3 collection phases in more detail. We conclude by
outlining some of the additional features implemented in the collector.

<H2>Allocation</h2>
The collector includes its own memory allocator.  The allocator obtains
memory from the system in a platform-dependent way.  Under UNIX, it
uses either <TT>malloc</tt>, <TT>sbrk</tt>, or <TT>mmap</tt>.
<P>
Most static data used by the allocator, as well as that needed by the
rest of the garbage collector is stored inside the
<TT>_GC_arrays</tt> structure.
This allows the garbage collector to easily ignore the collector's own
data structures when it searches for root pointers.  Other allocator
and collector internal data structures are allocated dynamically
with <TT>GC_scratch_alloc</tt>. <TT>GC_scratch_alloc</tt> does not
allow for deallocation, and is therefore used only for permanent data
structures.
<P>
The allocator allocates objects of different <I>kinds</i>.
Different kinds are handled somewhat differently by certain parts
of the garbage collector.  Certain kinds are scanned for pointers,
others are not.  Some may have per-object type descriptors that
determine pointer locations.  Or a specific kind may correspond
to one specific object layout.  Two built-in kinds are uncollectable.
One (<TT>STUBBORN</tt>) is immutable without special precautions.
In spite of that, it is very likely that most C clients of the
collector currently
use at most two kinds: <TT>NORMAL</tt> and <TT>PTRFREE</tt> objects.
The <A HREF="http://gcc.gnu.org/java">gcj</a> runtime also makes
heavy use of a kind (allocated with GC_gcj_malloc) that stores
type information at a known offset in method tables.
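<P>
For illustration: most C clients see kinds only implicitly, through the
choice of allocation call.  The following minimal sketch (assuming the
conventional <TT>gc.h</tt> header location) allocates a <TT>NORMAL</tt>
object holding a pointer to a <TT>PTRFREE</tt> buffer:
<PRE>
#include &lt;gc.h&gt;
#include &lt;stddef.h&gt;

struct node {
    struct node *next;
    double *data;               /* points to the buffer below */
};

struct node *make_node(size_t n)
{
    /* NORMAL kind: the collector scans the object for pointers. */
    struct node *p = GC_malloc(sizeof(struct node));

    /* PTRFREE kind: never scanned, so it cannot be the source of
       false references, and marking it is cheaper.  Note that,
       unlike GC_malloc, the result is not zero-initialized. */
    p->data = GC_malloc_atomic(n * sizeof(double));

    p->next = NULL;
    return p;
}
</pre>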
<P>
The collector uses a two-level allocator.  A large block is defined to
be one larger than half of <TT>HBLKSIZE</tt>, which is a power of 2,
typically on the order of the page size.
<P>
Large block sizes are rounded up to
the next multiple of <TT>HBLKSIZE</tt> and then allocated by
<TT>GC_allochblk</tt>.  Recent versions of the collector
use an approximate best fit algorithm by keeping free lists for
several large block sizes.
The actual
implementation of <TT>GC_allochblk</tt>
is significantly complicated by black-listing issues
(see below).
<P>
Small blocks are allocated in chunks of size <TT>HBLKSIZE</tt>.
Each chunk is
dedicated to only one object size and kind.
<P>
The allocator maintains
separate free lists for each size and kind of object.
Associated with each kind is an array of free list pointers,
with entry <TT>freelist[</tt><I>i</i><TT>]</tt> pointing to
a free list of size <I>i</i> objects.
In recent versions of the
collector, index <TT>i</tt> is expressed in granules, which are the
minimum allocatable unit, typically 8 or 16 bytes.
The free lists themselves are
linked through the first word in each object (see <TT>obj_link()</tt>
macro).
<P>
Once a large block is split for use in smaller objects, it can only
be used for objects of that size, unless the collector discovers a completely
empty chunk.  Completely empty chunks are restored to the appropriate
large block free list.
<P>
In order to avoid allocating blocks for too many distinct object sizes,
the collector normally does not directly allocate objects of every possible
request size.  Instead, requests are rounded up to one of a smaller number
of allocated sizes, for which free lists are maintained.  The exact
allocated sizes are computed on demand, but subject to the constraint
that they increase roughly in geometric progression.  Thus objects
requested early in the execution are likely to be allocated with exactly
the requested size, subject to alignment constraints.
See <TT>GC_init_size_map</tt> for details.
<P>
The actual size rounding operation during small object allocation is
implemented as a table lookup in <TT>GC_size_map</tt> which maps
a requested allocation size in bytes to a number of granules.
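<P>
Conceptually, the small object allocation fast path combines the
<TT>GC_size_map</tt> lookup with a free list pop.  The following is a
much-simplified sketch: <TT>GC_size_map</tt> and <TT>obj_link</tt>
correspond to real names, while <TT>freelist</tt> and
<TT>alloc_slow</tt> stand in for the per-kind machinery, locking, and
initialization checks handled by the real code:
<PRE>
#include &lt;stddef.h&gt;

#define MAXOBJGRANULES 128                  /* illustrative bound  */

static size_t GC_size_map[512];             /* bytes -> granules   */
static void *freelist[MAXOBJGRANULES + 1];  /* per-size free lists */
void *alloc_slow(size_t bytes);             /* sweep, collect, or
                                               expand the heap     */

#define obj_link(p) (*(void **)(p))  /* link is the first word */

void *alloc_small(size_t bytes)
{
    size_t granules = GC_size_map[bytes];  /* table-driven rounding */
    void *result = freelist[granules];

    if (result == NULL)           /* empty list: take the slow path */
        return alloc_slow(bytes);
    freelist[granules] = obj_link(result);  /* pop the first object */
    return result;
}
</pre>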
<P>
Both collector initialization and computation of allocated sizes are
handled carefully so that they do not slow down the small object fast
allocation path.  An attempt to allocate before the collector is initialized,
or before the appropriate <TT>GC_size_map</tt> entry is computed,
will take the same path as an allocation attempt with an empty free list.
This results in a call to the slow path code (<TT>GC_generic_malloc_inner</tt>)
which performs the appropriate initialization checks.
<P>
In non-incremental mode, we make a decision about whether to garbage collect
whenever an allocation would otherwise have failed with the current heap size.
If the total amount of allocation since the last collection is less than
the heap size divided by <TT>GC_free_space_divisor</tt>, we try to
expand the heap.  Otherwise, we initiate a garbage collection.  This ensures
that the amount of garbage collection work per allocated byte remains
constant.
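<P>
In outline, the decision looks like the following sketch.
<TT>GC_free_space_divisor</tt> is a real, client-settable variable;
the remaining names are illustrative:
<PRE>
#include &lt;stddef.h&gt;

static size_t heap_size;       /* current heap size in bytes       */
static size_t bytes_since_gc;  /* allocation since last collection */
static size_t GC_free_space_divisor = 3;    /* the usual default   */

void expand_heap(size_t needed);
void collect(void);

/* Called when an allocation would otherwise fail. */
void grow_or_collect(size_t needed)
{
    if (bytes_since_gc &lt; heap_size / GC_free_space_divisor) {
        expand_heap(needed);  /* too little allocation to justify GC */
    } else {
        collect();            /* keeps GC work per allocated byte
                                 roughly constant */
        bytes_since_gc = 0;
    }
}
</pre>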
<P>
The above is in fact an oversimplification of the real heap expansion
and GC triggering heuristic, which adjusts slightly for root size
and certain kinds of
fragmentation.  In particular:
<UL>
<LI> Programs with a large root set size and
little live heap memory will expand the heap to amortize the cost of
scanning the roots.
<LI> Versions 5.x of the collector actually collect more frequently in
nonincremental mode.  The large block allocator usually refuses to split
large heap blocks once the garbage collection threshold is
reached.  This often has the effect of collecting well before the
heap fills up, thus reducing fragmentation and working set size at the
expense of GC time.  Versions 6.x choose an intermediate strategy depending
on how much large object allocation has taken place in the past.
(If the collector is configured to unmap unused pages, versions 6.x
use the 5.x strategy.)
<LI> In calculating the amount of allocation since the last collection we
give partial credit for objects we expect to be explicitly deallocated.
Even if all objects are explicitly managed, it is often desirable to collect
on rare occasion, since that is our only mechanism for coalescing completely
empty chunks.
</ul>
<P>
It has been suggested that this should be adjusted so that we favor
expansion if the resulting heap still fits into physical memory.
In many cases, that would no doubt help.  But it is tricky to do this
in a way that remains robust if multiple applications are contending
for a single pool of physical memory.

<H2>Mark phase</h2>

At each collection, the collector marks all objects that are
possibly reachable from pointer variables.  Since it cannot generally
tell where pointer variables are located, it scans the following
<I>root segments</i> for pointers (an example of registering an
additional root segment appears after the list):
<UL>
<LI>The registers.  Depending on the architecture, this may be done using
assembly code, or by calling a <TT>setjmp</tt>-like function which saves
register contents on the stack.
<LI>The stack(s).  In the case of a single-threaded application,
on most platforms this
is done by scanning the memory between (an approximation of) the current
stack pointer and <TT>GC_stackbottom</tt>.  (For Itanium, the register stack
is scanned separately.)  The <TT>GC_stackbottom</tt> variable is set in
a highly platform-specific way depending on the appropriate configuration
information in <TT>gcconfig.h</tt>.  Note that the currently active
stack needs to be scanned carefully, since callee-save registers of
client code may appear inside collector stack frames, which may
change during the mark process.  This is addressed by scanning
some sections of the stack "eagerly", effectively capturing a snapshot
at one point in time.
<LI>Static data region(s).  In the simplest case, this is the region
between <TT>DATASTART</tt> and <TT>DATAEND</tt>, as defined in
<TT>gcconfig.h</tt>.  However, in most cases, this will also involve
static data regions associated with dynamic libraries.  These are
identified by the mostly platform-specific code in <TT>dyn_load.c</tt>.
</ul>
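<P>
Root segments for the main program and loaded libraries are found
automatically, but a client can register additional ones through the
public interface; for example:
<PRE>
#include &lt;gc.h&gt;

/* A region obtained behind the collector's back (e.g. by a direct
   mmap call) that nevertheless stores pointers to GC objects. */
extern char *region_start, *region_end;

void register_my_roots(void)
{
    /* Have the marker scan [region_start, region_end) as a root
       segment on every collection. */
    GC_add_roots(region_start, region_end);
}
</pre>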
The marker maintains an explicit stack of memory regions that are known
to be accessible, but that have not yet been searched for contained pointers.
Each stack entry contains the starting address of the block to be scanned,
as well as a descriptor of the block.  If no layout information is
available for the block, then the descriptor is simply a length.
(For other possibilities, see <TT>gc_mark.h</tt>.)
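<P>
In outline, a mark stack entry looks like the following sketch, in the
spirit of the collector's <TT>GC_ms_entry</tt> (the exact definition
lives in the private headers):
<PRE>
typedef unsigned long word;  /* stand-in for the collector's GC_word */

struct ms_entry {
    char *start;  /* starting address of the region to be scanned */
    word descr;   /* layout descriptor; in the simplest case, just
                     the length of the region in bytes */
};
</pre>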
<P>
At the beginning of the mark phase, all root segments
(as described above) are pushed on the
stack by <TT>GC_push_roots</tt>.  (Registers and eagerly processed
stack sections are processed by pushing the referenced objects instead
of the stack section itself.)  If <TT>ALL_INTERIOR_PTRS</tt> is not
defined, then stack roots require special treatment.  In this case, the
normal marking code ignores interior pointers, but <TT>GC_push_all_stack</tt>
explicitly checks for interior pointers and pushes descriptors for target
objects.
<P>
The marker is structured to allow incremental marking.
Each call to <TT>GC_mark_some</tt> performs a small amount of
work towards marking the heap.
It maintains
explicit state in the form of <TT>GC_mark_state</tt>, which
identifies a particular sub-phase.  Some other pieces of state, most
notably the mark stack, identify how much work remains to be done
in each sub-phase.  The normal progression of mark states for
a stop-the-world collection is:
<OL>
<LI> <TT>MS_INVALID</tt> indicating that there may be accessible unmarked
objects.  In this case <TT>GC_objects_are_marked</tt> will simultaneously
be false, so the mark state is advanced to
<LI> <TT>MS_PUSH_UNCOLLECTABLE</tt> indicating that it suffices to push
uncollectable objects, roots, and then mark everything reachable from them.
<TT>Scan_ptr</tt> is advanced through the heap until all uncollectable
objects are pushed, and objects reachable from them are marked.
At that point, the next call to <TT>GC_mark_some</tt> calls
<TT>GC_push_roots</tt> to push the roots.  It then advances the
mark state to
<LI> <TT>MS_ROOTS_PUSHED</tt> asserting that once the mark stack is
empty, all reachable objects are marked.  Once in this state, we work
only on emptying the mark stack.  Once this is completed, the state
changes to
<LI> <TT>MS_NONE</tt> indicating that reachable objects are marked.
</ol>
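<P>
Schematically, <TT>GC_mark_some</tt> behaves like the skeleton below.
This is a much-simplified sketch: the helper functions are
illustrative, and the real code in <TT>mark.c</tt> handles additional
states and the incremental and parallel cases:
<PRE>
typedef enum {
    MS_INVALID,            /* accessible unmarked objects may exist */
    MS_PUSH_UNCOLLECTABLE, /* pushing uncollectable objects + roots */
    MS_ROOTS_PUSHED,       /* empty mark stack => marking complete  */
    MS_NONE                /* reachable objects are marked          */
} mark_state_t;

static mark_state_t GC_mark_state = MS_INVALID;

int push_next_uncollectable_block(void);  /* advances scan pointer */
void push_roots(void);                    /* cf. GC_push_roots     */
int mark_stack_nonempty(void);
void mark_from_stack(void);               /* cf. GC_mark_from      */

/* Perform a small increment of marking work.
   Returns nonzero when the mark phase is complete. */
int mark_some(void)
{
    switch (GC_mark_state) {
    case MS_INVALID:
        GC_mark_state = MS_PUSH_UNCOLLECTABLE;
        return 0;
    case MS_PUSH_UNCOLLECTABLE:
        if (!push_next_uncollectable_block()) {
            push_roots();
            GC_mark_state = MS_ROOTS_PUSHED;
        }
        return 0;
    case MS_ROOTS_PUSHED:
        if (mark_stack_nonempty())
            mark_from_stack();
        else
            GC_mark_state = MS_NONE;
        return 0;
    default:               /* MS_NONE */
        return 1;
    }
}
</pre>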

The core mark routine, <TT>GC_mark_from</tt>, is called
repeatedly by several of the sub-phases when the mark stack starts to fill
up.  It is also called repeatedly in <TT>MS_ROOTS_PUSHED</tt> state
to empty the mark stack.
The routine is designed to only perform a limited amount of marking at
each call, so that it can also be used by the incremental collector.
It is fairly carefully tuned, since it usually consumes a large majority
of the garbage collection time.
<P>
The fact that it performs only a small amount of work per call also
allows it to be used as the core routine of the parallel marker.  In that
case it is normally invoked on thread-private mark stacks instead of the
global mark stack.  More details can be found in
<A HREF="scale.html">scale.html</a>.
<P>
The marker correctly handles mark stack overflows.  Whenever the mark stack
overflows, the mark state is reset to <TT>MS_INVALID</tt>.
Since there are already marked objects in the heap,
this eventually forces a complete
scan of the heap, searching for pointers, during which any unmarked objects
referenced by marked objects are again pushed on the mark stack.  This
process is repeated until the mark phase completes without a stack overflow.
Each time the stack overflows, an attempt is made to grow the mark stack.
All pieces of the collector that push regions onto the mark stack have to be
careful to ensure forward progress, even in case of repeated mark stack
overflows.  Every mark attempt results in additional marked objects.
<P>
Each mark stack entry is processed by examining all candidate pointers
in the range described by the entry.  If the region has no associated
type information, then this typically requires that each 4-byte aligned
quantity (8-byte aligned with 64-bit pointers) be considered a candidate
pointer.
<P>
We determine whether a candidate pointer is actually the address of
a heap block.  This is done in the following steps (a sketch combining
them appears after the list):
<OL>
<LI> The candidate pointer is checked against rough heap bounds.
These heap bounds are maintained such that all actual heap objects
fall between them.  In order to facilitate black-listing (see below)
we also include address regions that the heap is likely to expand into.
Most non-pointers fail this initial test.
<LI> The candidate pointer is divided into two pieces; the most significant
bits identify a <TT>HBLKSIZE</tt>-sized page in the address space, and
the least significant bits specify an offset within that page.
(A hardware page may actually consist of multiple such pages.
HBLKSIZE is usually the page size divided by a small power of two.)
<LI>
The page address part of the candidate pointer is looked up in a
<A HREF="tree.html">table</a>.
Each table entry contains either 0, indicating that the page is not part
of the garbage collected heap, a small integer <I>n</i>, indicating
that the page is part of a large object, starting at least <I>n</i> pages
back, or a pointer to a descriptor for the page.  In the first case,
the candidate pointer is not a true pointer and can be safely ignored.
In the last two cases, we can obtain a descriptor for the page containing
the beginning of the object.
<LI>
The starting address of the referenced object is computed.
The page descriptor contains the size of the object(s)
in that page, the object kind, and the necessary mark bits for those
objects.  The size information can be used to map the candidate pointer
to the object starting address.  To accelerate this process, the page header
also contains a pointer to a precomputed map of page offsets to displacements
from the beginning of an object.  The use of this map avoids a
potentially slow integer remainder operation in computing the object
start address.
<LI>
The mark bit for the target object is checked and set.  If the object
was previously unmarked, the object is pushed on the mark stack.
The descriptor is read from the page descriptor.  (This is computed
from information in <TT>GC_obj_kinds</tt> when the page is first allocated.)
</ol>
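<P>
Putting the steps together, the check looks roughly like the sketch
below.  <TT>HBLKSIZE</tt> is real; the bounds variables and helper
functions are illustrative stand-ins for the corresponding macros in
the marker's inner loop:
<PRE>
#include &lt;stdint.h&gt;

#define HBLKSIZE 4096   /* illustrative; always a power of 2 */

struct hblkhdr;                              /* page descriptor   */
extern uintptr_t least_addr, greatest_addr;  /* rough heap bounds */
struct hblkhdr *lookup_header(uintptr_t page);  /* table lookup,
                                   following large-object offsets */
void push_if_unmarked(struct hblkhdr *hhdr, uintptr_t addr);

void check_candidate(uintptr_t q)
{
    struct hblkhdr *hhdr;

    /* 1. Rough bounds check; most non-pointers fail here. */
    if (q &lt; least_addr || q >= greatest_addr)
        return;

    /* 2, 3. Split off the page address and look it up. */
    hhdr = lookup_header(q &amp; ~(uintptr_t)(HBLKSIZE - 1));
    if (hhdr == NULL)
        return;          /* not part of the collected heap */

    /* 4, 5. Map the offset to the object start, then check and set
       the mark bit, pushing the object if previously unmarked. */
    push_if_unmarked(hhdr, q);
}
</pre>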
<P>
At the end of the mark phase, mark bits for left-over free lists are cleared,
in case a free list was accidentally marked due to a stray pointer.

<H2>Sweep phase</h2>

At the end of the mark phase, all blocks in the heap are examined.
Unmarked large objects are immediately returned to the large object free list.
Each small object page is checked to see if all mark bits are clear.
If so, the entire page is returned to the large object free list.
Small object pages containing some reachable object are queued for later
sweeping, unless we determine that the page contains very little free
space, in which case it is not examined further.
<P>
This initial sweep pass touches only block headers, not
the blocks themselves.  Thus it does not require significant paging, even
if large sections of the heap are not in physical memory.
<P>
Nonempty small object pages are swept when an allocation attempt
encounters an empty free list for that object size and kind.
Pages for the correct size and kind are repeatedly swept until at
least one empty block is found.  Sweeping such a page involves
scanning the mark bit array in the page header, and building a free
list linked through the first words in the objects themselves.
This does involve touching the appropriate data page, but in most cases
it will be touched only just before it is used for allocation.
Hence any paging is essentially unavoidable.
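<P>
In outline, sweeping one such page looks like the following sketch.
The mark-bit representation and names are simplified relative to the
real reclaim code:
<PRE>
#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

#define HBLKSIZE 4096                /* illustrative       */
#define obj_link(p) (*(void **)(p))  /* link in first word */

int mark_bit(char *page, size_t i);  /* stand-in for a test against
                                        the page header's bit array */

/* Link the unmarked objects of size sz in one page onto old_list,
   returning the extended list. */
void *sweep_page(char *page, size_t sz, void *old_list)
{
    size_t n = HBLKSIZE / sz;
    size_t i;

    for (i = 0; i &lt; n; i++) {
        char *obj = page + i * sz;
        if (!mark_bit(page, i)) {
            /* Unmarked: clear it (free-list objects are kept
               cleared, as described below; pointer-free kinds
               are exempt), then link it in. */
            memset(obj, 0, sz);
            obj_link(obj) = old_list;
            old_list = obj;
        }
    }
    return old_list;
}
</pre>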
<P>
Except in the case of pointer-free objects, we maintain the invariant
that any object in a small object free list is cleared (except possibly
for the link field).  Thus it becomes the burden of the small object
sweep routine to clear objects.  This has the advantage that we can
easily recover from accidentally marking a free list, though that could
also be handled by other means.  The collector currently spends a fair
amount of time clearing objects, and this approach should probably be
revisited.
<P>
In most configurations, we use specialized sweep routines to handle common
small object sizes.  Since we allocate one mark bit per word, it becomes
easier to examine the relevant mark bits if the object size divides
the word length evenly.  We also suitably unroll the inner sweep loop
in each case.  (It is conceivable that profile-based procedure cloning
in the compiler could make this unnecessary and counterproductive.  I
know of no existing compiler to which this applies.)
<P>
The sweeping of small object pages could be avoided completely at the expense
of examining mark bits directly in the allocator.  This would probably
be more expensive, since each allocation call would have to reload
a large amount of state (e.g. next object address to be swept, position
in mark bit table) before it could do its work.  The current scheme
keeps the allocator simple and allows useful optimizations in the sweeper.

<H2>Finalization</h2>
Both <TT>GC_register_disappearing_link</tt> and
<TT>GC_register_finalizer</tt> add the request to a corresponding hash
table.  The hash table is allocated out of collected memory, but
the reference to the finalizable object is hidden from the collector.
Currently finalization requests are processed non-incrementally at the
end of a mark cycle.
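<P>
For reference, a typical registration through the public interface
looks like this (error handling omitted; the last two arguments, which
would receive any previously registered finalizer and its client data,
are simply discarded here):
<PRE>
#include &lt;gc.h&gt;
#include &lt;stdio.h&gt;

struct widget { int id; };

static void widget_finalizer(void *obj, void *client_data)
{
    /* Runs outside the collector, once obj is found unreachable. */
    fprintf(stderr, "finalizing widget %d\n",
            ((struct widget *)obj)->id);
}

struct widget *make_widget(int id)
{
    struct widget *w = GC_malloc(sizeof(struct widget));
    w->id = id;
    GC_register_finalizer(w, widget_finalizer, NULL, NULL, NULL);
    return w;
}
</pre>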
<P>
The collector makes an initial pass over the table of finalizable objects,
pushing the contents of unmarked objects onto the mark stack.
After pushing each object, the marker is invoked to mark all objects
reachable from it.  The object itself is not explicitly marked.
This assures that objects on which a finalizer depends are neither
collected nor finalized.
<P>
If in the process of marking from an object the
object itself becomes marked, we have uncovered
a cycle involving the object.  This usually results in a warning from the
collector.  Such objects are not finalized, since it may be
unsafe to do so.  See the more detailed
<A HREF="http://www.hpl.hp.com/personal/Hans_Boehm/gc/finalization.html"> discussion of finalization semantics</a>.
<P>
Any objects remaining unmarked at the end of this process are added to
a queue of objects whose finalizers can be run.  Depending on collector
configuration, finalizers are dequeued and run either implicitly during
allocation calls, or explicitly in response to a user request.
(Note that the former is unfortunately both the default and not generally safe.
If finalizers perform synchronization, it may result in deadlocks.
Nontrivial finalizers generally need to perform synchronization, and
thus require a different collector configuration.)
<P>
The collector provides a mechanism for replacing the procedure that is
used to mark through objects.  This is used both to provide support for
Java-style unordered finalization, and to ignore certain kinds of cycles,
<I>e.g.</i> those arising from C++ implementations of virtual inheritance.

<H2>Generational Collection and Dirty Bits</h2>
We basically use the concurrent and generational GC algorithm described in
<A HREF="http://www.hpl.hp.com/personal/Hans_Boehm/gc/papers/pldi91.ps.Z">"Mostly Parallel Garbage Collection"</a>,
by Boehm, Demers, and Shenker.
<P>
The most significant modification is that
the collector always starts running in the allocating thread.
There is no separate garbage collector thread.  (If parallel GC is
enabled, helper threads may also be woken up.)
If an allocation attempt either requests a large object, or encounters
an empty small object free list, and notices that there is a collection
in progress, it immediately performs a small amount of marking work
as described above.
<P>
This change was made both because we wanted to easily accommodate
single-threaded environments, and because a separate GC thread requires
very careful control over the scheduler to prevent the mutator from
out-running the collector, and hence provoking unneeded heap growth.
<P>
In incremental mode, the heap is always expanded when we encounter
insufficient space for an allocation.  Garbage collection is triggered
whenever we notice that more than
<TT>GC_heap_size/(2 * GC_free_space_divisor)</tt>
bytes of allocation have taken place.
After <TT>GC_full_freq</tt> minor collections a major collection
is started.
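<P>
From the client's perspective, incremental and generational collection
is requested with a single call, normally made at startup:
<PRE>
#include &lt;gc.h&gt;

int main(void)
{
    GC_INIT();                /* explicit initialization */
    GC_enable_incremental();  /* request incremental/generational mode;
                                 where no cheaper dirty-bit mechanism
                                 exists, all pages are treated as dirty
                                 (DEFAULT_VDB below) */
    /* ... allocate with GC_malloc() and friends as usual ... */
    return 0;
}
</pre>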
<P>
All collections initially run uninterrupted until a predetermined
amount of time (50 msecs by default) has expired.  If this allows
the collection to complete entirely, we can avoid correcting
for data structure modifications during the collection.  If it does
not complete, we return control to the mutator, and perform small
amounts of additional GC work during those later allocations that
cannot be satisfied from small object free lists. When marking completes,
the set of modified pages is retrieved, and we mark once again from
marked objects on those pages, this time with the mutator stopped.
<P>
We keep track of modified pages using one of several distinct mechanisms:
<OL>
<LI>
Through explicit mutator cooperation.  Currently this requires
the use of <TT>GC_malloc_stubborn</tt>, and is rarely used.
<LI>
(<TT>MPROTECT_VDB</tt>) By write-protecting physical pages and
catching write faults.  This is
implemented for many Unix-like systems and for win32.  It is not possible
in a few environments.
<LI>
(<TT>PROC_VDB</tt>) By retrieving dirty bit information from /proc.
(Currently only Sun's
Solaris supports this.  Though this is considerably cleaner, performance
may actually be better with mprotect and signals.)
<LI>
(<TT>PCR_VDB</tt>) By relying on an external dirty bit implementation, in this
case the one in Xerox PCR.
<LI>
(<TT>DEFAULT_VDB</tt>) By treating all pages as dirty.  This is the default if
none of the other techniques is known to be usable, and
<TT>GC_malloc_stubborn</tt> is not used.  Practical only for testing, or if
the vast majority of objects use <TT>GC_malloc_stubborn</tt>.
</ol>

<H2>Black-listing</h2>

The collector implements <I>black-listing</i> of pages, as described
in
<A HREF="http://www.acm.org/pubs/citations/proceedings/pldi/155090/p197-boehm/">
Boehm, ``Space Efficient Conservative Collection'', PLDI '93</a>, also available
<A HREF="papers/pldi93.ps.Z">here</a>.
<P>
During the mark phase, the collector tracks ``near misses'', i.e. attempts
to follow a ``pointer'' to just outside the garbage-collected heap, or
to a currently unallocated page inside the heap.  Pages that have been
the targets of such near misses are likely to be the targets of
misidentified ``pointers'' in the future.  To minimize the future
damage caused by such misidentifications, they will be allocated only to
small pointer-free objects.
<P>
The collector understands two different kinds of black-listing.  A
page may be black listed for interior pointer references
(<TT>GC_add_to_black_list_stack</tt>), if it was the target of a near
miss from a location that requires interior pointer recognition,
<I>e.g.</i> the stack, or the heap if <TT>GC_all_interior_pointers</tt>
is set.  In this case, we also avoid allocating large blocks that include
this page.
<P>
If the near miss came from a source that did not require interior
pointer recognition, it is black-listed with
<TT>GC_add_to_black_list_normal</tt>.
A page black-listed in this way may appear inside a large object,
so long as it is not the first page of a large object.
<P>
The <TT>GC_allochblk</tt> routine respects black-listing when assigning
a block to a particular object kind and size.  It occasionally
drops (i.e. allocates and forgets) blocks that are completely black-listed
in order to avoid excessively long large block free lists containing
only unusable blocks.  This would otherwise become an issue
if there is low demand for small pointer-free objects.

<H2>Thread support</h2>
We support several different threading models.  Unfortunately Pthreads,
the only reasonably well standardized thread model, supports too narrow
an interface for conservative garbage collection.  There appears to be
no completely portable way to allow the collector
to coexist with various Pthreads
implementations.  Hence we currently support only the more
common Pthreads implementations.
<P>
In particular, it is very difficult for the collector to stop all other
threads in the system and examine the register contents.  This is currently
accomplished with very different mechanisms for some Pthreads
implementations.  The Solaris implementation temporarily disables much
of the user-level threads implementation by stopping kernel-level threads
("lwp"s).  The Linux/HPUX/OSF1 and Irix implementations sends signals to
individual Pthreads and has them wait in the signal handler.
<P>
The Linux and Irix implementations use
only documented Pthreads calls, but rely on extensions to their semantics.
The Linux implementation <TT>linux_threads.c</tt> relies on only very
mild extensions to the pthreads semantics, and already supports a large number
of other Unix-like pthreads implementations.  Our goal is to make this the
only pthread support in the collector.
<P>
(The Irix implementation is separate only for historical reasons and should
clearly be merged.  The current Solaris implementation probably performs
better in the uniprocessor case, but does not support thread operations in the
collector.  Hence it cannot support the parallel marker.)
<P>
All implementations must
intercept thread creation and a few other thread-specific calls to allow
enumeration of threads and location of thread stacks.  This is currently
accomplished with <TT># define</tt>'s in <TT>gc.h</tt>
(really <TT>gc_pthread_redirects.h</tt>), or optionally
by using ld's function call wrapping mechanism under Linux.
<P>
Recent versions of the collector support several facilities to enhance
the processor-scalability and thread performance of the collector.
These are discussed in more detail <A HREF="scale.html">here</a>.
We briefly outline the approach to thread-local allocation in the
next section.
<H2>Thread-local allocation</h2>
If thread-local allocation is enabled, the collector keeps separate
arrays of free lists for each thread.  Thread-local allocation
is currently only supported on a few platforms.
<P>
The free list arrays associated
with each thread are only used to satisfy requests for objects that
are  both very small, and belong to one of a small number of well-known
kinds.  These currently include "normal" and pointer-free objects.
Depending on the configuration, "gcj" objects may also be included.
<P>
Thread-local free list entries contain either a pointer to the first
element of a free list, or they contain a counter of the number of
allocation granules, corresponding to objects of this size,
allocated so far.  Initially they contain the
value one, i.e. a small counter value.
<P>
Thread-local allocation goes directly through the global
allocator if the object is of a size or kind not covered by the
local free lists.
<P>
If there is an appropriate local free list, the allocator checks whether it
contains a sufficiently small counter value.  If so, the counter is simply
incremented by the allocation granule count, and the global allocator is used.
In this way, the initial few allocations of a given size bypass the local
allocator.  A thread that only allocates a handful of objects of a given
size will not build up its own free list for that size.  This avoids
wasting space for unpopular object sizes or kinds.
<P>
Once the counter passes a threshold, <TT>GC_malloc_many</tt> is called
to allocate roughly <TT>HBLKSIZE</tt> space and put it on the corresponding
local free list.  Further allocations of that size and kind then use
this free list, and no longer need to acquire the allocation lock.
The allocation procedure is otherwise similar to that for the global free
lists; the local free lists are also linked through the first word in each
object.  Since no lock must be acquired, such allocations usually require
considerably less time.
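<P>
Schematically, one local free list entry is used as follows.
<TT>GC_malloc_many</tt> is real (it returns several objects linked
through their first words); the threshold, the entry encoding details,
and the helper names are illustrative:
<PRE>
#include &lt;stddef.h&gt;

#define COUNT_LIMIT 16               /* illustrative threshold */
#define obj_link(p) (*(void **)(p))

void *GC_malloc_many(size_t bytes);  /* fills a whole free list    */
void *global_alloc(size_t bytes);    /* lock-acquiring global path */

/* entry: either a small counter or a free list pointer. */
void *local_alloc(void **entry, size_t bytes)
{
    void *p = *entry;

    if (p != NULL &amp;&amp; (size_t)p &lt; COUNT_LIMIT) {
        /* Entry still holds a counter: count this request and
           satisfy it through the (lock-taking) global path. */
        *entry = (void *)((size_t)p + 1);
        return global_alloc(bytes);
    }
    if (p == NULL || (size_t)p == COUNT_LIMIT) {
        /* Threshold crossed, or local list exhausted: (re)fill it. */
        p = GC_malloc_many(bytes);
        if (p == NULL)
            return global_alloc(bytes);  /* let it report failure */
    }
    *entry = obj_link(p);  /* pop without taking the allocation lock */
    return p;
}
</pre>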
<P>
Local free lists are treated by most of the rest of the collector
as though they were in-use reachable data.  This requires some care,
since pointer-free objects are not normally traced, and hence a special
tracing procedure is required to mark all objects on pointer-free and
gcj local free lists.
<P>
On thread exit, any remaining thread-local free list entries are
transferred back to the global free list.
<P>
Note that if the collector is configured for thread-local allocation,
GC versions before 7 do not invoke the thread-local allocator by default.
<TT>GC_malloc</tt> only uses thread-local allocation in version 7 and later.
<P>
For some more details see <A HREF="scale.html">here</a>, and the
technical report entitled
<A HREF="http://www.hpl.hp.com/techreports/2000/HPL-2000-165.html">
``Fast Multiprocessor Memory Allocation and Garbage Collection''
</a>.
<P>
<HR>
<P>
Comments are appreciated.  Please send mail to
<A HREF="mailto:boehm@acm.org"><TT>boehm@acm.org</tt></a> or
<A HREF="mailto:Hans.Boehm@hp.com"><TT>Hans.Boehm@hp.com</tt></a>
<P>
This is a modified copy of a page written while the author was at SGI.
The original was <A HREF="http://reality.sgi.com/boehm/gcdescr.html">here</a>.
</body>
</html>