Usage:

0) If possible, do this on a multiprocessor, especially if you are planning
on modifying or enhancing the package.  It will work on a uniprocessor, but
there the tests are much more likely to pass even in the presence of serious
problems.

1) Type ./configure --prefix=<install dir>; make; make check
in the directory containing unpacked source.  The usual GNU build machinery
is used, except that only static, but position-independent, libraries
are normally built.  On Windows, follow README_win32.txt to use
src/Makefile.msft instead of the above sequence.

Alternatively, the libraries can be built with CMake, even on Windows,
like this:
> mkdir out
> cd out
> cmake -Dbuild_tests=ON ..
> cmake --build . --config Release
> ctest --build-config Release

The available options in the CMake script to customize the build are roughly
the same as those in the configure script; please see the CMakeLists.txt file
for the exact option list.

2) Applications should include atomic_ops.h.  Nearly all operations
are implemented by header files included from it.  It is sometimes
necessary, and always recommended, to also link against libatomic_ops.a.
To use the almost non-blocking stack or malloc implementations,
see the corresponding README files, and also link against libatomic_ops_gpl.a
before linking against libatomic_ops.a.
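
For illustration, a minimal client might look like this (a sketch; the
operations used are described under OPERATIONS below, and the availability
of any particular operation is platform-dependent):

#include <atomic_ops.h>

static volatile AO_t counter = 0;

int main(void)
{
  /* Atomic increment with a full memory barrier around it. */
  (void)AO_fetch_and_add1_full(&counter);
  return AO_load(&counter) == 1 ? 0 : 1;
}

Compile and link with, e.g., "cc example.c -latomic_ops".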

OVERVIEW:
Atomic_ops.h defines a large collection of operations, each one of which is
a combination of an (optional) atomic memory operation, and a memory barrier.
It also defines associated feature-test macros to determine whether a particular
operation is available on the current target hardware (either directly or
by synthesis).  This is an attempt to replace various existing files with
similar goals, since they usually do not handle differences in memory
barrier styles with sufficient generality.

If this is included after defining AO_REQUIRE_CAS, then the package makes
an attempt to emulate [fetch_]compare_and_swap* (single-width) in a way that,
at least on Linux, should still be async-signal-safe.  As a result, most
other atomic operations may then be defined using the compare-and-swap
emulation.  This emulation is slow, since it needs to disable signals,
and it needs to block in case of contention.  If you care about performance
on a platform that can't directly provide compare-and-swap, there are
probably better alternatives.  But this allows easy ports to some such
platforms (e.g. PA-RISC).  The option is ignored if compare-and-swap
can be implemented directly.
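
For example (note that the macro must be defined before atomic_ops.h is
first included):

#define AO_REQUIRE_CAS
#include <atomic_ops.h>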

If atomic_ops.h is included after defining AO_USE_PTHREAD_DEFS, then all
atomic operations will be emulated with pthread locking.  This is NOT
async-signal-safe.  And it is slow.  It is intended primarily for debugging
of the atomic_ops package itself.

Note that the implementation reflects our understanding of real processor
behavior.  This occasionally diverges from the documented behavior.  (E.g.
the documented X86 behavior seems to be weak enough that it is impractical
to use.  Current real implementations appear to be much better behaved.)
We of course are in no position to guarantee that future processors
(even HPs) will continue to behave this way, though we hope they will.

Corrections/additions for other platforms are greatly appreciated.

OPERATIONS:

Most operations handle values of type AO_t, which are unsigned integers
whose size matches that of pointers on the given architecture.  Additionally,
on most supported architectures the operations are also implemented to handle
smaller integer types; such operations are indicated by the appropriate size
prefix:
- char_... Operates on unsigned char values;
- short_... Operates on unsigned short values;
- int_... Operates on unsigned int values.
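
For instance, a sketch of the char variant of fetch_and_add (availability
is platform-dependent; the corresponding AO_HAVE_ feature-test macros are
described below):

void bump_byte(volatile unsigned char *p)
{
# ifdef AO_HAVE_char_fetch_and_add
    (void)AO_char_fetch_and_add(p, 1);  /* returns the old value */
# endif
}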

The notable exception is AO_test_and_set operating only on AO_TS_t, which is
whatever size the hardware supports with good performance.  In some cases this
is the length of a cache line, in some other cases it is a byte.  In many
cases AO_TS_t is equivalent to AO_t.

The defined operations are all of the form AO_[<size>]<op><barrier>(<args>).

The <op> component specifies an atomic memory operation.  It may be
one of the following, where the corresponding argument and result types
are also specified:

void nop()
        No atomic operation.  The barrier may still be useful.
AO_t load(const volatile AO_t * addr)
        Atomic load of *addr.
void store(volatile AO_t * addr, AO_t new_val)
        Atomically store new_val to *addr.
AO_t fetch_and_add(volatile AO_t *addr, AO_t incr)
        Atomically add incr to *addr, and return the original value of *addr.
AO_t fetch_and_add1(volatile AO_t *addr)
        Equivalent to AO_fetch_and_add(addr, 1).
AO_t fetch_and_sub1(volatile AO_t *addr)
        Equivalent to AO_fetch_and_add(addr, (AO_t)(-1)).
void and(volatile AO_t *addr, AO_t value)
        Atomically 'and' value into *addr.
void or(volatile AO_t *addr, AO_t value)
        Atomically 'or' value into *addr.
void xor(volatile AO_t *addr, AO_t value)
        Atomically 'xor' value into *addr.
int compare_and_swap(volatile AO_t * addr, AO_t old_val, AO_t new_val)
        Atomically compare *addr to old_val, and replace *addr by new_val
        if the first comparison succeeds; returns nonzero if the comparison
        succeeded and *addr was updated; cannot fail spuriously.
AO_t fetch_compare_and_swap(volatile AO_t * addr, AO_t old_val, AO_t new_val)
        Atomically compare *addr to old_val, and replace *addr by new_val
        if the first comparison succeeds; returns the original value of *addr;
        cannot fail spuriously.
AO_TS_VAL_t test_and_set(volatile AO_TS_t * addr)
        Atomically read the binary value at *addr, and set it.  AO_TS_VAL_t
        is an enumeration type which includes two values AO_TS_SET and
        AO_TS_CLEAR.  An AO_TS_t location is capable of holding an
        AO_TS_VAL_t, but may be much larger, as dictated by hardware
        constraints.  Test_and_set logically sets the value to AO_TS_SET.
        It may be reset to AO_TS_CLEAR with the AO_CLEAR(AO_TS_t *) macro.
        AO_TS_t locations should be initialized to AO_TS_INITIALIZER.
        The values of AO_TS_SET and AO_TS_CLEAR are hardware dependent.
        (On PA-RISC, AO_TS_SET is zero!)
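
As an illustration of the compare-and-swap idiom, here is a sketch of an
atomic "store maximum" built from fetch_compare_and_swap (the plain,
unordered variant is assumed to be available; barrier suffixes are
described under ORDERING CONSTRAINTS below):

void store_max(volatile AO_t *loc, AO_t v)
{
  AO_t old;

  do {
    old = AO_load(loc);
    if (old >= v) return;  /* already at least v, nothing to do */
    /* Retry if some other thread changed *loc in the meantime. */
  } while (AO_fetch_compare_and_swap(loc, old, v) != old);
}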

Test_and_set is a more limited version of compare_and_swap.  Its only
advantage is that it is more easily implementable on some hardware.  It
should thus be used if only binary test-and-set functionality is needed.
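
A sketch of the classic spin lock built from these primitives (the
_acquire suffix is explained under ORDERING CONSTRAINTS below, and its
availability is platform-dependent):

static AO_TS_t lock = AO_TS_INITIALIZER;

void acquire_lock(void)
{
  /* Spin until we are the thread that changes the location */
  /* from AO_TS_CLEAR to AO_TS_SET.                          */
  while (AO_test_and_set_acquire(&lock) == AO_TS_SET) {
    /* busy wait */
  }
}

void release_lock(void)
{
  AO_CLEAR(&lock);  /* reset the location to AO_TS_CLEAR */
}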

If available, we also provide compare_and_swap operations that operate
on wider values.  Since standard data types for double width values
may not be available, these explicitly take pairs of arguments for the
new and/or old value.  Unfortunately, there are two common variants,
neither of which can easily and efficiently emulate the other.
The first performs a comparison against the entire value being replaced,
whereas the second performs a double-width replacement but only
a single-width comparison:

int compare_double_and_swap_double(volatile AO_double_t * addr,
                                   AO_t old_val1, AO_t old_val2,
                                   AO_t new_val1, AO_t new_val2);

int compare_and_swap_double(volatile AO_double_t * addr,
                            AO_t old_val1,
                            AO_t new_val1, AO_t new_val2);

where AO_double_t is a structure containing AO_val1 and AO_val2 fields,
both of type AO_t.  For compare_and_swap_double, we compare against
the val1 field.  AO_double_t exists only if AO_HAVE_double_t
is defined.  If this type is available then the following operation is
provided for convenience, fully equivalent to compare_double_and_swap_double:

int double_compare_and_swap(volatile AO_double_t * addr,
                            AO_double_t old_val, AO_double_t new_val)

Please note that AO_double_t (and AO_stack_t) variables should be properly
aligned (8-byte alignment on 32-bit targets, 16-byte alignment on 64-bit
ones); otherwise the behavior of a double-wide atomic primitive may be
undefined (or an assertion violation may occur) if such a misaligned
variable is passed (by reference) to the primitive.  Global and static
variables should already be properly aligned automatically, but automatic
variables (i.e. those located on the stack) might be misaligned because
the stack itself might only be word-aligned (e.g. 4-byte stack alignment
is the default for x86).  Luckily, stack-allocated AO variables that are
operated on atomically are rare in practice.
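
A sketch of the double-width operation, guarded by its feature-test macro
(the pair below is static, so it should already be suitably aligned, as
noted above):

#ifdef AO_HAVE_double_compare_and_swap
static volatile AO_double_t pair;

int update_pair(AO_t o1, AO_t o2, AO_t n1, AO_t n2)
{
  AO_double_t old_val, new_val;

  old_val.AO_val1 = o1;
  old_val.AO_val2 = o2;
  new_val.AO_val1 = n1;
  new_val.AO_val2 = n2;
  /* Returns nonzero if pair was atomically replaced. */
  return AO_double_compare_and_swap(&pair, old_val, new_val);
}
#endif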

ORDERING CONSTRAINTS:

Each operation name also includes a suffix that specifies the associated
ordering semantics.  The ordering constraint limits reordering of this
operation with respect to other atomic operations and ordinary memory
references.  The current implementation assumes that all memory references
are to ordinary cacheable memory; the ordering guarantee is with respect
to other threads or processes, not I/O devices.  (Whether or not this
distinction is important is platform-dependent.)

Ordering suffixes are one of the following:

<none>: No memory barrier.  A plain AO_nop() really does nothing.
_release: Earlier operations must become visible to other threads
          before the atomic operation.
_acquire: Later operations must become visible after this operation.
_read: Subsequent reads must become visible after reads included in
       the atomic operation or preceding it.  Rarely useful for clients?
_write: Earlier writes become visible before writes during or after
        the atomic operation.  Rarely useful for clients?
_full: The associated operation is ordered with respect to both earlier and
       later memory ops.  If the associated operation is nop, then this orders
       all earlier memory operations with respect to subsequent ones.
       AO_store_full or AO_nop_full are the normal ways to force a store
       to be ordered with respect to a later load.
_release_write: Ordered with respect to earlier writes.  This is
                normally implemented as either a _write or _release
                barrier.
_acquire_read: Ordered with respect to later reads.  This is
               normally implemented as either a _read or _acquire barrier.
_dd_acquire_read: Ordered with respect to later reads that are data
               dependent on this one.  This is needed on
               a pointer read, which is later dereferenced to read a
               second value, with the expectation that the second
               read is ordered after the first one.  On most architectures,
               this is equivalent to no barrier.  (This is very
               hard to define precisely.  It should probably be avoided.
               A major problem is that optimizers tend to try to
               eliminate dependencies from the generated code, since
               dependencies force the hardware to execute the code
               serially.)

We assume that if a store is data-dependent on a previous load, then
the two are always implicitly ordered.
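
For instance, the store-followed-by-load pattern mentioned for _full above
can be sketched in Dekker style (the flag array and thread numbering are
made up for illustration):

static volatile AO_t interested[2];

void announce(int me)  /* me is 0 or 1 */
{
  /* Store our flag, then order the store before the later load. */
  AO_store_full(&interested[me], 1);
  if (AO_load(&interested[1 - me]) == 0) {
    /* The other thread has not announced yet; when it does, it */
    /* is guaranteed to observe our flag.                       */
  }
}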

It is possible to test whether AO_[<size>]<op><barrier> is available on the
target platform by checking whether AO_HAVE_[<size>]<op><barrier> is defined
as a macro.
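
For example (a sketch of a guarded use; the counter and the fallback are
just placeholders):

static volatile AO_t counter;

void bump(void)
{
# ifdef AO_HAVE_fetch_and_add1_full
    (void)AO_fetch_and_add1_full(&counter);
# else
    /* e.g. fall back to a loop around fetch_compare_and_swap */
# endif
}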

Note that we generally don't implement operations that are either
meaningless (e.g. AO_nop_acquire, AO_nop_release) or which appear to
have no clear use (e.g. AO_load_release, AO_store_acquire, AO_load_write,
AO_store_read).  On some platforms (e.g. PA-RISC) many operations
will remain undefined unless AO_REQUIRE_CAS is defined before including
the package.

When run in the package build directory, the following command
will print operations that are unimplemented on the platform:

make test_atomic; ./test_atomic

The following command generates a file "list_atomic.i" containing the
macro expansions of all implemented operations on the platform:

make list_atomic.i

Known issues include:

We should be more precise in defining the semantics of the ordering
constraints, and if and how we can guarantee sequential consistency.

Dd_acquire_read is very hard or impossible to define in a way that cannot
be invalidated by reasonably standard compiler transformations.

Example:

If you want to initialize an object, and then "publish" a pointer to it
in a global location p, such that other threads reading the new value of
p are guaranteed to see an initialized object, it suffices to use
AO_store_release_write(&p, ...) to write the pointer to the object, and to
retrieve it in other threads with AO_load_acquire_read(&p).
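
A sketch of this publication pattern (struct obj and make_obj are made-up
names for illustration):

struct obj;
struct obj *make_obj(void);  /* fully initializes the object */

static volatile AO_t p;      /* the published pointer, as an AO_t */

void publisher(void)
{
  struct obj *q = make_obj();
  AO_store_release_write(&p, (AO_t)q);  /* initialize, then publish */
}

void consumer(void)
{
  struct obj *r = (struct obj *)AO_load_acquire_read(&p);
  if (r != 0) {
    /* r points to a fully initialized object */
  }
}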

Platform notes:

All X86: We quietly assume 486 or better.

Gcc on x86:
Define AO_USE_PENTIUM4_INSTRS to use the Pentium 4 mfence instruction.
Currently this appears to be of marginal benefit.