src/backend/storage/buffer/README


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249

$PostgreSQL: pgsql/src/backend/storage/buffer/README,v 1.6 2003/11/29 19:51:56 pgsql Exp $

Notes about shared buffer access rules
--------------------------------------

There are two separate access control mechanisms for shared disk buffers:
reference counts (a/k/a pin counts) and buffer locks.  (Actually, there's
a third level of access control: one must hold the appropriate kind of
lock on a relation before one can legally access any page belonging to
the relation.  Relation-level locks are not discussed here.)

Pins: one must "hold a pin on" a buffer (increment its reference count)
before being allowed to do anything at all with it.  An unpinned buffer is
subject to being reclaimed and reused for a different page at any instant,
so touching it is unsafe.  Typically a pin is acquired via ReadBuffer and
released via WriteBuffer (if one modified the page) or ReleaseBuffer (if not).
It is OK and indeed common for a single backend to pin a page more than
once concurrently; the buffer manager handles this efficiently.  It is
considered OK to hold a pin for long intervals --- for example, sequential
scans hold a pin on the current page until done processing all the tuples
on the page, which could be quite a while if the scan is the outer scan of
a join.  Similarly, btree index scans hold a pin on the current index page.
This is OK because normal operations never wait for a page's pin count to
drop to zero.  (Anything that might need to do such a wait is instead
handled by waiting to obtain the relation-level lock, which is why you'd
better hold one first.)  Pins may not be held across transaction
boundaries, however.

Buffer locks: there are two kinds of buffer locks, shared and exclusive,
which act just as you'd expect: multiple backends can hold shared locks on
the same buffer, but an exclusive lock prevents anyone else from holding
either shared or exclusive lock.  (These can alternatively be called READ
and WRITE locks.)  These locks are intended to be short-term: they should not
be held for long.  Buffer locks are acquired and released by LockBuffer().
It will *not* work for a single backend to try to acquire multiple locks on
the same buffer.  One must pin a buffer before trying to lock it.

Buffer access rules:

1. To scan a page for tuples, one must hold a pin and either shared or
exclusive lock.  To examine the commit status (XIDs and status bits) of
a tuple in a shared buffer, one must likewise hold a pin and either shared
or exclusive lock.

2. Once one has determined that a tuple is interesting (visible to the
current transaction) one may drop the buffer lock, yet continue to access
the tuple's data for as long as one holds the buffer pin.  This is what is
typically done by heap scans, since the tuple returned by heap_fetch
contains a pointer to tuple data in the shared buffer.  Therefore the
tuple cannot go away while the pin is held (see rule #5).  Its state could
change, but that is assumed not to matter after the initial determination
of visibility is made.

3. To add a tuple or change the xmin/xmax fields of an existing tuple,
one must hold a pin and an exclusive lock on the containing buffer.
This ensures that no one else might see a partially-updated state of the
tuple.

4. It is considered OK to update tuple commit status bits (ie, OR the
values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
pin on a buffer.  This is OK because another backend looking at the tuple
at about the same time would OR the same bits into the field, so there
is little or no risk of conflicting update; what's more, if there did
manage to be a conflict it would merely mean that one bit-update would
be lost and need to be done again later.  These four bits are only hints
(they cache the results of transaction status lookups in pg_clog), so no
great harm is done if they get reset to zero by conflicting updates.

5. To physically remove a tuple or compact free space on a page, one
must hold a pin and an exclusive lock, *and* observe while holding the
exclusive lock that the buffer's shared reference count is one (ie,
no other backend holds a pin).  If these conditions are met then no other
backend can perform a page scan until the exclusive lock is dropped, and
no other backend can be holding a reference to an existing tuple that it
might expect to examine again.  Note that another backend might pin the
buffer (increment the refcount) while one is performing the cleanup, but
it won't be able to actually examine the page until it acquires shared
or exclusive lock.


VACUUM FULL ignores rule #5, because it instead acquires exclusive lock at
the relation level, which ensures indirectly that no one else is accessing
pages of the relation at all.

Plain (concurrent) VACUUM must respect rule #5 fully.  Obtaining the
necessary lock is done by the bufmgr routine LockBufferForCleanup().
It first gets an exclusive lock and then checks to see if the shared pin
count is currently 1.  If not, it releases the exclusive lock (but not the
caller's pin) and waits until signaled by another backend, whereupon it
tries again.  The signal will occur when UnpinBuffer decrements the shared
pin count to 1.  As indicated above, this operation might have to wait a
good while before it acquires lock, but that shouldn't matter much for
concurrent VACUUM.  The current implementation only supports a single
waiter for pin-count-1 on any particular shared buffer.  This is enough
for VACUUM's use, since we don't allow multiple VACUUMs concurrently on a
single relation anyway.


Buffer replacement strategy interface:

The two files freelist.c and buf_table.c contain the buffer cache
replacement strategy. The interface to the strategy is:

    BufferDesc *
	StrategyBufferLookup(BufferTag *tagPtr, bool recheck)

		This is allways the first call made by the buffer manager
		to check if a disk page is in memory. If so, the function
		returns the buffer descriptor and no further action is
		required.

		If the page is not in memory, StrategyBufferLookup()
		returns NULL.

		The flag recheck tells the strategy that this is a second
		lookup after flushing a dirty block. If the buffer manager
		has to evict another buffer, he will release the bufmgr lock
		while doing the write IO. During this time, another backend
		could possibly fault in the same page this backend is after,
		so we have to check again after the IO is done if the page
		is in memory now.

	BufferDesc *
	StrategyGetBuffer(void)

		The buffer manager calls this function to get an unpinned
		cache buffer who's content can be evicted. The returned
		buffer might be empty, clean or dirty.

		The returned buffer is only a cadidate for replacement.
		It is possible that while the buffer is written, another
		backend finds and modifies it, so that it is dirty again.
		The buffer manager will then call StrategyGetBuffer()
		again to ask for another candidate.

	void
	StrategyReplaceBuffer(BufferDesc *buf, Relation rnode, 
			BlockNumber blockNum)
		
		Called by the buffer manager at the time it is about to
		change the association of a buffer with a disk page.

		Before this call, StrategyBufferLookup() still has to find
		the buffer even if it was returned by StrategyGetBuffer()
		as a candidate for replacement.

		After this call, this buffer must be returned for a
		lookup of the new page identified by rnode and blockNum.

	void
	StrategyInvalidateBuffer(BufferDesc *buf)

		Called from various parts to inform that the content of
		this buffer has been thrown away. This happens for example
		in the case of dropping a relation.

		The buffer must be clean and unpinned on call.

		If the buffer associated with a disk page, StrategyBufferLookup()
		must not return it for this page after the call.

	void
	StrategyHintVacuum(bool vacuum_active)

		Because vacuum reads all relations of the entire database
		through the buffer manager, it can greatly disturb the
		buffer replacement strategy. This function is used by vacuum
		to inform that all subsequent buffer lookups are caused
		by vacuum scanning relations.

		
Buffer replacement strategy:

The buffer replacement strategy actually used in freelist.c is a
version of the Adaptive Replacement Cache (ARC) special tailored for
PostgreSQL.

The algorithm works as follows:

    C is the size of the cache in number of pages (conf: shared_buffers)
	ARC uses 2*C Cache Directory Blocks (CDB). A cache directory block
	is allwayt associated with one unique file page and "can" point to
	one shared buffer.

	All file pages known in by the directory are managed in 4 LRU lists
	named B1, T1, T2 and B2. The T1 and T2 lists are the "real" cache
	entries, linking a file page to a memory buffer where the page is
	currently cached. Consequently T1len+T2len <= C. B1 and B2 are
	ghost cache directories that extend T1 and T2 so that the strategy
	remembers pages longer. The strategy tries to keep B1len+T1len and
	B2len+T2len both at C. T1len and T2 len vary over the runtime
	depending on the lookup pattern and its resulting cache hits. The
	desired size of T1len is called T1target.

	Assuming we have a full cache, one of 5 cases happens on a lookup:

	MISS	On a cache miss, depending on T1target and the actual T1len
			the LRU buffer of T1 or T2 is evicted. Its CDB is removed
			from the T list and added as MRU of the corresponding B list.
			The now free buffer is replaced with the requested page
			and added as MRU of T1.

	T1 hit	The T1 CDB is moved to the MRU position of the T2 list.

	T2 hit	The T2 CDB is moved to the MRU position of the T2 list.

	B1 hit	This means that a buffer that was evicted from the T1
			list is now requested again, indicating that T1target is
			too small (otherwise it would still be in T1 and thus in
			memory). The strategy raises T1target, evicts a buffer
			depending on T1target and T1len and places the CDB at
			MRU of T2.

	B2 hit	This means the opposite of B1, the T2 list is probably too
			small. So the strategy lowers T1target, evicts a buffer
			and places the CDB at MRU of T2.

	Thus, every page that is found on lookup in any of the four lists
	ends up as the MRU of the T2 list. The T2 list therefore is the
	"frequency" cache, holding frequently requested pages.

	Every page that is seen for the first time ends up as the MRU of
	the T1 list. The T1 list is the "recency" cache, holding recent
	newcomers.

	The tailoring done for PostgreSQL has to do with the way, the
	query executor works. A typical UPDATE or DELETE first scans the 
	relation, searching for the tuples and then calls heap_update() or
	heap_delete(). This causes at least 2 lookups for the block in the
	same statement. In the case of multiple matches in one block even
	more often. As a result, every block touched in an UPDATE or DELETE
	would directly jump into the T2 cache, which is wrong. To prevent
	this the strategy remembers which transaction added a buffer to the
	T1 list and will not promote it from there into the T2 cache during
	the same transaction.
	
	Another specialty is the change of the strategy during VACUUM.
	Lookups during VACUUM do not represent application needs, so it
	would be wrong to change the cache balance T1target due to that
	or to cause massive cache evictions. Therefore, a page read in to
	satisfy vacuum (not those that actually cause a hit on any list)
	is placed at the LRU position of the T1 list, for immediate
	reuse. Since Vacuum usually requests many pages very fast, the
	natural side effect of this is that it will get back the very
	buffers it filled and possibly modified on the next call and will
	therefore do it's work in a few shared memory buffers, while using
	whatever it finds in the cache already.