summaryrefslogtreecommitdiff
path: root/storage/ndb/src/kernel/blocks/dbtup/Notes.txt
blob: 9d47c591fe8c689d81cb500be12d29d4282c79aa (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
Operations, tuples, versions
============================

Operation types.

INSERT		insert new original tuple, or insert after delete
UPDATE		update
DELETE		delete

Following need not be considered here.

READ		does not change tuples or versions
WRITE		turns into INSERT or UPDATE in LQH

We use more specific names in some cases:

first/INSERT	initial insert of new tuple
delete/INSERT	INSERT preceded by DELETE
DELETE/last	DELETE as last operation
DELETE/insert	DELETE followed by INSERT

Tuple + op 	Can be followed by
--------------	------------------
does not exist	first/INSERT
tuple exists	UPDATE DELETE
INSERT		UPDATE DELETE
UPDATE		UPDATE DELETE
DELETE		delete/INSERT

Operations on same tuple are kept in doubly linked list until
commit or abort.  The links at both ends are RNIL i.e. the list
is not circular.  The links are:

nextActiveOp	the operation BEFORE this one, in event order
prevActiveOp	the operation AFTER this one, in event order

Operations are done on the "original tuple" i.e. the tuple is
modified in place.  If an operation is about to write over data
in original tuple, it first copies the tuple to a "copy tuple".

Operation	Copy tuple
---------	----------
first/INSERT	no
delete/INSERT	yes (this is in effect an update)
UPDATE		yes
DELETE		no

The operation points to the tuples via:

realPageId	page i-value of original tuple
pageOffset	word offset of original tuple on the page
realPageIdC	page i-value of copy tuple or RNIL is no copy exists
pageOffsetC	word offset of copy tuple on the page

The original tuple and the copy tuple (if any) point back to
the operation via word 0.  In copy tuple this pointer is never
changed.  In original tuple however it always points to the LATEST
existing operation i.e. the one with prevActiveOp == RNIL.
Thus word 0 of original tuple is changed on 2 occasions:

- when a new operation is added to the list
- when commit or abort removes the latest operation

Note that commit/abort of operations occurs in random order.
The list is adjusted accordingly.

Versions
--------

Tuple version is stored in tuple word 1.  A new original tuple
gets version 0.  The version is incremented by each new operation
which makes a copy tuple.  Version number wraps around at 15 bits.

When a copy tuple is made, the version in original tuple is copied
to copy tuple as part of tuple data.  This takes place before
the version in original tuple is updated.

Each operation record contains tuple version called tupVersion.

- at insert of new original tuple, tupVersion is set to 0

- if tuple already exists, the FIRST operation (in event order)
  reads tupVersion from tuple word 1.  If the operation is
  not DELETE, the version is incremented

- subsequent operation reads tupVersion from the operation
  BEFORE it (nextActiveOp).  If this subsequent operation is
  not DELETE, the version is incremented

When the operation writes the tuple it sets word 1 to tupVersion.
In detail, per operation type, where INSERT is divided into
insert of new original tuple and insert after delete:

Operation	Copy	Increment	Set version in original
---------	----	---------	-----------------------
first/INSERT	no	no		yes, to 0
delete/INSERT	yes	yes		yes
UPDATE		yes	yes		yes
DELETE		no	no		no

Thus an existing version is incremented if and only if
a copy tuple is made.

Ordered index maintenance
-------------------------

Each index entry has logical tuple address and tuple version.
Index entries are added during prepare phase (when each operation
is executed) and removed during commit or abort phase.

Access to correct tuple version (original or copy) is required
in TUX which reads index key values 1) to check that at least one
is not null 2) to do tree search 3) to set min/max prefixes.
See "Read attributes" below.

An additional complication is that commit/abort of operations
arrives in random order.  So we cannot check for, for example,
DELETE/insert by looking at prevActiveOp.

Phase	Op		Action	Version in
-----	--		------	----------
prepare	INSERT		add	op and original
prepare	UPDATE		add	op and original
prepare	DELETE		none	-

commit	first/INSERT	none	-
commit	delete/INSERT	remove	copy tuple	1)
commit	UPDATE		remove	copy tuple	1)
commit	DELETE/last	remove	op and original
commit	DELETE/insert	none	-

abort	INSERT		remove	op
abort	UPDATE		remove	op
abort	DELETE		none	-

1) alternatively, store prevTupVersion in operation record.

Read attributes, query status
-----------------------------

TUP_READ_ATTRS signal (or equivalent direct call) reads attribute
values.  Input is logical address of original tuple and tuple
version.  The steps are:

- Translate logical address to physical address of original tuple.

- If version of original tuple in word 1 is right, stop.

- Otherwise word 0 points to LATEST not yet deleted operation.
  Walk through operation list via nextActiveOp.

- If an operation on the list has realPageIdC == RNIL, skip it.

- Otherwise find copy tuple via realPageIdC, pageOffsetC.
  If the version of the copy tuple in word 1 is right, stop.

- Call readAttributes() on the tuple found (original or copy).

In short, the version must exist in some not yet deleted tuple,
either in original or in some copy.

Note that this must work during all phases since index code
needs to read index key attributes from correct tuple version in
each add/remove operation.

TUP_QUERY_TH signal (or equivalent direct call) does same search
for tuple version.  It is called from index scan and returns info
used to decide if the scan can see the tuple.

This signal may also be called during any phase since commit/abort
of all operations is not done in one time-slice.

Commit and abort
----------------

[ hairy stuff ]

Problems
--------

Current abort code can destroy a tuple version too early.  This
happens in test case "ticuur" (insert-commit-update-update-rollback),
if abort of first update arrives before abort of second update.