============================
Erasure Code developer notes
============================

Introduction
------------

Each chapter of this document explains an aspect of the erasure code
implementation within Ceph, mostly through worked examples that
demonstrate how things work. It is written as if the implementation
were complete, although that may not yet be the case. For instance the
plugin system and the jerasure plugin are implemented but the erasure
coded pool is not.

Reading and writing encoded chunks from and to OSDs
---------------------------------------------------

An erasure coded pool stores each object as K+M chunks: K data chunks
and M coding chunks. The pool is configured to have a size of K+M so
that each chunk is stored in an OSD in the acting set. The rank of the
chunk is stored as `an attribute of the object
<http://tracker.ceph.com/issues/5862>`_.

For instance an erasure coded pool is created to use five OSDs ( K+M =
5 ) and sustain the loss of two of them ( M = 2 ).

When the object *NYAN* containing *ABCDEFGHI* is written to it, the
erasure encoding function splits the content into three data chunks,
simply by dividing the content in three : the first contains *ABC*,
the second *DEF* and the last *GHI*. The content will be padded if the
content length is not a multiple of K. The function also creates two
coding chunks : the fourth with *YXY* and the fifth with *QGC*. Each
chunk is stored in an OSD in the acting set. The chunks are stored in
objects that have the same name ( *NYAN* ) but reside on different
OSDs. The order in which the chunks were created must be preserved and
is stored as an attribute of the object ( shard_t ), in addition to its
name. Chunk *1* contains *ABC* and is stored on *OSD5* while chunk *4*
contains *YXY* and is stored on *OSD3*.

::
 
                             +-------------------+
                        name |        NYAN       |
                             +-------------------+
                     content |      ABCDEFGHI    |
                             +--------+----------+
                                      |
                                      |
                                      v
                               +------+------+
               +---------------+ encode(3,2) +-----------+
               |               +--+--+---+---+           |
               |                  |  |   |               |
               |          +-------+  |   +-----+         |
               |          |          |         |         |
            +--v---+   +--v---+   +--v---+  +--v---+  +--v---+
      name  | NYAN |   | NYAN |   | NYAN |  | NYAN |  | NYAN |
            +------+   +------+   +------+  +------+  +------+
     shard  |  1   |   |  2   |   |  3   |  |  4   |  |  5   |
            +------+   +------+   +------+  +------+  +------+
   content  | ABC  |   | DEF  |   | GHI  |  | YXY  |  | QGC  |
            +--+---+   +--+---+   +--+---+  +--+---+  +--+---+
               |          |          |         |         |
               |          |          |         |         |
               |          |       +--+---+     |         |
               |          |       | OSD1 |     |         |
               |          |       +------+     |         |
               |          |       +------+     |         |
               |          +------>| OSD2 |     |         |
               |                  +------+     |         |
               |                  +------+     |         |
               |                  | OSD3 |<----+         |
               |                  +------+               |
               |                  +------+               |
               |                  | OSD4 |<--------------+
               |                  +------+
               |                  +------+
               +----------------->| OSD5 |
                                  +------+




When the object *NYAN* is read from the erasure coded pool, the
decoding function reads three chunks : chunk *1* containing *ABC*,
chunk *3* containing *GHI* and chunk *4* containing *YXY*, and rebuilds
the original content of the object, *ABCDEFGHI*. The decoding function
is informed that chunks *2* and *5* are missing ( they are called
*erasures* ). Chunk *5* could not be read because *OSD4* is
*out*. The decoding function can be called as soon as three chunks are
read : *OSD2* was the slowest and its chunk was not taken into
account.

::
 
                             +-------------------+
                        name |        NYAN       |
                             +-------------------+
                     content |      ABCDEFGHI    |
                             +--------+----------+
                                      ^
                                      |
                                      |
                               +------+------+
                               | decode(3,2) |
                               | erasures 2,5|
               +-------------->|             |
               |               +-------------+
               |                     ^   ^
               |                     |   +-----+
               |                     |         |
            +--+---+   +------+   +--+---+  +--+---+
      name  | NYAN |   | NYAN |   | NYAN |  | NYAN |
            +------+   +------+   +------+  +------+
     shard  |  1   |   |  2   |   |  3   |  |  4   |
            +------+   +------+   +------+  +------+
   content  | ABC  |   | DEF  |   | GHI  |  | YXY  |
            +--+---+   +--+---+   +--+---+  +--+---+
               ^          .          ^         ^
               |    TOO   .          |         |
               |    SLOW  .       +--+---+     |
               |          ^       | OSD1 |     |
               |          |       +------+     |
               |          |       +------+     |
               |          +-------| OSD2 |     |
               |                  +------+     |
               |                  +------+     |
               |                  | OSD3 |-----+
               |                  +------+
               |                  +------+
               |                  | OSD4 | OUT
               |                  +------+
               |                  +------+
               +------------------| OSD5 |
                                  +------+

Interrupted full writes
-----------------------

In an erasure coded pool the primary OSD in the up set receives all
write operations. It is responsible for encoding the payload into K+M
chunks and sending them to the other OSDs. It is also responsible
for maintaining an authoritative version of the placement group logs.

::
 
     primary
   +---OSD 1---+
   |       log |
   |           |
   |+----+     |
   ||D1v1| 1,1 |
   |+----+     |
   +-----------+
               +---OSD 2---+
               |+----+ log |
               ||D2v1| 1,1 |
               |+----+     |
               +-----------+
               +---OSD 3---+
               |       log |
               |           |
               |+----+     |
               ||C1v1| 1,1 |
               |+----+     |
               +-----------+

An erasure coded placement group has been created with K = 2 + M = 1
and is supported by three OSDs, two for K and one for M. The acting
set of the placement group is made of *OSD 1*, *OSD 2* and *OSD 3*. An
object has been encoded and stored in the OSDs : the chunk D1v1
(i.e. Data chunk number 1 version 1) is on *OSD 1*, D2v1 on *OSD 2*
and C1v1 (i.e. Coding chunk number 1 version 1) on *OSD 3*. The
placement group logs on each OSD are identical (i.e. 1,1).  

::
 
     primary
   +---OSD 1---+
   |+----+ log |
   ||D1v2| 1,2 |<----------------- WRITE FULL
   |+----+     |
   |+----+     |
   ||D1v1| 1,1 |
   |+----+     |
   +++---------+
    ||         +---OSD 2---+
    ||  +----+ |+----+ log |
    |+-->D2v2| ||D2v1| 1,1 |
    |   +----+ |+----+     |
    |          +-----------+
    |          +---OSD 3---+
    |          |+----+ log |
    +---------->|C1v2| 1,2 |
               |+----+     |
               |+----+     |
               ||C1v1| 1,1 |
               |+----+     |
               +-----------+

*OSD 1* is the primary and receives a WRITE FULL from a client, which
means the payload is to replace the object entirely instead of
overwriting a portion of it. Version two of the object is created to
override version one. *OSD 1* encodes the payload into three chunks :
D1v2 (i.e. Data chunk number 1 version 2) will be on *OSD 1*, D2v2 on
*OSD 2* and C1v2 (i.e. Coding chunk number 1 version 2) on *OSD
3*. Each chunk is sent to the target OSD, including the primary OSD
which is responsible for storing chunks in addition to handling write
operations and maintaining an authoritative version of the placement
group logs. When an OSD receives the message instructing it to write
the chunk, it also creates a new entry in the placement group logs to
reflect the change. For instance, as soon as *OSD 3* stores *C1v2*, it
adds the entry 1,2 ( i.e. epoch 1, version 2 ) to its logs. Because
the OSDs work asynchronously, some chunks may still be in flight (
such as *D2v2* ) while others are acknowledged and on disk ( such as
*C1v1* and *D1v1* ).

::
 
     primary
   +---OSD 1---+
   |+----+ log |
   ||D1v2| 1,2 |<----------------- WRITE FULL
   |+----+     |
   |+----+     |
   ||D1v1| 1,1 |
   |+----+     |
   +++---------+
    ||         +---OSD 2---+
    ||         |+----+ log |
    |+--------->|D2v2| 1,2 |
    |          |+----+     |
    |          |+----+     |
    |          ||D2v1| 1,1 |
    |          |+----+     |
    |          +-----------+
    |          +---OSD 3---+
    |          |+----+ log |
    +---------->|C1v2| 1,2 |
               |+----+     |
               |+----+     |
               ||C1v1| 1,1 |
               |+----+     |
               +-----------+

If all goes well, the chunks are acknowledged on each OSD in the
acting set and the logs' *last_complete* pointer can move from
*1,1* to *1,2* and the files used to store the chunks of the previous
version of the object can be removed : *D1v1* on *OSD 1*, *D2v1* on
*OSD 2* and *C1v1* on *OSD 3*.

::
 
               +---OSD 1---+
               |           |
               |   DOWN    |
               |           |
               +-----------+
               +---OSD 2---+
               |+----+ log |
               ||D2v1| 1,1 |
               |+----+     |
               +-----------+
               +---OSD 3---+
               |+----+ log |
               ||C1v2| 1,2 |
               |+----+     |
               |+----+     |
               ||C1v1| 1,1 |
               |+----+     |
    primary    +-----------+
  +---OSD 4---+
  |       log |
  |       1,1 |
  |           |
  +-----------+

But accidents happen. If *OSD 1* goes down while *D2v2* is still in
flight, the object's version 2 is partially written : *OSD 3* has
one chunk but that is not enough to recover. Two chunks were lost (
*D1v2* and *D2v2* ) and the erasure coding parameters K = 2 + M = 1
require that at least two chunks are available to rebuild the
third. *OSD 4* becomes the new primary and finds that the
*last_complete* log entry ( i.e. all objects before this entry were
known to be available on all OSDs in the previous acting set ) is
*1,1* and that will be the head of the new authoritative log.

::
 
               +---OSD 2---+
               |+----+ log |
               ||D2v1| 1,1 |
               |+----+     |
               +-----------+
               +---OSD 3---+
               |+----+ log |
               ||C1v1| 1,1 |
               |+----+     |
    primary    +-----------+
  +---OSD 4---+
  |       log |
  |       1,1 |
  |           |
  +-----------+

The log entry *1,2* found on *OSD 3* is divergent from the new
authoritative log provided by *OSD 4* : it is discarded and the file
containing the *C1v2* chunk is removed.

::
 
               +---OSD 2---+
               |+----+ log |
               ||D2v1| 1,1 |
               |+----+     |
               +-----------+
               +---OSD 3---+
               |+----+ log |
               ||C1v1| 1,1 |
               |+----+     |
    primary    +-----------+
  +---OSD 4---+
  |+----+ log |
  ||D1v1| 1,1 |
  |+----+     |
  +-----------+

The *D1v1* chunk is rebuilt with the *decode* function of the erasure
coding library during scrubbing and stored on the new primary *OSD 4*.

Interrupted append
------------------

An object is coded in stripes, either because it is too big or because
it is created with multiple write operations instead of a single full
write. When appending to an existing object, the stripe size is
retrieved from the attributes of the object. It applies, for instance,
when *rgw* writes an object with a sequence of appends instead of a
single full write.

::
 
     primary
   +---OSD 1---+
   |+-s1-+ log |
   ||S1D1| 1,2 |<----------------- APPEND
   ||----|     |
   ||S2D1| 1,1 |
   |+----+     |
   +++---------+
    ||         +---OSD 2---+
    ||  +-s2-+ |+-s2-+ log |
    |+-->S2D2| ||S1D2| 1,1 |
    |   +----+ |+----+     |
    |          +-----------+
    |          +---OSD 3---+
    |          |+-s3-+ log |
    +---------->|S1C1| 1,2 |
               ||----|     |
               ||S2C1| 1,1 |
               |+----+     |
               +-----------+

*OSD 1* is the primary and receives an APPEND from a client, meaning
the payload is to be appended to the end of the object. *OSD 1*
encodes the payload into three chunks : S2D1 (i.e. Stripe two data
chunk number 1 ) will be in s1 ( shard 1 ) on *OSD 1*, S2D2 in s2 on
*OSD 2* and S2C1 (i.e. Stripe two coding chunk number 1 ) in s3 on
*OSD 3*. Each chunk is sent to the target OSD, including the primary
OSD which is responsible for storing chunks in addition to handling
write operations and maintaining an authoritative version of the
placement group logs. When an OSD receives the message instructing it
to write the chunk, it also creates a new entry in the placement group
logs to reflect the change. For instance, as soon as *OSD 3* stores
*S2C1*, it adds the entry 1,2 ( i.e. epoch 1, version 2 ) to its
logs. The log entry also carries the nature of the operation: in this
case 1,2 is an APPEND where 1,1 was a CREATE. Because the OSDs work
asynchronously, some chunks may still be in flight ( such as *S2D2* )
while others are acknowledged and on disk (such as *S2D1* and *S2C1*).

::
 
               +---OSD 1---+
               |           |
               |   DOWN    |
               |           |
               +-----------+
               +---OSD 2---+
               |+-s2-+ log |
               ||S1D2| 1,1 |
               |+----+     |
               +-----------+
               +---OSD 3---+
               |+-s3-+ log |
               ||S1C1| 1,2 |
               ||----|     |
               ||S2C1| 1,1 |
               |+----+     |
    primary    +-----------+
  +---OSD 4---+
  |       log |
  |       1,1 |
  |           |
  +-----------+

If *OSD 1* goes down while *S2D2* is still in flight, the payload is
partially appended : s3 ( shard 3 ) on *OSD 3* has one chunk but that
is not enough to recover. Two chunks were lost ( *S2D1* and *S2D2* ) and
the erasure coding parameters K = 2 + M = 1 require that at least two
chunks are available to rebuild the third. *OSD 4* becomes the new
primary and finds that the *last_complete* log entry ( i.e. all
objects before this entry were known to be available on all OSDs in
the previous acting set ) is *1,1* and will be the head of the new
authoritative log.

::
 
               +---OSD 2---+
               |+-s2-+ log |
               ||S1D2| 1,1 |
               |+----+     |
               +-----------+
               +---OSD 3---+
               |+-s3-+ log |
               ||S1C1| 1,1 |
               |+----+     |
    primary    +-----------+
  +---OSD 4---+
  |       log |
  |       1,1 |
  |           |
  +-----------+

The log entry *1,2* found on *OSD 3* is divergent from the new
authoritative log provided by *OSD 4* : it is discarded and the file
containing the *S2C1* chunk is truncated to the nearest multiple of
the stripe size.

Erasure code library
--------------------

Using `Reed-Solomon <https://en.wikipedia.org/wiki/Reed_Solomon>`_,
with parameters K+M, object O is encoded by dividing it into K data
chunks O1, O2, ... OK and computing M coding chunks P1, P2, ... PM.
Any K chunks out of the available K+M chunks can be used to obtain the
original object. If data chunk O2 or coding chunk P2 is lost, it can
be repaired using any K chunks out of the K+M chunks. If more than M
chunks are lost, it is not possible to recover the object.

Reading the original content of object O can be a simple
concatenation of O1, O2, ... OK, because the plugins are using
`systematic codes
<http://en.wikipedia.org/wiki/Systematic_code>`_. Otherwise the chunks
must be given to the erasure code library *decode* method to retrieve
the content of the object.

Reed-Solomon is significantly more expensive to encode than fountain
codes with the current `jerasure implementation
<http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627.html>`_. However
`gf-complete
<http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703.html>`_, which
will be used in the upcoming version of jerasure, is twice as fast and
the difference becomes negligible. The gap is more pronounced when an
object is divided into hundreds of chunks or more, but Ceph will
typically be used with fewer than 32 chunks.

Performance depends on the parameters given to the encoding functions
and is also influenced by the packet sizes used when calling them
( for Cauchy or Liberation, for instance ): smaller packets
mean more calls and more overhead.

Although Reed-Solomon is provided as a default, Ceph uses it via an
`abstract API <https://github.com/ceph/ceph/blob/08a97ae45f4df58a6a8ea8a6400934d860cf5eb4/src/osd/ErasureCodeInterface.h>`_ designed to
allow each pool to choose the plugin that implements it using
`key=value pairs when creating the pool
<https://github.com/ceph/ceph/blob/08a97ae45f4df58a6a8ea8a6400934d860cf5eb4/src/mon/MonCommands.h#L483>`_.

::
 
  ceph osd pool create <pool> \
     erasure-code-directory=<dir> \
     erasure-code-plugin=<plugin>

The *<plugin>* is dynamically loaded from *<dir>* (defaults to
*/usr/lib/ceph/erasure-code* ) and expected to implement the *int
__erasure_code_init(char *plugin_name)* function which is responsible
for registering an object derived from *ErasureCodePlugin* in the
registry. The `ErasureCodePluginExample <https://github.com/ceph/ceph/blob/08a97ae45f4df58a6a8ea8a6400934d860cf5eb4/src/test/osd/ErasureCodePluginExample.cc#L32>`_ plugin reads:

::
 
  ErasureCodePluginRegistry &instance = 
                             ErasureCodePluginRegistry::instance();
  instance.add(plugin_name, new ErasureCodePluginExample());

The *ErasureCodePlugin* derived object must provide a factory method
from which the concrete implementation of the *ErasureCodeInterface*
object can be generated. The `ErasureCodePluginExample plugin <https://github.com/ceph/ceph/blob/08a97ae45f4df58a6a8ea8a6400934d860cf5eb4/src/test/osd/ErasureCodePluginExample.cc#L22>`_ reads:

::
 
  virtual int factory(const map<std::string,std::string> &parameters,
                      ErasureCodeInterfaceRef *erasure_code) {
    *erasure_code = ErasureCodeInterfaceRef(new ErasureCodeExample(parameters));
    return 0;
  } 

The *parameters* argument is the list of *key=value* pairs that were
set when the pool was created. Each *key* must be prefixed with
*erasure-code* to avoid name collisions:

::
 
  ceph osd pool create poolname 123 \
     erasure-code-directory=<dir>         \ # mandatory
     erasure-code-plugin=jerasure         \ # mandatory
     erasure-code-m=10                    \ # optional and plugin dependent
     erasure-code-k=3                     \ # optional and plugin dependent
     erasure-code-technique=reed_sol_van    # optional and plugin dependent

Scrubbing
---------

See also `Refactor scrub to use PGBackend methods <http://tracker.ceph.com/issues/5861>`_.

The simplest form of scrubbing is to check with each OSD holding a
chunk whether it exists locally. If more than M chunks are missing the
object is marked as lost. If up to M chunks are missing they are
repaired and written to the relevant OSDs.

From time to time it may make sense to attempt to read an object,
using all of its chunks. If the decode function fails, the object is
lost.

Bit flips happen. Not often, but they are possible. Here is `an article
from 2011 <http://www.linux-mag.com/id/8794/>`_; also search for "bit
rot" and "bit error rate". To detect corrupted chunks, a checksum
(CRC32C for instance) must be added as an attribute of the file
containing the chunk ( or shard ) so that deep scrubbing can check
that the chunk is valid by recomputing the checksum of the chunk's
content and comparing it with the stored signature. BTRFS and ZFS have
a CRC32C check built in on a per block basis.

Notes
-----

If the objects are large, it may be impractical to encode and decode
them in memory. However, when using *RBD*, a 1TB device is divided
into many individual 4MB objects, and *RGW* does the same.

Encoding and decoding are implemented in the OSD. Although they could
be implemented client side for reads and writes, the OSD must be able
to encode and decode on its own when scrubbing.