# Xen Live Patching Design v2

## Rationale

A mechanism is required to binary patch the running hypervisor with new
opcodes, primarily to deliver security updates.

This document describes the design of the API that allows us to upload
binary patches to the hypervisor.

The document is split into four sections:

 * Detailed descriptions of the problem statement.
 * Design of the data structures.
 * Design of the hypercalls.
 * Implementation notes that should be taken into consideration.


## Glossary

 * splice - patch in the binary code with new opcodes
 * trampoline - a jump to a new instruction.
 * payload - telemetries of the old code along with binary blob of the new
   function (if needed).
 * reloc - telemetries contained in the payload to construct proper trampoline.
 * hook - an auxiliary function being called before, during or after payload
          application or revert.
 * quiescing zone - period when all CPUs are lock-step with each other.

## History

The document has undergone various reviews and only covers the v1 design.

The end of the document has a section titled `Not Yet Done` which
outlines ideas and design for the future version of this work.

## Multiple ways to patch

The mechanism needs to be flexible to patch the hypervisor in multiple ways
and be as simple as possible. The compiled code is contiguous in memory with
no gaps - so we have no luxury of 'moving' existing code and must either
insert a trampoline to the new code to be executed - or only modify in-place
the code if there is sufficient space. The placement of new code has to be done
by hypervisor and the virtual address for the new code is allocated dynamically.

This implies that the hypervisor must compute the new offsets when splicing
in the new trampoline code. Where the trampoline is added (inside
the function we are patching or just the callers?) is also important.

To lessen the amount of code in the hypervisor, the consumer of the API
is responsible for identifying which mechanism to employ and how many locations
to patch. Combinations of modifying code in place, adding trampolines, etc.
have to be supported. The API should allow reading and writing any memory
within the hypervisor virtual address space.

We must also have a mechanism to query what has been applied and a mechanism
to revert it if needed.

## Workflow

The expected workflows of higher-level tools that manage multiple patches
on production machines would be:

 * The first obvious task is loading all available / suggested
   hotpatches when they are available.
 * Whenever new hotpatches are installed, they should be loaded too.
 * One wants to query which modules have been loaded at runtime.
 * If unloading is deemed safe (see unloading below), one may want to
   support a workflow where a specific hotpatch is marked as bad and
   unloaded.

## Patching code

The first mechanism that comes to mind is in-place replacement, that is,
replacing the affected code with new code. Unfortunately the x86 ISA has
variable-size instructions, which limits how much space we have available
to replace them. That is not a problem if the change is smaller than the
original opcode, as we can fill the remainder with NOPs. Problems appear
if the replacement code is longer.
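
The in-place case can be sketched as follows. This is only an illustration
under stated assumptions: it targets x86, where the single-byte NOP is `0x90`,
it operates on an ordinary byte buffer rather than live hypervisor text, and
the helper name `splice_in_place` is hypothetical and not part of Xen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define X86_NOP 0x90 /* single-byte x86 NOP */

/*
 * Hypothetical helper: overwrite `old_len` bytes at `site` with `new_code`
 * and pad whatever is left with NOPs. Returns 0 on success, or -1 when the
 * replacement is longer than the space it has to fit in.
 */
int splice_in_place(uint8_t *site, size_t old_len,
                    const uint8_t *new_code, size_t new_len)
{
    if (new_len > old_len)
        return -1; /* does not fit; another mechanism is needed */

    memcpy(site, new_code, new_len);
    memset(site + new_len, X86_NOP, old_len - new_len);
    return 0;
}
```

A replacement that is longer than the original is rejected, which is the case
the other mechanisms described in this section are meant to handle.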

The second mechanism is to replace the target of each call or jump to the
old function with the address of the new function.

A third mechanism is to add a jump to the new function at the
start of the old function. N.B. The Xen hypervisor implements the third
mechanism. See `Trampoline (e9 opcode)` section for more details.

### Example of trampoline and in-place splicing

As an example we will assume the hypervisor does not have XSA-132 (see
[domctl/sysctl: don't leak hypervisor stack to toolstacks](https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=4ff3449f0e9d175ceb9551d3f2aecb59273f639d))
and we would like to binary patch the hypervisor with it. The original code
looks like this:

    48 89 e0                  mov    %rsp,%rax
    48 25 00 80 ff ff         and    $0xffffffffffff8000,%rax

while the new patched hypervisor would be:

    48 c7 45 b8 00 00 00 00   movq   $0x0,-0x48(%rbp)
    48 c7 45 c0 00 00 00 00   movq   $0x0,-0x40(%rbp)
    48 c7 45 c8 00 00 00 00   movq   $0x0,-0x38(%rbp)
    48 89 e0                  mov    %rsp,%rax
    48 25 00 80 ff ff         and    $0xffffffffffff8000,%rax

This is inside `arch_do_domctl`. The change adds 24 extra bytes of code
(three 8-byte `movq` instructions), which alters all the offsets inside the
function. We might not have enough space in .text to adjust these offsets
and squeeze in the extra 24 bytes of code.

As such we could simplify this problem by only patching the site
which calls `arch_do_domctl`:

    do_domctl:
    e8 4b b1 05 00          callq  ffff82d08015fbb9 <arch_do_domctl>

with a new address for where the new `arch_do_domctl` would be (this
area would be allocated dynamically).
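
A minimal sketch of the offset computation for such a call site follows. The
function name `encode_call_rel32` is hypothetical. The call-site address used
in the usage note is not stated in this document; it is derived from the
listing above as the target minus the displacement minus the 5-byte
instruction length.

```c
#include <assert.h>
#include <stdint.h>

#define X86_CALL_REL32     0xe8 /* opcode of `call rel32` */
#define X86_CALL_REL32_LEN 5    /* opcode byte + 4 displacement bytes */

/*
 * Encode `call target` for an instruction located at address `site`.
 * The displacement is relative to the end of the 5-byte instruction.
 * Returns 0 on success, -1 if the target is out of rel32 range.
 */
int encode_call_rel32(uint8_t insn[X86_CALL_REL32_LEN],
                      uint64_t site, uint64_t target)
{
    int64_t disp = (int64_t)(target - (site + X86_CALL_REL32_LEN));

    if (disp < INT32_MIN || disp > INT32_MAX)
        return -1;

    insn[0] = X86_CALL_REL32;
    /* x86 stores the displacement in little-endian byte order. */
    for (int i = 0; i < 4; i++)
        insn[1 + i] = (uint8_t)((uint32_t)disp >> (8 * i));
    return 0;
}
```

With `site = 0xffff82d080104a69` and `target = 0xffff82d08015fbb9` this
reproduces the `e8 4b b1 05 00` bytes shown above. Pointing the call at a
dynamically allocated replacement only changes `target`, provided the new
code lies within the signed 32-bit range of the call site.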

Astute readers will wonder what we need to do if we were to patch `do_domctl`,
which is not called directly by the hypervisor but on behalf of the guests via
the `compat_hypercall_table` and `hypercall_table`. Patching the offset in
`hypercall_table` for `do_domctl`:

    ffff82d08024d490:   79 30
    ffff82d08024d492:   10 80 d0 82 ff ff

with the address of the new `do_domctl` is possible. The other
place where it is used is in `hvm_hypercall64_table` which would need
to be patched in a similar way. This would require an in-place splicing
of the new virtual address of `arch_do_domctl`.

In summary this example patched the callers of the affected function by:

 * Allocating memory for the new code to live in.
 * Changing the virtual address in all the functions which called the old
   code (computing the new offset, patching the callq with a new callq).
 * Changing the function pointer tables with the new virtual address of
   the function (splicing in the new virtual address). Since this table
   resides in the .rodata section we would need to temporarily change the
   page table permissions during this part.

However it has drawbacks: the safety checks, which have to make sure
the function is not on the stack, must also check every caller. For some
patches this could mean, if there were a sufficiently large number of
callers, that we would never be able to apply the update.

Having the patching done at predetermined instances where the stacks
are not deep mostly solves this problem.

### Example of different trampoline patching.

An alternative mechanism exists where we can insert a trampoline in the
existing function to be patched to jump directly to the new code. This
lessens the locations to be patched to one but it puts pressure on the
CPU branching logic (I-cache, but it is just one unconditional jump).

For this example we will assume that the hypervisor has not been compiled with
XSA-125 (see
[pre-fill structures for certain HYPERVISOR_xen_version sub-ops](https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=fe2e079f642effb3d24a6e1a7096ef26e691d93e))
which mem-sets a structure in the `xen_version` hypercall. This function is not
called **anywhere** in the hypervisor (it is called by the guest) but
referenced in the `compat_hypercall_table` and `hypercall_table` (and
indirectly called from that). Patching the offset in `hypercall_table` for the
old `do_xen_version`:

    ffff82d08024b270 <hypercall_table>:
    ...
    ffff82d08024b2f8:   9e 2f 11 80 d0 82 ff ff

with the address of the new `do_xen_version` is possible. The other
place where it is used is in `hvm_hypercall64_table` which would need
to be patched in a similar way. This would require an in-place splicing
of the new virtual address of `do_xen_version`.
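
The table entry shown above is simply the function's virtual address stored in
little-endian byte order. A small sketch of reading and splicing such a slot,
assuming it is handled as eight raw bytes (the helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Read a 64-bit little-endian value, e.g. one function-pointer table slot. */
uint64_t load_le64(const uint8_t bytes[8])
{
    uint64_t value = 0;

    for (int i = 7; i >= 0; i--)
        value = (value << 8) | bytes[i];
    return value;
}

/* Write a 64-bit value as little-endian bytes, e.g. a new function address. */
void store_le64(uint8_t bytes[8], uint64_t value)
{
    for (int i = 0; i < 8; i++)
        bytes[i] = (uint8_t)(value >> (8 * i));
}
```

Decoding `9e 2f 11 80 d0 82 ff ff` yields `0xffff82d080112f9e`, which is the
address of the old `do_xen_version` in the disassembly in this section.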

An alternative solution would be to insert a trampoline in the
old `do_xen_version` function to directly jump to the new `do_xen_version`:

    ffff82d080112f9e do_xen_version:
    ffff82d080112f9e:       48 c7 c0 da ff ff ff    mov    $0xffffffffffffffda,%rax
    ffff82d080112fa5:       83 ff 09                cmp    $0x9,%edi
    ffff82d080112fa8:       0f 87 24 05 00 00       ja     ffff82d0801134d2 ; do_xen_version+0x534

with:

    ffff82d080112f9e do_xen_version:
    ffff82d080112f9e:       e9 XX YY ZZ QQ          jmpq   [new do_xen_version]

which would lessen the amount of patching to just one location.

In summary this example patched the affected function to jump to the
new replacement function which required:

 * Allocating memory for the new code to live in,
 * Inserting trampoline with new offset in the old function to point to the
   new function.
 * Optionally we can insert in the old function a trampoline jump to a function
   providing a BUG_ON to catch errant code.

The disadvantage of this is that the unconditional jump incurs a small
I-cache penalty. However the simplicity of the patching and higher chance
of passing safety checks make this a worthwhile option.

This patching has a similar drawback as inline patching - the safety
checks have to make sure the function is not on the stack. However
since we are replacing at a higher level (a full function as opposed
to various offsets within functions) the checks are simpler.

Having the patching done at predetermined instances where the stacks
are not deep mostly solves this problem as well.

### Security

With this method we can re-write the hypervisor - and as such we **MUST** be
diligent in only allowing certain guests to perform this operation.

Furthermore with SecureBoot or tboot, we **MUST** also verify the signature
of the payload to be certain it came from a trusted source and integrity
was intact.

As such the hypercall **MUST** support an XSM policy to limit what the guest
is allowed to invoke. If the system is booted with signature checking the
signature checking will be enforced.

## Design of payload format

The payload **MUST** contain enough data to allow us to apply the update
and also safely reverse it. As such we **MUST** know:

 * The locations in memory to be patched. This can be determined dynamically
   via symbols or via virtual addresses.
 * The new code that will be patched in.

The payload could be constructed using a custom binary format, but that
approach has severe disadvantages:

 * The format might need to change, so we would need a mechanism to
   accommodate that.
 * It has to be platform agnostic.
 * It should be easily constructed using existing tools.

As such having the payload in an ELF file is the sensible way. We would be
carrying the various sets of structures (and data) in the ELF sections under
different names and with definitions.

Note that every structure has padding. This is added so that the hypervisor
can re-use those fields as it sees fit.

An earlier design ineptly attempted to explain the relations of the ELF
sections to each other without using the proper ELF mechanisms (sh_info,
sh_link, data structures using Elf types, etc). This design explains the
structures and how they are used together and does not dig into the ELF
format, except to mention that the section names should match the structure
names.

The Xen Live Patch payload is a relocatable ELF binary. A typical binary would have:

 * One or more .text sections.
 * Zero or more read-only data sections.
 * Zero or more data sections.
 * Relocations for each of these sections.

It may also have some architecture-specific sections. For example:

 * Alternatives instructions.
 * Bug frames.
 * Exception tables.
 * Relocations for each of these sections.

The Xen Live Patch core code loads the payload as a standard ELF binary, relocates it
and handles the architecture-specific sections as needed. This process is much
like what the Linux kernel module loader does.

The payload contains at least three sections:

 * `.livepatch.funcs` - which is an array of livepatch_func structures,
   and/or any of:
 * `.livepatch.hooks.{preapply,postapply,prerevert,postrevert}`
 * `.livepatch.hooks.{apply,revert}`
   - each of which contains a pointer to a hook function.

 * `.livepatch.xen_depends` - which is an ELF Note that describes what Xen
    build-id the payload depends on. **MUST** have one.
 * `.livepatch.depends` - which is an ELF Note that describes what the payload
    depends on. **MUST** have one.
 * `.note.gnu.build-id` - the build-id of this payload. **MUST** have one.

### .livepatch.funcs

The `.livepatch.funcs` contains an array of livepatch_func structures
which describe the functions to be patched:

    struct livepatch_func {
        const char *name;
        void *new_addr;
        void *old_addr;
        uint32_t new_size;
        uint32_t old_size;
        uint8_t version;
        uint8_t opaque[31];
        /* Added to livepatch payload version 2: */
        uint8_t applied;
        uint8_t _pad[7];
        livepatch_expectation_t expect;
    };

The size of the structure is 104 bytes on 64-bit hypervisors. It will be
92 on 32-bit hypervisors.
The version 2 of the payload adds additional 8 bytes to the structure size.

 * `name` is the symbol name of the old function. Only used if `old_addr` is
   zero, otherwise will be used during dynamic linking (when hypervisor loads
   the payload).
 * `old_addr` is the address of the function to be patched and is filled in at
   payload generation time if the hypervisor function address is known. If
   unknown, the value **MUST** be zero and the hypervisor will attempt to
   resolve the address.
 * `new_addr` can either have a non-zero value or be zero.
   * If there is a non-zero value, then it is the address of the function that
    is replacing the old function and the address is recomputed during
    relocation.  The value **MUST** be the address of the new function in the
    payload file.
   * If the value is zero, then `new_size` bytes at the `old_addr` location
    are NOPed out.
 * `old_size` contains the size of the `old_addr` function in bytes. The
    value of `old_size` **MUST NOT** be zero.
 * `new_size` depends on what `new_addr` contains:
   * If `new_addr` contains a non-zero value, then `new_size` has the size of
    the new function (which will replace the one at `old_addr`) in bytes.
   * If the value of `new_addr` is zero then `new_size` determines how many
    instruction bytes to NOP (up to opaque size modulo smallest platform
    instruction - 1 byte x86 and 4 bytes on ARM).
 * `version` indicates version of the generated payload.
 * `opaque` **MUST** be zero.

The version 2 of the payload adds the following fields to the structure:

  * `applied` tracks the function's applied/reverted state. It is a boolean,
    either LIVEPATCH_FUNC_NOT_APPLIED or LIVEPATCH_FUNC_APPLIED.
  * `_pad[7]` adds padding to align to 8 bytes.
  * `expect` is an optional structure containing expected to-be-replaced data
    (mostly for inline asm patching). The `expect` structure format is:

    struct livepatch_expectation {
        uint8_t enabled : 1;
        uint8_t len : 5;
        uint8_t rsv: 2;
        uint8_t data[LIVEPATCH_OPAQUE_SIZE]; /* Same size as opaque[] buffer of
                                            struct livepatch_func. This is the
                                            max number of bytes to be patched */
    };
    typedef struct livepatch_expectation livepatch_expectation_t;

    * `enabled` turns on the expectation check for the given function.
      The default state is disabled.
    * `len` specifies the number of valid bytes in the `data` array. 5 bits
      can encode values from 0 to 31, which covers the array size.
    * `rsv` reserved bitfields. **MUST** be zero.
    * `data` contains expected bytes of content to be replaced. Same size as
      `opaque` buffer of `struct livepatch_func` (max number of bytes to be
      patched).
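
How such an expectation might be evaluated before patching can be sketched as
below. This is an assumption-laden illustration, not the hypervisor's code:
`LIVEPATCH_OPAQUE_SIZE` is taken to be 31 to match `opaque[31]`, the function
name `expectation_met` is hypothetical, and treating a non-zero `rsv` as a
mismatch is a choice made for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LIVEPATCH_OPAQUE_SIZE 31 /* assumed: same as opaque[31] */

struct livepatch_expectation {
    uint8_t enabled : 1;
    uint8_t len : 5;
    uint8_t rsv : 2;
    uint8_t data[LIVEPATCH_OPAQUE_SIZE];
};

/*
 * Hypothetical check: returns 1 when patching may proceed, 0 when the bytes
 * currently at the patch site differ from what the payload expects.
 */
int expectation_met(const struct livepatch_expectation *exp,
                    const uint8_t *old_code)
{
    if (!exp->enabled)
        return 1; /* check disabled (the default): nothing to verify */
    if (exp->rsv != 0)
        return 0; /* reserved bits must be zero */
    return memcmp(old_code, exp->data, exp->len) == 0;
}
```

Only the first `len` bytes of `data` take part in the comparison, so unused
trailing bytes of the array do not matter.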

The size of the `livepatch_func` array is determined from the ELF section
size.
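
As a rough illustration, the entry count can be derived by dividing the
section size by the structure size. The sketch below restates the structures
from this section so that it is self-contained; `livepatch_func_count` is a
hypothetical helper, and the compile-time check assumes a compiler that packs
the three `uint8_t` bit-fields into one byte (as GCC and Clang do).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LIVEPATCH_OPAQUE_SIZE 31

struct livepatch_expectation {
    uint8_t enabled : 1;
    uint8_t len : 5;
    uint8_t rsv : 2;
    uint8_t data[LIVEPATCH_OPAQUE_SIZE];
};
typedef struct livepatch_expectation livepatch_expectation_t;

struct livepatch_func {
    const char *name;
    void *new_addr;
    void *old_addr;
    uint32_t new_size;
    uint32_t old_size;
    uint8_t version;
    uint8_t opaque[LIVEPATCH_OPAQUE_SIZE];
    uint8_t applied;
    uint8_t _pad[7];
    livepatch_expectation_t expect;
};

/* Layout check for builds with 8-byte pointers (104 bytes expected). */
_Static_assert(sizeof(void *) != 8 || sizeof(struct livepatch_func) == 104,
               "unexpected livepatch_func size on a 64-bit build");

/*
 * Number of entries in a .livepatch.funcs section, or -1 if the section
 * size is not a whole multiple of the structure size.
 */
long livepatch_func_count(size_t section_size)
{
    if (section_size % sizeof(struct livepatch_func) != 0)
        return -1;
    return (long)(section_size / sizeof(struct livepatch_func));
}
```

Rejecting a section whose size is not a whole multiple of the structure size
is a cheap way to catch a payload built against a different structure layout.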

When applying the patch the hypervisor iterates over each `livepatch_func`
structure and the core code inserts a trampoline at `old_addr` to `new_addr`.
The `new_addr` is altered when the ELF payload is loaded.

When reverting a patch, the hypervisor iterates over each `livepatch_func`
and the core code copies the data from the undo buffer (private internal copy)
to `old_addr`.
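
The apply and revert steps for a single function can be sketched on ordinary
byte buffers. Everything here is a simplification under stated assumptions:
`struct patch_site` is a hypothetical stand-in for `livepatch_func` plus its
undo buffer, the trampoline is the x86 `e9` jump with a 32-bit displacement,
and none of the quiescing or permission handling of the real hypervisor is
modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define JMP_REL32_LEN 5 /* e9 opcode + 32-bit displacement */

/* Hypothetical stand-in for a livepatch_func entry and its undo buffer. */
struct patch_site {
    uint8_t *old_addr;           /* start of the function being replaced */
    const uint8_t *new_addr;     /* start of the replacement function */
    uint8_t undo[JMP_REL32_LEN]; /* original bytes, saved for revert */
    int applied;
};

/* Save the original bytes, then write `jmp new_addr` at old_addr. */
int apply_site(struct patch_site *s)
{
    int64_t disp = (intptr_t)((uintptr_t)s->new_addr -
                              ((uintptr_t)s->old_addr + JMP_REL32_LEN));

    if (s->applied || disp < INT32_MIN || disp > INT32_MAX)
        return -1;

    memcpy(s->undo, s->old_addr, JMP_REL32_LEN);
    s->old_addr[0] = 0xe9;
    for (int i = 0; i < 4; i++)
        s->old_addr[1 + i] = (uint8_t)((uint32_t)disp >> (8 * i));
    s->applied = 1;
    return 0;
}

/* Copy the saved bytes back over the trampoline. */
int revert_site(struct patch_site *s)
{
    if (!s->applied)
        return -1;
    memcpy(s->old_addr, s->undo, JMP_REL32_LEN);
    s->applied = 0;
    return 0;
}
```

Because only the first five bytes of the old function are overwritten, the
undo buffer needs to hold just those bytes to restore the original code.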

The payload may optionally contain the addresses of hooks to be called right
before being applied and after being reverted (while all CPUs are still in the
quiescing zone). These hooks do not have access to the payload structure.

 * `.livepatch.hooks.load` - an array of function pointers.
 * `.livepatch.hooks.unload` - an array of function pointers.

It may also optionally contain the addresses of pre- and post- vetoing hooks to
be called before (pre) or after (post) the apply and revert payload actions
(while all CPUs are already released from the quiescing zone). These hooks do
have access to the payload structure. The pre-apply hook can prevent the
payload from being loaded if a condition encoded in the hook is not met.
Likewise, the pre-revert hook can prevent the livepatch from being unloaded if
a condition encoded in the hook is not met.

 * `.livepatch.hooks.{preapply,postapply}`
 * `.livepatch.hooks.{prerevert,postrevert}`
   - which are a pointer to a single hook function pointer.

Finally, it optionally may also contain the address of apply or revert action
hooks to be called instead of the default apply and revert payload actions
(while all CPUs are kept in quiescing zone). These hooks do have access to
payload structure.

 * `.livepatch.hooks.{apply,revert}`
   - which are a pointer to a single hook function pointer.

### Example of .livepatch.funcs

A simple example of what a payload file can be:

    /* MUST be in sync with hypervisor. */
    struct livepatch_func {
        const char *name;
        void *new_addr;
        void *old_addr;
        uint32_t new_size;
        uint32_t old_size;
        uint8_t version;
        uint8_t opaque[31];
        /* Added to livepatch payload version 2: */
        uint8_t applied;
        uint8_t _pad[7];
        livepatch_expectation_t expect;
    };

    /* Our replacement function for xen_extra_version. */
    const char *xen_hello_world(void)
    {
        return "Hello World";
    }

    /* Symbol name of the function being replaced. */
    static const char patch_this_fnc[] = "xen_extra_version";

    /* The section attribute has to precede the initializer. */
    struct livepatch_func livepatch_hello_world
        __attribute__((__section__(".livepatch.funcs"))) = {
        .version = LIVEPATCH_PAYLOAD_VERSION,
        .name = patch_this_fnc,
        .new_addr = xen_hello_world,
        .old_addr = (void *)0xffff82d08013963c, /* Extracted from xen-syms. */
        .new_size = 13, /* To be computed by scripts. */
        .old_size = 13, /* -----------""---------------  */
        /* Added to livepatch payload version 2: */
        .expect = { /* All fields to be filled manually */
            .enabled = 1,
            .len = 5,
            .rsv = 0,
            .data = { 0x48, 0x8d, 0x05, 0x33, 0x1C }
        },
    };

Code must be compiled with `-fPIC`.

### Hooks

#### .livepatch.hooks.load and .livepatch.hooks.unload

This section contains an array of function pointers to be executed
before the payload is applied (.livepatch.funcs) or after reverting
the payload. This is useful to prepare data structures that need to
be modified when patching.

Each entry in this array is eight bytes.

The type definitions of the functions are as follows:

    typedef void (*livepatch_loadcall_t)(void);
    typedef void (*livepatch_unloadcall_t)(void);

#### .livepatch.hooks.preapply

This section contains a pointer to a single function pointer to be executed
before the apply action is scheduled (and thereby before CPUs are put into the
quiescing zone). This is useful to prevent a payload from being applied when
certain expected conditions aren't met or when mutating actions implemented
in the hook fail or cannot be executed.
This type of hook has access to the payload structure.

Each entry in this array is eight bytes.

The type definition of the function is as follows:

    typedef int livepatch_precall_t(livepatch_payload_t *arg);

#### .livepatch.hooks.postapply

This section contains a pointer to a single function pointer to be executed
after the apply action has finished and after all CPUs have left the quiescing
zone. This is useful to follow up on actions performed by the preapply hook,
especially when module application was successful, or to undo certain
preparation steps of the preapply hook in case of a failure. The
success/failure error code is provided to the postapply hook via the `rc`
field of the payload structure.
This type of hook has access to the payload structure.

Each entry in this array is eight bytes.

The type definition of the function is as follows:

    typedef void livepatch_postcall_t(livepatch_payload_t *arg);

#### .livepatch.hooks.prerevert

This section contains a pointer to a single function pointer to be executed
before the revert action is scheduled (and thereby before CPUs are put into
the quiescing zone). This is useful to prevent a payload from being reverted
when certain expected conditions aren't met or when mutating actions
implemented in the hook fail or cannot be executed.
This type of hook has access to the payload structure.

Each entry in this array is eight bytes.

The type definition of the function is as follows:

    typedef int livepatch_precall_t(livepatch_payload_t *arg);

#### .livepatch.hooks.postrevert

This section contains a pointer to a single function pointer to be executed
after the revert action has finished and after all CPUs have left the
quiescing zone. This is useful to clean up all previously executed mutating
actions in order to restore the original system state from before the current
payload application. The success/failure error code is provided to the
postrevert hook via the `rc` field of the payload structure.
This type of hook has access to the payload structure.

Each entry in this array is eight bytes.

The type definition of the function is as follows:

    typedef void livepatch_postcall_t(livepatch_payload_t *arg);

#### .livepatch.hooks.apply and .livepatch.hooks.revert

This section contains a pointer to a single function pointer to be executed
instead of the default apply (or revert) action function. This is useful to
replace or augment the default behavior of the apply (or revert) action that
requires all CPUs to be in the quiescing zone.
This type of hook has access to the payload structure.

Each entry in this array is eight bytes.

The type definition of the function is as follows:

    typedef int livepatch_actioncall_t(livepatch_payload_t *arg);
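
The ordering of the apply-side hooks described in this section can be sketched
as follows. This is a user-space illustration only: the `struct
livepatch_payload` layout, the field names other than `rc`, and the functions
`default_apply`, `run_apply`, `veto_always` and `record_rc` are all
hypothetical, and skipping the post hook after a veto is an assumption of the
sketch rather than something this document specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the hypervisor's livepatch_payload_t. */
typedef struct livepatch_payload livepatch_payload_t;

typedef int livepatch_precall_t(livepatch_payload_t *arg);
typedef void livepatch_postcall_t(livepatch_payload_t *arg);
typedef int livepatch_actioncall_t(livepatch_payload_t *arg);

struct livepatch_payload {
    int rc;                          /* result visible to the post hook */
    livepatch_precall_t *preapply;   /* may veto; NULL if absent */
    livepatch_actioncall_t *apply;   /* replaces the default action */
    livepatch_postcall_t *postapply; /* NULL if absent */
};

/* Placeholder for the default action (inserting the trampolines). */
int default_apply(livepatch_payload_t *p)
{
    (void)p;
    return 0;
}

int run_apply(livepatch_payload_t *p)
{
    /* The pre hook runs before the action is scheduled and can veto it. */
    if (p->preapply) {
        int rc = p->preapply(p);

        if (rc)
            return rc;
    }

    /* In the hypervisor this step runs with all CPUs quiesced. */
    p->rc = p->apply ? p->apply(p) : default_apply(p);

    /* The post hook runs after CPUs are released and can inspect p->rc. */
    if (p->postapply)
        p->postapply(p);
    return p->rc;
}

/* Example hooks used to exercise the flow. */
int last_rc_seen = 1;

int veto_always(livepatch_payload_t *p)
{
    (void)p;
    return -1;
}

void record_rc(livepatch_payload_t *p)
{
    last_rc_seen = p->rc;
}
```

A payload whose `preapply` is `veto_always` is never applied, while one with
only `record_rc` installed as `postapply` observes `rc == 0` from the default
action.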

### .livepatch.xen_depends, .livepatch.depends and .note.gnu.build-id

To support dependency checking and safe loading (to load the
appropriate payload against the right hypervisor) there is a need
to embed a build-id dependency.

This is done by the payload containing sections `.livepatch.xen_depends`
and `.livepatch.depends` which follow the format of an ELF Note.
The contents of these (name, and description) are specific to the linker
utilized to build the hypervisor and payload.

If GNU linker is used then the name is `GNU` and the description
is a NT_GNU_BUILD_ID type ID. The description can be an SHA1
checksum, MD5 checksum or any unique value.

The size of these structures varies with the `--build-id` linker option.

There are two kinds of build-id dependencies:

 * Xen build-id dependency (.livepatch.xen_depends section)
 * previous payload build-id dependency (.livepatch.depends section)

See "Live patch interdependencies" for more information.
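
A sketch of extracting a build-id from one of these sections is shown below.
It relies on the standard ELF note layout (three 32-bit words for name size,
description size and type, followed by the name and the description, each
padded to 4 bytes) and on `NT_GNU_BUILD_ID` being 3. The function names are
hypothetical, the header is read in host byte order, and only a single note
per section is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NT_GNU_BUILD_ID 3 /* note type the GNU linker uses for build-ids */

/* Round up to the 4-byte alignment used inside ELF notes. */
static size_t note_align(size_t n)
{
    return (n + 3) & ~(size_t)3;
}

/*
 * Locate the build-id in a section holding one ELF note.
 * On success returns 0 and sets *id and *id_len to the description bytes.
 */
int find_build_id(const uint8_t *sec, size_t sec_len,
                  const uint8_t **id, size_t *id_len)
{
    uint32_t hdr[3]; /* n_namesz, n_descsz, n_type */
    size_t name_off = sizeof(hdr), desc_off;

    if (sec_len < sizeof(hdr))
        return -1;
    memcpy(hdr, sec, sizeof(hdr));

    if (hdr[0] > sec_len - name_off)
        return -1; /* name does not fit in the section */
    desc_off = name_off + note_align(hdr[0]);
    if (desc_off > sec_len || hdr[1] > sec_len - desc_off)
        return -1; /* description does not fit in the section */

    /* The GNU linker uses the name "GNU" (4 bytes with its NUL). */
    if (hdr[2] != NT_GNU_BUILD_ID || hdr[0] != 4 ||
        memcmp(sec + name_off, "GNU", 4) != 0)
        return -1;

    *id = sec + desc_off;
    *id_len = hdr[1];
    return 0;
}

/* Two build-ids match when both length and contents are identical. */
int build_id_equal(const uint8_t *a, size_t a_len,
                   const uint8_t *b, size_t b_len)
{
    return a_len == b_len && memcmp(a, b, a_len) == 0;
}
```

A dependency check would then compare the description found in
`.livepatch.xen_depends` against the build-id of the running hypervisor using
`build_id_equal`.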

## Hypercalls

We will employ the sub operations of the system management hypercall (sysctl).
There are to be four sub-operations:

 * uploading a payload.
 * listing the summary and state of the uploaded payloads.
 * getting a particular payload's summary and its state.
 * commanding the hypervisor to apply, delete, or revert a payload.

Most of the actions are asynchronous, therefore the caller is responsible
for verifying that an action has completed properly by retrieving the payload
summary and verifying that there are no error codes associated with the
payload.

We **MUST** make some of them asynchronous because patching requires every
physical CPU to be in lock-step with the others. The patching mechanism, while
an implementation detail, is not a short operation and as such the design
**MUST** assume it will be a long-running operation.

The sub-operations will spell out how preemption is to be handled (if at all).

Furthermore it is possible to have multiple different payloads for the same
function. As such a unique name per payload has to be visible to allow proper
manipulation.

The hypercall is part of `xen_sysctl`. The top level structure contains
one uint32_t to determine the sub-operations and one padding field which
**MUST** always be zero.

    struct xen_sysctl_livepatch_op {
        uint32_t cmd;                   /* IN: XEN_SYSCTL_LIVEPATCH_*. */
        uint32_t pad;                   /* IN: Always zero. */
        union {
              ... see below ...
        } u;
    };

while the rest of the hypercall-specific structures are part of this structure.

### Basic type: struct xen_livepatch_name

Most of the hypercalls employ a shared structure called `struct xen_livepatch_name`
which contains:

 * `name` - pointer where the string for the name is located.
 * `size` - the size of the string
 * `pad` - padding - to be zero.

The structure is as follows:

    /*
     * Uniquely identifies the payload. Should be human readable.
     * Includes the NUL terminator.
     */
    #define XEN_LIVEPATCH_NAME_SIZE 128
    struct xen_livepatch_name {
        XEN_GUEST_HANDLE_64(char) name;         /* IN, pointer to name. */
        uint16_t size;                          /* IN, size of name. May be up to
                                                   XEN_LIVEPATCH_NAME_SIZE. */
        uint16_t pad[3];                        /* IN: MUST be zero. */
    };
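
The constraints stated for this structure can be checked along these lines.
The sketch is a user-space stand-in, not the hypervisor's validation code: a
plain pointer replaces `XEN_GUEST_HANDLE_64(char)`, `-EINVAL` from `errno.h`
stands in for `-XEN_EINVAL`, and `struct livepatch_name_arg` and
`check_livepatch_name` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define XEN_LIVEPATCH_NAME_SIZE 128

/* Stand-in for struct xen_livepatch_name with a plain pointer. */
struct livepatch_name_arg {
    const char *name;
    uint16_t size;   /* includes the NUL terminator */
    uint16_t pad[3]; /* must be zero */
};

/* Returns 0 if the name argument is well formed, -EINVAL otherwise. */
int check_livepatch_name(const struct livepatch_name_arg *n)
{
    if (n->pad[0] || n->pad[1] || n->pad[2])
        return -EINVAL;
    if (n->name == NULL || n->size == 0 ||
        n->size > XEN_LIVEPATCH_NAME_SIZE)
        return -EINVAL;
    /* size counts the terminator, so the last byte has to be NUL. */
    if (n->name[n->size - 1] != '\0')
        return -EINVAL;
    return 0;
}
```

A caller would typically set `size` to `strlen(name) + 1` so that the
terminator is included, as the structure comment requires.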

### XEN_SYSCTL_LIVEPATCH_UPLOAD (0)

Upload a payload to the hypervisor. The payload is verified
against basic checks and if there are any issues the proper return code
will be returned. The payload is not applied at this time - that is
controlled by *XEN_SYSCTL_LIVEPATCH_ACTION*.

The caller provides:

 * A `struct xen_livepatch_name` called `name` which has the unique name.
 * `size` the size of the ELF payload (in bytes).
 * `payload` the virtual address of where the ELF payload is.

The `name` could be an UUID that stays fixed forever for a given
payload. It can be embedded into the ELF payload at creation time
and extracted by tools.

The return value is zero if the payload was successfully uploaded.
Otherwise an -XEN_EXX return value is provided. Duplicate names are not supported.

The `payload` is the ELF payload as mentioned in the `Payload format` section.

The structure is as follows:

    struct xen_sysctl_livepatch_upload {
        xen_livepatch_name_t name;          /* IN, name of the patch. */
        uint64_t size;                      /* IN, size of the ELF file. */
        XEN_GUEST_HANDLE_64(uint8) payload; /* IN: ELF file. */
    };

### XEN_SYSCTL_LIVEPATCH_GET (1)

Retrieve the status of a specific payload. The caller provides:

 * A `struct xen_livepatch_name` called `name` which has the unique name.
 * A `struct xen_livepatch_status` structure. The member values will
   be over-written upon completion.

Upon completion the `struct xen_livepatch_status` is updated.

 * `status` - indicates the current status of the payload:
   * *LIVEPATCH_STATUS_CHECKED* (1) loaded and the ELF payload safety checks passed.
   * *LIVEPATCH_STATUS_APPLIED* (2) loaded, checked, and applied.
   *  No other value is possible.
 * `rc` - -XEN_EXX type errors encountered while performing the last
   LIVEPATCH_ACTION_* operation. The normal values can be zero or -XEN_EAGAIN which
   respectively mean: success or operation in progress. Other values
   imply an error occurred. If there is an error in `rc`, `status` will **NOT**
   have changed.

The return value of the hypercall is zero on success and -XEN_EXX on failure.
(Note that the `rc` value can be different from the return value, as in
rc = -XEN_EAGAIN and return value can be 0).

For example, supposing there is a payload:

    status: LIVEPATCH_STATUS_CHECKED
    rc: 0

We apply an action - LIVEPATCH_ACTION_REVERT - to revert it (which won't work
as we have not even applied it). Afterwards we will have:

    status: LIVEPATCH_STATUS_CHECKED
    rc: -XEN_EINVAL

It has failed but it remains loaded.

This operation is synchronous and does not require preemption.

The structure is as follows:

    struct xen_livepatch_status {
    #define LIVEPATCH_STATUS_CHECKED      1
    #define LIVEPATCH_STATUS_APPLIED      2
        uint32_t state;                 /* OUT: LIVEPATCH_STATE_*. */
        int32_t rc;                     /* OUT: 0 if no error, otherwise -XEN_EXX. */
    };

    struct xen_sysctl_livepatch_get {
        xen_livepatch_name_t name;      /* IN, the name of the payload. */
        xen_livepatch_status_t status;  /* IN/OUT: status of the payload. */
    };
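
As a rough illustration of how a toolstack might interpret `rc` after a
successful **XEN_SYSCTL_LIVEPATCH_GET**, the sketch below classifies it into
the three cases described above. The enum and function names are hypothetical,
the structure is a local mirror of the one shown above, and the numeric value
of `XEN_EAGAIN` is assumed to match Xen's public errno header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XEN_EAGAIN 11 /* assumed to match Xen's public errno.h */

/* Local mirror of the structure shown above. */
struct xen_livepatch_status {
    uint32_t state;
    int32_t rc;
};

/* Hypothetical classification of the last LIVEPATCH_ACTION_* result. */
enum livepatch_outcome {
    LIVEPATCH_DONE,        /* rc == 0 */
    LIVEPATCH_IN_PROGRESS, /* rc == -XEN_EAGAIN */
    LIVEPATCH_FAILED       /* any other rc; state is unchanged */
};

static enum livepatch_outcome
livepatch_classify(const struct xen_livepatch_status *s)
{
    assert(s != NULL);

    if (s->rc == 0)
        return LIVEPATCH_DONE;
    if (s->rc == -XEN_EAGAIN)
        return LIVEPATCH_IN_PROGRESS;
    return LIVEPATCH_FAILED;
}
```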

### XEN_SYSCTL_LIVEPATCH_LIST (2)

Retrieve an array of abbreviated status, names and metadata of payloads that are
loaded in the hypervisor.

The caller provides:

 * `version`. Version of the payload. Caller should re-use the field provided by
    the hypervisor. If the value differs the data is stale.
 * `idx` Index iterator. The index into the hypervisor's payload count. It is
    recommended that on first invocation zero be used so that `nr` (which the
    hypervisor will update with the remaining payload count) be provided.
    Also the hypervisor will provide `version` with the most current value,
    calculated total size of all payloads' names and calculated total size of
    all payload's metadata.
 * `nr` The max number of entries to populate. Can be zero which will result
    in the hypercall being a probing one and return the number of payloads
    (and update the `version`).
 * `pad` - *MUST* be zero.
 * `status` Virtual address of where to write `struct xen_livepatch_status`
   structures. Caller *MUST* allocate up to `nr` of them.
 * `name` - Virtual address of where to write the unique name of the payloads.
   Caller *MUST* allocate enough space to be able to store all received data
   (i.e. total allocated space *MUST* match the `name_total_size` value
   provided by the hypervisor). Individual payload name cannot be longer than
   **XEN_LIVEPATCH_NAME_SIZE** bytes. Note that **XEN_LIVEPATCH_NAME_SIZE**
   includes the NUL terminator.
 * `len` - Virtual address of where to write the length of each unique name
   of the payload. Caller *MUST* allocate up to `nr` of them. Each *MUST* be
   of sizeof(uint32_t) (4 bytes).
 * `metadata` - Virtual address of where to write the metadata of the payloads.
   Caller *MUST* allocate enough space to be able to store all received data
   (i.e. total allocated space *MUST* match the `metadata_total_size` value
   provided by the hypervisor). Individual payload metadata string can be of
   arbitrary length. The metadata string format is: key=value\\0...key=value\\0.
 * `metadata_len` - Virtual address of where to write the length of each metadata
   string of the payload. Caller *MUST* allocate up to `nr` of them. Each *MUST*
   be of sizeof(uint32_t) (4 bytes).
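
The buffers described above could be walked as in the following sketch. It
assumes that the names are packed back to back in the `name` buffer, that
each `len` entry counts the NUL terminator, and that a single payload's
metadata is the `key=value\0` sequence described above. Both helper names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return the i-th name from a packed name buffer, or NULL if i >= nr. */
static const char *nth_name(const char *names, const uint32_t *len,
                            uint32_t nr, uint32_t i)
{
    uint32_t off = 0, n;

    assert(names != NULL && len != NULL);

    if (i >= nr)
        return NULL;

    /* Skip over the preceding names using their recorded lengths. */
    for (n = 0; n < i; n++)
        off += len[n];

    return names + off;
}

/* Look up `key` in one payload's metadata; returns the value or NULL. */
static const char *metadata_find(const char *md, uint32_t md_len,
                                 const char *key)
{
    size_t klen = strlen(key);
    uint32_t off = 0;

    while (off < md_len) {
        const char *entry = md + off;
        const char *end = memchr(entry, '\0', md_len - off);
        size_t elen;

        if (end == NULL)
            break; /* unterminated trailing entry: stop */

        elen = (size_t)(end - entry);
        if (elen > klen && entry[klen] == '=' &&
            memcmp(entry, key, klen) == 0)
            return entry + klen + 1;

        off += (uint32_t)elen + 1;
    }

    return NULL;
}
```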

If the hypercall returns a positive number, it is the number (up to `nr`
provided to the hypercall) of the payloads returned, along with `nr` updated
with the number of remaining payloads, `version` updated (it may be the same
across hypercalls - if it varies the data is stale and further calls could
fail), `name_total_size` and `metadata_total_size` containing total sizes of
transferred data for both the arrays.
The `status`, `name`, `len`, `metadata` and `metadata_len` are updated at their
designated index value (`idx`) with the returned value of data.

If the hypercall returns -XEN_E2BIG the `nr` is too big and should be
lowered.

If the hypercall returns a zero value there are no more payloads.

Note that due to the asynchronous nature of hypercalls the control domain might
have added or removed a number of payloads making this information stale. It is
the responsibility of the toolstack to use the `version` field to check
between each invocation. If the version differs it should discard the stale
data and start from scratch. It is OK for the toolstack to use the new
`version` field.

The `struct xen_livepatch_status` structure contains the status of a payload which includes:

 * `status` - indicates the current status of the payload:
   * *LIVEPATCH_STATUS_CHECKED* (1) loaded and the ELF payload safety checks passed.
   * *LIVEPATCH_STATUS_APPLIED* (2) loaded, checked, and applied.
   *  No other value is possible.
 * `rc` - -XEN_EXX type errors encountered while performing the last
   LIVEPATCH_ACTION_* operation. The normal values can be zero or -XEN_EAGAIN which
   respectively mean: success or operation in progress. Other values
   imply an error occurred. If there is an error in `rc`, `status` will **NOT**
   have changed.

The structure is as follows:

    struct xen_sysctl_livepatch_list {
        uint32_t version;                       /* OUT: Hypervisor stamps value.
                                                   If varies between calls, we are
                                                   getting stale data. */
        uint32_t idx;                           /* IN: Index into hypervisor list. */
        uint32_t nr;                            /* IN: How many status, names, and len
                                                   should be filled out. Can be zero to get
                                                   amount of payloads and version.
                                                   OUT: How many payloads left. */
        uint32_t pad;                           /* IN: Must be zero. */
        uint32_t name_total_size;               /* OUT: Total size of all transfer names */
        uint32_t metadata_total_size;           /* OUT: Total size of all transfer metadata */
        XEN_GUEST_HANDLE_64(xen_livepatch_status_t) status;  /* OUT. Must have enough
                                                   space allocated for nr of them. */
        XEN_GUEST_HANDLE_64(char) name;         /* OUT: Array of names. Each member
                                                   may have an arbitrary length up to
                                                   XEN_LIVEPATCH_NAME_SIZE bytes. Must have
                                                   nr of them. */
        XEN_GUEST_HANDLE_64(uint32) len;        /* OUT: Array of lengths of name's.
                                                   Must have nr of them. */
        XEN_GUEST_HANDLE_64(char) metadata;     /* OUT: Array of metadata strings. Each
                                                   member may have an arbitrary length.
                                                   Must have nr of them. */
        XEN_GUEST_HANDLE_64(uint32) metadata_len;  /* OUT: Array of lengths of metadata's.
                                                      Must have nr of them. */

    };

### XEN_SYSCTL_LIVEPATCH_ACTION (3)

Perform an operation on the payload structure referenced by the `name` field.
The operation request is asynchronous and the status should be retrieved
by using either **XEN_SYSCTL_LIVEPATCH_GET** or **XEN_SYSCTL_LIVEPATCH_LIST** hypercall.

The caller provides:

 * A `struct xen_livepatch_name` `name` containing the unique name.
 * `cmd` The command requested:
   * *LIVEPATCH_ACTION_UNLOAD* (1) Unload the payload.
     Any further hypercalls against the `name` will result in failure unless
     **XEN_SYSCTL_LIVEPATCH_UPLOAD** hypercall is performed with same `name`.
   * *LIVEPATCH_ACTION_REVERT* (2) Revert the payload. If the operation takes
     more time than the upper bound of time the `rc` in `xen_livepatch_status`
     retrieved via **XEN_SYSCTL_LIVEPATCH_GET** will be -XEN_EBUSY.
   * *LIVEPATCH_ACTION_APPLY* (3) Apply the payload. If the operation takes
     more time than the upper bound of time the `rc` in `xen_livepatch_status`
     retrieved via **XEN_SYSCTL_LIVEPATCH_GET** will be -XEN_EBUSY.
   * *LIVEPATCH_ACTION_REPLACE* (4) Revert all applied payloads and apply this
     payload. If the operation takes more time than the upper bound of time
     the `rc` in `xen_livepatch_status` retrieved via **XEN_SYSCTL_LIVEPATCH_GET**
     will be -XEN_EBUSY.
 * `time` The upper bound of time (ns) the cmd should take. Zero means to use
   the hypervisor default. If the operation does not succeed within that time
   it goes into an error state.
 * `flags` provides additional parameters for an action:
   * *LIVEPATCH_ACTION_APPLY_NODEPS* (1) Apply action ignores inter-module
     buildid dependency. Checks only if module is built for given hypervisor by
     comparing buildid.
 * `pad` - *MUST* be zero.

The return value will be zero unless the provided fields are incorrect.

The structure is as follows:

    #define LIVEPATCH_ACTION_UNLOAD  1
    #define LIVEPATCH_ACTION_REVERT  2
    #define LIVEPATCH_ACTION_APPLY   3
    #define LIVEPATCH_ACTION_REPLACE 4
    struct xen_sysctl_livepatch_action {
        xen_livepatch_name_t name;              /* IN, name of the patch. */
        uint32_t cmd;                           /* IN: LIVEPATCH_ACTION_* */
        uint32_t time;                          /* IN: If zero then uses */
                                                /* hypervisor default. */
                                                /* Or upper bound of time (ns) */
                                                /* for operation to take. */
        uint32_t flags;                         /* IN: action flags. */
                                                /* Provide additional parameters */
                                                /* for an action. */
        uint32_t pad;                           /* IN: Always zero. */
    };


## State diagrams of LIVEPATCH_ACTION commands.

There is a strict ordering of the states in which each command is valid.
The LIVEPATCH_ACTION prefix has been dropped to ease reading and
the diagram does not include the LIVEPATCH_STATES:

                 /->\
                 \  /
    UNLOAD <--- CHECK ---> REPLACE|APPLY --> REVERT --\
                   \                                  |
                    \-------------------<-------------/

## State transition table of LIVEPATCH_ACTION commands and LIVEPATCH_STATUS.

Note that:

 - The CHECKED state is the starting one achieved with *XEN_SYSCTL_LIVEPATCH_UPLOAD* hypercall.
 - The REVERT operation on success will automatically move to the CHECKED state.
 - There are two STATES: CHECKED and APPLIED.
 - There are four actions (aka commands): APPLY, REPLACE, REVERT, and UNLOAD.

The state transition table of valid states and action states:

    +---------+---------+--------------------------------+-------+--------+
    | ACTION  | Current | Result                         |   Next STATE:  |
    |         | STATE   |                                |CHECKED|APPLIED |
    +---------+---------+--------------------------------+-------+--------+
    | UNLOAD  | CHECKED | Unload payload. Always works.  |       |        |
    |         |         | No next states.                |       |        |
    +---------+---------+--------------------------------+-------+--------+
    | APPLY   | CHECKED | Apply payload (success).       |       |   x    |
    +---------+---------+--------------------------------+-------+--------+
    | APPLY   | CHECKED | Apply payload (error|timeout)  |   x   |        |
    +---------+---------+--------------------------------+-------+--------+
    | REPLACE | CHECKED | Revert payloads and apply new  |       |   x    |
    |         |         | payload with success.          |       |        |
    +---------+---------+--------------------------------+-------+--------+
    | REPLACE | CHECKED | Revert payloads and apply new  |   x   |        |
    |         |         | payload with error.            |       |        |
    +---------+---------+--------------------------------+-------+--------+
    | REVERT  | APPLIED | Revert payload (success).      |   x   |        |
    +---------+---------+--------------------------------+-------+--------+
    | REVERT  | APPLIED | Revert payload (error|timeout) |       |   x    |
    +---------+---------+--------------------------------+-------+--------+

All the other state transitions are invalid.
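
The table above can be expressed as a small transition function. This is only
a sketch for illustration: the constants mirror the values given earlier in
this document, while `LP_UNLOADED`, `LP_INVALID` and the function name are
hypothetical.

```c
#define LIVEPATCH_STATUS_CHECKED 1
#define LIVEPATCH_STATUS_APPLIED 2

#define LIVEPATCH_ACTION_UNLOAD  1
#define LIVEPATCH_ACTION_REVERT  2
#define LIVEPATCH_ACTION_APPLY   3
#define LIVEPATCH_ACTION_REPLACE 4

#define LP_UNLOADED 0    /* hypothetical: payload gone, no next state */
#define LP_INVALID  (-1) /* hypothetical: transition not in the table */

/*
 * Returns the next state for (state, action), where `success` says
 * whether the action completed without error or timeout.
 */
static int livepatch_next_state(int state, int action, int success)
{
    switch (action) {
    case LIVEPATCH_ACTION_UNLOAD:
        return state == LIVEPATCH_STATUS_CHECKED ? LP_UNLOADED : LP_INVALID;

    case LIVEPATCH_ACTION_APPLY:
    case LIVEPATCH_ACTION_REPLACE:
        if (state != LIVEPATCH_STATUS_CHECKED)
            return LP_INVALID;
        return success ? LIVEPATCH_STATUS_APPLIED : LIVEPATCH_STATUS_CHECKED;

    case LIVEPATCH_ACTION_REVERT:
        if (state != LIVEPATCH_STATUS_APPLIED)
            return LP_INVALID;
        return success ? LIVEPATCH_STATUS_CHECKED : LIVEPATCH_STATUS_APPLIED;

    default:
        return LP_INVALID;
    }
}
```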

## Sequence of events.

The normal sequence of events is to:

 1. *XEN_SYSCTL_LIVEPATCH_UPLOAD* to upload the payload. If there are errors *STOP* here.
 2. *XEN_SYSCTL_LIVEPATCH_GET* to check the `->rc`. If *-XEN_EAGAIN* spin. If zero go to next step.
 3. *XEN_SYSCTL_LIVEPATCH_ACTION* with *LIVEPATCH_ACTION_APPLY* to apply the patch.
 4. *XEN_SYSCTL_LIVEPATCH_GET* to check the `->rc`. If *-XEN_EAGAIN* spin. If zero exit with success.
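
The "spin while *-XEN_EAGAIN*" part of steps 2 and 4 might look like the
sketch below. The hypercall is abstracted behind a callback so that the loop
is self-contained. `get_rc_fn`, `livepatch_wait`, the poll limit and the
`fake_get` stand-in are all hypothetical, and the value of `XEN_EAGAIN` is
assumed to match Xen's public errno header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XEN_EAGAIN 11 /* assumed to match Xen's public errno.h */

/*
 * Hypothetical wrapper around XEN_SYSCTL_LIVEPATCH_GET: returns the
 * hypercall's own return value and stores the payload's rc in *rc.
 */
typedef int (*get_rc_fn)(void *ctx, int32_t *rc);

/* Poll until rc is no longer -XEN_EAGAIN, or max_polls is exhausted. */
static int livepatch_wait(get_rc_fn get, void *ctx, unsigned int max_polls)
{
    unsigned int i;

    assert(get != NULL);

    for (i = 0; i < max_polls; i++) {
        int32_t rc;
        int ret = get(ctx, &rc);

        if (ret)
            return ret; /* the hypercall itself failed */
        if (rc != -XEN_EAGAIN)
            return rc;  /* zero on success, otherwise the error */
    }

    return -XEN_EAGAIN; /* still in progress after max_polls */
}

/* Stand-in for the hypervisor, for demonstration only. */
struct fake_payload {
    unsigned int pending; /* number of polls that report "in progress" */
    int32_t final_rc;     /* rc reported afterwards */
};

static int fake_get(void *ctx, int32_t *rc)
{
    struct fake_payload *p = ctx;

    if (p->pending) {
        p->pending--;
        *rc = -XEN_EAGAIN;
    } else {
        *rc = p->final_rc;
    }
    return 0;
}
```

A real caller would likely sleep between polls rather than spin tightly.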


## Addendum

Implementation quirks should not be discussed in a design document.

However these observations can provide aid when developing against this
document.


### Alternative assembler

Alternative assembler is a mechanism to use different instructions depending
on what the CPU supports. This is done by providing multiple streams of code
that can be patched in - or if the CPU does not support it - padded with
`nop` operations. The alternative assembler macros cause the compiler to
expand the code so that the most generic code is put in place, and to emit a
special ELF .section header to tag this location. During run-time the hypervisor
can leave the areas alone or patch them with better suited opcodes.

Note that patching functions that copy to or from guest memory requires
support for alternatives. For example this can be due to SMAP
(specifically *stac* and *clac* operations) which is enabled on Broadwell
and later architectures. It may be related to other alternative instructions.

### When to patch

During the discussion on the design two candidates emerged where
the call stack for each CPU would be deterministic. This would
minimize the chance of the patch not being applied due to safety
checks failing, such as the check against patching code which
is on the stack - which can lead to corruption.

#### Rendezvous code instead of stop_machine for patching

The hypervisor's time rendezvous code runs synchronously across all CPUs
every second. Using the `stop_machine` to patch can stall the time rendezvous
code and result in NMI. As such having the patching be done at the tail
of rendezvous code should avoid this problem.

However the entrance point for that code is `do_softirq ->
timer_softirq_action -> time_calibration` which ends up calling
`on_selected_cpus` on remote CPUs.

The remote CPUs receive CALL_FUNCTION_VECTOR IPI and execute the
desired function.

#### Before entering the guest code.

Before we call VMXResume we check whether any soft IRQs need to be executed.
This is a good spot because all Xen stacks are effectively empty at
that point.

To rendezvous all the CPUs a barrier with a maximum timeout (which
could be adjusted), combined with forcing all other CPUs through the
hypervisor with IPIs, can be utilized to execute lockstep instructions
on all CPUs.

The approach is similar in concept to `stop_machine` and the time rendezvous
but is time-bound. However the local CPU stack is much shorter and
a lot more deterministic.

This is implemented in the Xen hypervisor.

### Compiling the hypervisor code

Hotpatch generation often requires support for compiling the target
with `-ffunction-sections` / `-fdata-sections`.  Changes would have to
be done to the linker scripts to support this.

### Generation of Live Patch ELF payloads

The design of that is not discussed in this document.

This is implemented in a separate tool which lives in a separate
GIT repo.

Currently it resides at https://xenbits.xen.org/git-http/livepatch-build-tools.git

### Exception tables and symbol tables growth

We may need support for adapting or augmenting exception tables if
patching such code.  Hotpatches may need to bring their own small
exception tables (similar to how Linux modules support this).

If supporting hotpatches that introduce additional exception-locations
is not important, one could also change the exception table in-place
and reorder it afterwards.

As it turned out, almost every patch (XSA) to a non-trivial function requires
additional entries in the exception table and/or the bug frames.

This is implemented in the Xen hypervisor.

### .rodata sections

The patching might require strings to be updated as well. As such we must be
also able to patch the strings as needed. This sounds simple - but the compiler
has a habit of coalescing strings that are the same - which means if we in-place
alter the strings - other users will be inadvertently affected as well.

This is also where pointers to functions live - and we may need to patch this
as well. And switch-style jump tables.

To guard against that we must be prepared to do patching similar to
trampoline patching or in-line depending on the flavour. If we can
do in-line patching we would need to:

 * Alter `.rodata` to be writeable.
 * Inline patch.
 * Alter `.rodata` to be read-only.

If we are doing trampoline patching we would need to:

 * Allocate a new memory location for the string.
 * All locations which use this string will have to be updated to use the
   offset to the string.
 * Mark the region RO when we are done.

The trampoline patching is implemented in the Xen hypervisor.

### .bss and .data sections.

In place patching writable data is not suitable as it is unclear what should be done
depending on the current state of data. As such it should not be attempted.

However, functions which are being patched can bring in changes to strings
(.data or .rodata section changes), or even to .bss sections.

As such the ELF payload can introduce new .rodata, .bss, and .data sections.
Patching in the new function will end up also patching in the new .rodata
section and the new function will reference the new string in the new
.rodata section.

This is implemented in the Xen hypervisor.

### Security

Only the privileged domain should be allowed to do this operation.

### Live patch interdependencies

Interdependencies between live patches are tricky.

There are three ways this can be addressed:

 * A single large patch that subsumes and replaces all previous ones.
   Over the life-time of patching the hypervisor this large patch
   grows to accumulate all the code changes.
 * Hotpatch stack - where a mechanism exists that loads the hotpatches
   in the same order they were built in. We would need a build-id
   of the hypervisor to make sure the hot-patches are built against the
   correct build.
 * Payload containing the old code to check against that. That allows
   the hotpatches to be loaded independently (if they don't overlap) - or
   if the old code also contains previously patched code - even if they
   overlap.

The disadvantage of the first option, a single large patch, is that it can grow
over time and does not provide a bisection mechanism to identify faulty patches.

The hot-patch stack puts strict requirements on the order of the patches
being loaded and requires a hypervisor build-id to match against.

The old code allows much more flexibility and an additional guard,
but is more complex to implement.

The second option which requires a build-id of the hypervisor
is implemented in the Xen hypervisor.

Specifically each payload has three build-id ELF notes:

 * The build-id of the payload itself (generated via --build-id).
 * The build-id of the Xen hypervisor it depends on (extracted from the
   hypervisor during build time).
 * The build-id of the payload it depends on (extracted from the
   previous payload or hypervisor during build time).

This means that every payload depends on the hypervisor build-id and on
the build-id of the previous payload in the stack.
The very first payload depends on the hypervisor build-id only.
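
The stacking rule above can be sketched as the check below. This is an
illustration under stated assumptions, not the hypervisor's actual code: the
`struct build_id` type and both function names are hypothetical, and the
first payload is assumed to carry the hypervisor build-id in its "depends on"
note.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of a build-id ELF note's contents. */
struct build_id {
    const uint8_t *bytes;
    size_t len;
};

static int build_id_eq(const struct build_id *a, const struct build_id *b)
{
    return a->len == b->len && memcmp(a->bytes, b->bytes, a->len) == 0;
}

/*
 * Returns 1 if a payload may be applied on top of the current stack.
 * `top_applied` is the build-id of the most recently applied payload,
 * or NULL if no payload has been applied yet.
 */
static int livepatch_deps_ok(const struct build_id *xen,
                             const struct build_id *top_applied,
                             const struct build_id *payload_xen_depends,
                             const struct build_id *payload_depends)
{
    const struct build_id *expected = top_applied ? top_applied : xen;

    assert(xen != NULL && payload_xen_depends != NULL &&
           payload_depends != NULL);

    /* The payload must have been built for this hypervisor ... */
    if (!build_id_eq(payload_xen_depends, xen))
        return 0;

    /* ... and must sit directly on top of what is currently applied. */
    return build_id_eq(payload_depends, expected);
}
```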

# Not Yet Done

This is for further development of live patching.

## TODO Goals

The implementation must also have a mechanism for (in no particular order):

 * Be able to lookup in the Xen hypervisor the symbol names of functions from the
   ELF payload. (Either as `symbol` or `symbol`+`offset`).
 * Be able to patch .rodata, .bss, and .data sections.
 * Deal with NMI/MCE checks during patching instead of ignoring them.
 * Further safety checks (blacklist of which functions cannot be patched, check
   the stack, make sure the payload is built with same compiler as hypervisor).
   Specifically we want to make sure that live patching codepaths cannot be patched.
 * NOP out the code sequence if `new_size` is zero.
 * Deal with other relocation types:  `R_X86_64_[8,16,32,32S]`, `R_X86_64_PC[8,16,64]`
   in payload file.

### Handle inlined \__LINE__

This problem is related to hotpatch construction
and potentially has influence on the design of the hotpatching
infrastructure in Xen.

For example:

We have file1.c with functions f1 and f2 (in that order).  f2 contains a
BUG() (or WARN()) macro and at that point embeds the source line number
into the generated code for f2.

Now we want to hotpatch f1 and the hotpatch source-code patch adds 2
lines to f1 and as a consequence shifts out f2 by two lines.  The newly
constructed file1.o will now contain differences in both binary
functions f1 (because we actually changed it with the applied patch) and
f2 (because the contained BUG macro embeds the new line number).

Without additional information, an algorithm comparing file1.o before
and after hotpatch application will determine both functions to be
changed and will have to include both into the binary hotpatch.

Options:

1. Transform source code patches for hotpatches to be line-neutral for
   each chunk.  This can be done in almost all cases with either
   reformatting of the source code or by introducing artificial
   preprocessor "#line n" directives to adjust for the introduced
   differences.

   This approach is low-tech and simple.  Potentially generated
   backtraces and existing debug information refers to the original
   build and does not reflect hotpatching state except for actually
   hotpatched functions but should be mostly correct.

2. Ignoring the problem and living with artificially large hotpatches
   that unnecessarily patch many functions.

   This approach might lead to some very large hotpatches depending on
   content of specific source file.  It may also trigger pulling in
   functions into the hotpatch that cannot reasonably be hotpatched due
   to limitations of a hotpatching framework (init-sections, parts of
   the hotpatching framework itself, ...) and may thereby prevent us
   from patching a specific problem.

   The decision between 1. and 2. can be made on a patch-by-patch
   basis.

3. Introducing an indirection table for storing line numbers and
   treating that specially for binary diffing. Linux may follow
   this approach.

   We might either use this indirection table for runtime use and patch
   that with each hotpatch (similarly to exception tables) or we might
   purely use it when building hotpatches to ignore functions that only
   differ at exactly the location where a line-number is embedded.
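
To illustrate option 1, the sketch below shows how a `#line` directive can
keep the line number seen by a later function stable after lines were added
to an earlier one. The function names and the line number 100 are made up for
the example. `f2_line()` stands in for a function whose BUG()-style macro
captures `__LINE__`.

```c
/* f1 after the hotpatch source change: two lines were added. */
static int f1(int x)
{
    x += 1; /* added by the patch */
    x += 1; /* added by the patch */
    return x;
}

/* Reset numbering so that f2 keeps its pre-patch line number. */
#line 100
static int f2_line(void) { return __LINE__; }
```

With the directive in place `f2_line()` reports the same line number no
matter how many lines were added to `f1`, so the object code generated for it
does not change because of the shift.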

For BUG(), WARN(), etc., the line number is embedded into the bug frame, not
the function itself.

Similar considerations are true to a lesser extent for \__FILE__, but it
could be argued that file renaming should be done outside of hotpatches.

## Signature checking requirements.

The signature checking requires that the layout of the data in memory
**MUST** be same for signature to be verified. This means that the payload
data layout in ELF format **MUST** match what the hypervisor would be
expecting such that it can properly do signature verification.

The signature is based on all of the payload data laid out contiguously
in memory. The signature is to be appended at the end of the ELF payload
prefixed with the string '`~Module signature appended~\n`', followed by
a signature header then followed by the signature, key identifier, and signer's
name.

Specifically the signature header would be:

    #define PKEY_ALGO_DSA       0
    #define PKEY_ALGO_RSA       1

    #define PKEY_ID_PGP         0 /* OpenPGP generated key ID */
    #define PKEY_ID_X509        1 /* X.509 arbitrary subjectKeyIdentifier */

    #define HASH_ALGO_MD4          0
    #define HASH_ALGO_MD5          1
    #define HASH_ALGO_SHA1         2
    #define HASH_ALGO_RIPE_MD_160  3
    #define HASH_ALGO_SHA256       4
    #define HASH_ALGO_SHA384       5
    #define HASH_ALGO_SHA512       6
    #define HASH_ALGO_SHA224       7
    #define HASH_ALGO_RIPE_MD_128  8
    #define HASH_ALGO_RIPE_MD_256  9
    #define HASH_ALGO_RIPE_MD_320 10
    #define HASH_ALGO_WP_256      11
    #define HASH_ALGO_WP_384      12
    #define HASH_ALGO_WP_512      13
    #define HASH_ALGO_TGR_128     14
    #define HASH_ALGO_TGR_160     15
    #define HASH_ALGO_TGR_192     16

    struct elf_payload_signature {
        u8      algo;           /* Public-key crypto algorithm PKEY_ALGO_*. */
        u8      hash;           /* Digest algorithm: HASH_ALGO_*. */
        u8      id_type;        /* Key identifier type PKEY_ID*. */
        u8      signer_len;     /* Length of signer's name */
        u8      key_id_len;     /* Length of key identifier */
        u8      __pad[3];
        __be32  sig_len;        /* Length of signature data */
    };

(Note that this has been borrowed from Linux module signature code.)
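
Since `sig_len` is big-endian, the header cannot simply be cast on a
little-endian host. A minimal sketch of decoding the 12-byte header from raw
bytes follows. The `sig_header_info` structure and `parse_sig_header` are
hypothetical names, and no validation beyond the length check is attempted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIG_HEADER_SIZE 12 /* 5 x u8, 3 bytes of padding, one __be32 */

/* Hypothetical host-endian view of struct elf_payload_signature. */
struct sig_header_info {
    uint8_t algo;
    uint8_t hash;
    uint8_t id_type;
    uint8_t signer_len;
    uint8_t key_id_len;
    uint32_t sig_len;
};

/* Returns 0 on success, -1 if fewer than SIG_HEADER_SIZE bytes remain. */
static int parse_sig_header(const uint8_t *raw, size_t len,
                            struct sig_header_info *out)
{
    assert(raw != NULL && out != NULL);

    if (len < SIG_HEADER_SIZE)
        return -1;

    out->algo = raw[0];
    out->hash = raw[1];
    out->id_type = raw[2];
    out->signer_len = raw[3];
    out->key_id_len = raw[4];
    /* raw[5..7] is padding; sig_len is stored big-endian. */
    out->sig_len = ((uint32_t)raw[8] << 24) | ((uint32_t)raw[9] << 16) |
                   ((uint32_t)raw[10] << 8) | (uint32_t)raw[11];

    return 0;
}
```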


### .bss and .data sections.

In place patching writable data is not suitable as it is unclear what should be done
depending on the current state of data. As such it should not be attempted.

That said we should provide hook functions so that the existing data
can be changed during payload application.

To guarantee safety we disallow re-applying a payload after it has been
reverted. This is because we cannot guarantee that the state of .bss
and .data will be exactly as it was during loading. Hence the administrator
MUST unload the payload and upload it again to apply it.

There is an exception to this: if the payload only has .livepatch.funcs;
and the .data or .bss sections are of zero length.

### Inline patching

The hypervisor should verify that the in-place patching would fit within
the code or data.

### Trampoline (e9 opcode), x86

The e9 opcode used for jmpq uses a 32-bit signed displacement. That means
we are limited to up to 2GB of virtual address to place the new code
from the old code. That should not be a problem since Xen hypervisor has
a very small footprint.

However if we need - we can always add two trampolines. One at the 2GB
limit that calls the next trampoline.

Please note there is a small limitation for trampolines in
function entries: The target function (+ trailing padding) must be able
to accommodate the trampoline. On x86 with +-2 GB relative jumps,
this means 5 bytes are required which means that `old_size` **MUST** be
at least five bytes if patching in trampoline.
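
The two constraints above (a five byte minimum and a signed 32-bit
displacement) can be sketched as follows. The function name and its error
convention are hypothetical. The displacement is computed relative to the end
of the five-byte `jmp` instruction.

```c
#include <stdint.h>

#define X86_JMP_REL32_SIZE 5 /* e9 opcode + 32-bit displacement */

/*
 * Encode "jmp rel32" from old_addr to new_addr into insn[].
 * Returns 0 on success, -1 if old_size is too small or the target
 * is out of the +-2GB range.
 */
static int x86_make_trampoline(uint64_t old_addr, uint64_t new_addr,
                               uint32_t old_size,
                               uint8_t insn[X86_JMP_REL32_SIZE])
{
    int64_t disp;
    uint32_t d;

    if (old_size < X86_JMP_REL32_SIZE)
        return -1;

    /* The displacement is relative to the end of the jmp instruction. */
    disp = (int64_t)(new_addr - (old_addr + X86_JMP_REL32_SIZE));
    if (disp < INT32_MIN || disp > INT32_MAX)
        return -1;

    d = (uint32_t)disp;
    insn[0] = 0xe9;
    insn[1] = (uint8_t)(d & 0xff);         /* little-endian rel32 */
    insn[2] = (uint8_t)((d >> 8) & 0xff);
    insn[3] = (uint8_t)((d >> 16) & 0xff);
    insn[4] = (uint8_t)((d >> 24) & 0xff);

    return 0;
}
```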

Depending on compiler settings, there are several functions in Xen that
are smaller (without inter-function padding).

    readelf -sW xen-syms | grep " FUNC " | \
        awk '{ if ($3 < 5) print $3, $4, $5, $8 }'

    ...
    3 FUNC LOCAL wbinvd_ipi
    3 FUNC LOCAL shadow_l1_index
    ...

A compile-time check for, e.g., a minimum alignment of functions or a
runtime check that verifies symbol size (+ padding to next symbols) for
that in the hypervisor is advised.

The tool for generating payloads currently does perform a compile-time
check to ensure that the function to be replaced is large enough.

#### Trampoline, ARM

The unconditional branch instruction with the proper offset is used for an
unconditional branch to the new code (for the encoding see the DDI 0406C.c
and DDI 0487A.j Architecture Reference Manuals).
This means that `old_size` **MUST** be at least four bytes if patching
in trampoline.

The instruction offset is limited on ARM32 to +/- 32MB displacement
and on ARM64 to +/- 128MB displacement.
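
For ARM64 the range check and encoding could be sketched as below. The
function name is hypothetical. The encoding used is the A64 `B` instruction,
whose 26-bit immediate holds the byte offset divided by four, which is where
the +/- 128MB limit comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARM64_B_OPCODE 0x14000000u
#define ARM64_B_IMM26  0x03ffffffu

/*
 * Encode an A64 "B" from old_addr to new_addr.
 * Returns 0 on success, -1 if the offset is misaligned or outside
 * the +/- 128MB range.
 */
static int arm64_make_branch(uint64_t old_addr, uint64_t new_addr,
                             uint32_t *insn)
{
    int64_t off = (int64_t)(new_addr - old_addr);

    assert(insn != NULL);

    if (off & 3)
        return -1; /* instructions are 4-byte aligned */
    if (off < -(INT64_C(1) << 27) || off >= (INT64_C(1) << 27))
        return -1; /* does not fit in imm26 * 4 */

    /* imm26 is bits [27:2] of the two's complement offset. */
    *insn = ARM64_B_OPCODE |
            (uint32_t)(((uint64_t)off >> 2) & ARM64_B_IMM26);

    return 0;
}
```

With Xen code at 2M and new code at 8M, as described in this section, the
offset is 6MB and fits comfortably within the limit.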

The new code is placed in the 8M - 10M virtual address space while the
Xen code is in 2M - 4M. That gives us enough space.

The hypervisor also checks the displacement during loading of the payload.