% libxenctrl (libxc) Domain Image Format
% David Vrabel <<david.vrabel@citrix.com>>
  Andrew Cooper <<andrew.cooper3@citrix.com>>
  Wen Congyang <<wency@cn.fujitsu.com>>
  Yang Hongyang <<hongyang.yang@easystack.cn>>
% Revision 3

Introduction
============

Purpose
-------

The _domain save image_ is the context of a running domain used for
snapshots of a domain or for transferring domains between hosts during
migration.

There are a number of problems with the format of the domain save
image used in Xen 4.4 and earlier (the _legacy format_).

* Dependent on toolstack word size.  A number of fields within the
  image are native types such as `unsigned long` which have different
  sizes between 32-bit and 64-bit toolstacks.  This prevents domains
  from being migrated between hosts running 32-bit and 64-bit
  toolstacks.

* There is no header identifying the image.

* The image has no version information.

A new format that addresses the above is required.

ARM does not yet have a domain save image format specified and
the format described in this specification should be suitable.

Not Yet Included
----------------

The following features are not yet fully specified and will be
included in a future draft.

* Page data compression.

* ARM


Overview
========

The image format consists of two main sections:

* _Headers_
* _Records_

Headers
-------

There are two headers: the _image header_, and the _domain header_.
The image header describes the format of the image (version etc.).
The _domain header_ contains general information about the domain
(architecture, type etc.).

Records
-------

The main part of the format is a sequence of different _records_.
Each record type contains information about the domain context.  At a
minimum there is an END record marking the end of the records section.


Fields
------

All the fields within the headers and records have a fixed width.

Fields are always aligned to their size.

Padding and reserved fields are set to zero on save and must be
ignored during restore.

Integer (numeric) fields in the image header are always in big-endian
byte order.

Integer fields in the domain header and in the records are in the
endianness described in the image header (which will typically be the
native ordering).

\clearpage

Headers
=======

Image Header
------------

The image header identifies an image as a Xen domain save image.  It
includes the version of this specification that the image complies
with.

Tools supporting version _V_ of the specification shall always save
images using version _V_.  Tools shall support restoring from version
_V_.  If the previous Xen release produced version _V_ - 1 images,
tools shall support restoring from these.  Tools may additionally
support restoring from earlier versions.

The marker field can be used to distinguish between legacy images and
those corresponding to this specification.  Legacy images will have
one or more zero bits within the first 8 octets of the image.

Fields within the image header are always in _big-endian_ byte order,
regardless of the setting of the endianness bit.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+
    | marker                                          |
    +-----------------------+-------------------------+
    | id                    | version                 |
    +-----------+-----------+-------------------------+
    | options   | (reserved)                          |
    +-----------+-------------------------------------+


--------------------------------------------------------------------
Field       Description
----------- --------------------------------------------------------
marker      0xFFFFFFFFFFFFFFFF.

id          0x58454E46 ("XENF" in ASCII).

version     0x00000003.  The version of this specification.

options     bit 0: Endianness.  0 = little-endian, 1 = big-endian.

            bit 1-15: Reserved.
--------------------------------------------------------------------

The endianness shall be 0 (little-endian) for images generated on an
i386, x86_64, or arm host.
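
As an illustration only, the image header layout above could be decoded
as sketched below.  The helper name `parse_image_header` and its return
value are assumptions made for this example; they are not part of libxc.

```python
import struct

# Always big-endian: marker (u64), id (u32), version (u32),
# options (u16), followed by 6 reserved octets.
IMAGE_HEADER = struct.Struct(">QIIH6x")
MARKER = 0xFFFFFFFFFFFFFFFF
IMAGE_ID = 0x58454E46  # "XENF" in ASCII


def parse_image_header(buf):
    """Hypothetical helper: return (version, byte_order) for a valid header."""
    marker, ident, version, options = IMAGE_HEADER.unpack_from(buf)
    if marker != MARKER or ident != IMAGE_ID:
        raise ValueError("not a Xen domain save image")
    # Bit 0 of options gives the byte order of the domain header and records.
    # Reserved bits are ignored on restore.
    byte_order = ">" if options & 1 else "<"
    return version, byte_order
```

The returned byte order character can be prepended to `struct` format
strings when decoding the domain header and the records.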

\clearpage

Domain Header
-------------

The domain header includes general properties of the domain.

     0      1     2     3     4     5     6     7 octet
    +-----------------------+-----------+-------------+
    | type                  | page_shift| (reserved)  |
    +-----------------------+-----------+-------------+
    | xen_major             | xen_minor               |
    +-----------------------+-------------------------+

--------------------------------------------------------------------
Field       Description
----------- --------------------------------------------------------
type        0x0000: Reserved.

            0x0001: x86 PV.

            0x0002: x86 HVM.

            0x0003 - 0xFFFFFFFF: Reserved.

page_shift  Size of a guest page as a power of two.

            i.e., page size = 2 ^page_shift^.

xen_major   The Xen major version when this image was saved.

xen_minor   The Xen minor version when this image was saved.
--------------------------------------------------------------------

The legacy stream conversion tool writes a `xen_major` version of 0, and sets
`xen_minor` to the version of itself.
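
A minimal sketch of decoding the domain header follows.  It assumes the
byte order has already been taken from the image header; the function
name and the dictionary it returns are hypothetical.

```python
import struct

DOMAIN_TYPES = {0x0001: "x86 PV", 0x0002: "x86 HVM"}


def parse_domain_header(buf, byte_order="<"):
    """Hypothetical helper: decode the 16-octet domain header."""
    # type (u32), page_shift (u16), 2 reserved octets,
    # xen_major (u32), xen_minor (u32)
    dom_type, page_shift, xen_major, xen_minor = struct.unpack_from(
        byte_order + "IH2xII", buf)
    if dom_type not in DOMAIN_TYPES:
        raise ValueError("reserved domain type 0x%x" % dom_type)
    return {
        "type": DOMAIN_TYPES[dom_type],
        "page_size": 1 << page_shift,  # page size = 2 ^ page_shift
        "xen_version": (xen_major, xen_minor),
    }
```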

\clearpage

Records
=======

A record has a record header, type specific data and a trailing
footer.  If `body_length` is not a multiple of 8, the body is padded
with zeroes to align the end of the record on an 8 octet boundary.

     0     1     2     3     4     5     6     7 octet
    +-----------------------+-------------------------+
    | type                  | body_length             |
    +-----------+-----------+-------------------------+
    | body...                                         |
    ...
    |           | padding (0 to 7 octets)             |
    +-----------+-------------------------------------+

--------------------------------------------------------------------
Field        Description
-----------  -------------------------------------------------------
type         0x00000000: END

             0x00000001: PAGE_DATA

             0x00000002: X86_PV_INFO

             0x00000003: X86_PV_P2M_FRAMES

             0x00000004: X86_PV_VCPU_BASIC

             0x00000005: X86_PV_VCPU_EXTENDED

             0x00000006: X86_PV_VCPU_XSAVE

             0x00000007: SHARED_INFO

             0x00000008: X86_TSC_INFO

             0x00000009: HVM_CONTEXT

             0x0000000A: HVM_PARAMS

             0x0000000B: TOOLSTACK (deprecated)

             0x0000000C: X86_PV_VCPU_MSRS

             0x0000000D: VERIFY

             0x0000000E: CHECKPOINT

             0x0000000F: CHECKPOINT_DIRTY_PFN_LIST (Secondary -> Primary)

             0x00000010: STATIC_DATA_END

             0x00000011: X86_CPUID_POLICY

             0x00000012: X86_MSR_POLICY

             0x00000013 - 0x7FFFFFFF: Reserved for future _mandatory_
             records.

             0x80000000 - 0xFFFFFFFF: Reserved for future _optional_
             records.

body_length  Length in octets of the record body.

body         Content of the record.

padding      0 to 7 octets of zeros to pad the whole record to a multiple
             of 8 octets.
--------------------------------------------------------------------

Records may be _mandatory_ or _optional_.  Optional records have bit
31 set in their type.  Restoring an image that has an unrecognised or
unsupported mandatory record must fail.  The contents of optional
records may be ignored during a restore.
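
The record framing described above (an 8 octet header, the body, then
zero padding up to the next 8 octet boundary) can be walked as in the
sketch below.  The names `iter_records` and `is_optional` are
illustrative, and the whole stream is assumed to be held in memory.

```python
import struct

REC_END = 0x00000000


def iter_records(buf, byte_order="<"):
    """Yield (type, body) for each record in buf, stopping after END."""
    offset = 0
    while True:
        rec_type, body_length = struct.unpack_from(
            byte_order + "II", buf, offset)
        offset += 8
        body = bytes(buf[offset:offset + body_length])
        if len(body) != body_length:
            raise ValueError("truncated record body")
        # Skip the body plus 0 to 7 octets of padding.
        offset += (body_length + 7) & ~7
        yield rec_type, body
        if rec_type == REC_END:
            return


def is_optional(rec_type):
    """Optional records have bit 31 set in their type."""
    return bool(rec_type & 0x80000000)
```

A restorer built on this would fail on any unrecognised type for which
`is_optional` is false, and could skip the others.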

The following sub-sections specify the record body format for each of
the record types.

\clearpage

END
----

An end record marks the end of the image, and shall be the final record
in the stream.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+

The end record contains no fields; its body_length is 0.

\clearpage

PAGE_DATA
---------

The bulk of an image consists of many PAGE_DATA records containing the
memory contents.

     0     1     2     3     4     5     6     7 octet
    +-----------------------+-------------------------+
    | count (C)             | (reserved)              |
    +-----------------------+-------------------------+
    | pfn[0]                                          |
    +-------------------------------------------------+
    ...
    +-------------------------------------------------+
    | pfn[C-1]                                        |
    +-------------------------------------------------+
    | page_data[0]...                                 |
    ...
    +-------------------------------------------------+
    | page_data[N-1]...                               |
    ...
    +-------------------------------------------------+

--------------------------------------------------------------------
Field       Description
----------- --------------------------------------------------------
count       Number of pages described in this record.

pfn         An array of count PFNs and their types.

            Bit 63-60: XEN_DOMCTL_PFINFO_* type (from
            `public/domctl.h` but shifted by 32 bits)

            Bit 59-52: Reserved.

            Bit 51-0: PFN.

page_data   page_size octets of uncompressed page contents for each
            page set as present in the pfn array.
--------------------------------------------------------------------

Note: Count is strictly > 0.  N, the number of page_data entries, is <= C, and
it is possible for there to be no page_data in the record if all pfns are of
types that carry no data.

--------------------------------------------------------------------
PFINFO type    Value      Description
-------------  ---------  ------------------------------------------
NOTAB          0x0        Normal page.

L1TAB          0x1        L1 page table page.

L2TAB          0x2        L2 page table page.

L3TAB          0x3        L3 page table page.

L4TAB          0x4        L4 page table page.

               0x5-0x8    Reserved.

L1TAB_PIN      0x9        L1 page table page (pinned).

L2TAB_PIN      0xA        L2 page table page (pinned).

L3TAB_PIN      0xB        L3 page table page (pinned).

L4TAB_PIN      0xC        L4 page table page (pinned).

BROKEN         0xD        Broken page.

XALLOC         0xE        Allocate only.

XTAB           0xF        Invalid page.
--------------------------------------------------------------------

Table: XEN_DOMCTL_PFINFO_* Page Types.

PFNs with type `BROKEN`, `XALLOC`, or `XTAB` do not have any
corresponding `page_data`.

The saver uses the `XTAB` type for PFNs that become invalid in the
guest's P2M table during a live migration[^2].

Restoring an image with unrecognised page types shall fail.
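
The following sketch shows one way the PAGE_DATA body could be split
into PFN entries and page contents.  The function name, the default
page size of 4096 and the returned tuple shape are assumptions for
illustration.

```python
import struct

NO_DATA_TYPES = {0xD, 0xE, 0xF}        # BROKEN, XALLOC, XTAB
RESERVED_TYPES = {0x5, 0x6, 0x7, 0x8}  # unrecognised: restore must fail
PFN_MASK = (1 << 52) - 1               # bits 51-0


def parse_page_data(body, page_size=4096, byte_order="<"):
    """Hypothetical helper: return a list of (pfn, type, data or None)."""
    (count,) = struct.unpack_from(byte_order + "I4x", body)
    if count == 0:
        raise ValueError("count must be strictly > 0")
    entries = struct.unpack_from("%s%dQ" % (byte_order, count), body, 8)
    offset = 8 + 8 * count
    pages = []
    for entry in entries:
        ptype = entry >> 60            # bits 63-60
        pfn = entry & PFN_MASK
        if ptype in RESERVED_TYPES:
            raise ValueError("unrecognised page type 0x%x" % ptype)
        data = None
        if ptype not in NO_DATA_TYPES:
            data = body[offset:offset + page_size]
            if len(data) != page_size:
                raise ValueError("truncated page data")
            offset += page_size
        pages.append((pfn, ptype, data))
    return pages
```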

[^2]: In the legacy format, this is the list of unmapped PFNs in the
tail.

\clearpage

X86_PV_INFO
-----------

     0     1     2     3     4     5     6     7 octet
    +-----+-----+-----------+-------------------------+
    | w   | ptl | (reserved)                          |
    +-----+-----+-----------+-------------------------+

--------------------------------------------------------------------
Field            Description
-----------      ---------------------------------------------------
guest_width (w)  Guest width in octets (either 4 or 8).

pt_levels (ptl)  Number of page table levels (either 3 or 4).
--------------------------------------------------------------------

\clearpage

X86_PV_P2M_FRAMES
-----------------

     0     1     2     3     4     5     6     7 octet
    +-----+-----+-----+-----+-------------------------+
    | p2m_start_pfn (S)     | p2m_end_pfn (E)         |
    +-----+-----+-----+-----+-------------------------+
    | p2m_pfn[p2m frame containing pfn S]             |
    +-------------------------------------------------+
    ...
    +-------------------------------------------------+
    | p2m_pfn[p2m frame containing pfn E]             |
    +-------------------------------------------------+

--------------------------------------------------------------------
Field            Description
-------------    ---------------------------------------------------
p2m_start_pfn    First pfn index in the p2m_pfn array.

p2m_end_pfn      Last pfn index in the p2m_pfn array.

p2m_pfn          Array of PFNs containing the guest's P2M table, for
                 the PFN frames containing the PFN range S to E
                 (inclusive).

--------------------------------------------------------------------

\clearpage

X86_PV_VCPU_BASIC, EXTENDED, XSAVE, MSRS
----------------------------------------

The format of these records is identical.  They are all binary blobs
of data which are accessed using specific pairs of domctl hypercalls.

     0     1     2     3     4     5     6     7 octet
    +-----------------------+-------------------------+
    | vcpu_id               | (reserved)              |
    +-----------------------+-------------------------+
    | context...                                      |
    ...
    +-------------------------------------------------+

---------------------------------------------------------------------
Field            Description
-----------      ----------------------------------------------------
vcpu_id          The VCPU ID.

context          Binary data for this VCPU.
---------------------------------------------------------------------

---------------------------------------------------------------------
Record type                  Accessor hypercalls
-----------------------      ----------------------------------------
X86_PV_VCPU_BASIC            XEN_DOMCTL_{get,set}vcpucontext

X86_PV_VCPU_EXTENDED         XEN_DOMCTL_{get,set}\_ext_vcpucontext

X86_PV_VCPU_XSAVE            XEN_DOMCTL_{get,set}vcpuextstate

X86_PV_VCPU_MSRS             XEN_DOMCTL_{get,set}\_vcpu_msrs
---------------------------------------------------------------------

\clearpage

SHARED_INFO
-----------

The content of the Shared Info page.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+
    | shared_info                                     |
    ...
    +-------------------------------------------------+

--------------------------------------------------------------------
Field            Description
-----------      ---------------------------------------------------
shared_info      Contents of the shared info page.  This record
                 should be exactly 1 page long.
--------------------------------------------------------------------

\clearpage

X86_TSC_INFO
------------

Domain TSC information, as accessed by the
XEN_DOMCTL_{get,set}tscinfo hypercall sub-ops.

     0     1     2     3     4     5     6     7 octet
    +------------------------+------------------------+
    | mode                   | khz                    |
    +------------------------+------------------------+
    | nsec                                            |
    +------------------------+------------------------+
    | incarnation            | (reserved)             |
    +------------------------+------------------------+

--------------------------------------------------------------------
Field            Description
-----------      ---------------------------------------------------
mode             TSC mode, TSC_MODE_* constant.

khz              TSC frequency, in kHz.

nsec             Elapsed time, in nanoseconds.

incarnation      Incarnation.
--------------------------------------------------------------------
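
For illustration, the fixed-size X86_TSC_INFO body could be decoded as
below.  The function name and the returned dictionary are hypothetical.

```python
import struct


def parse_tsc_info(body, byte_order="<"):
    """Hypothetical helper: decode the 24-octet X86_TSC_INFO body."""
    # mode (u32), khz (u32), nsec (u64), incarnation (u32), 4 reserved octets
    mode, khz, nsec, incarnation = struct.unpack_from(
        byte_order + "IIQI4x", body)
    return {"mode": mode, "khz": khz, "nsec": nsec,
            "incarnation": incarnation}
```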

\clearpage

HVM_CONTEXT
-----------

HVM Domain context, as accessed by the
XEN_DOMCTL_{get,set}hvmcontext hypercall sub-ops.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+
    | hvm_ctx                                         |
    ...
    +-------------------------------------------------+

--------------------------------------------------------------------
Field            Description
-----------      ---------------------------------------------------
hvm_ctx          The HVM Context blob from Xen.
--------------------------------------------------------------------

\clearpage

HVM_PARAMS
----------

HVM Domain parameters, as accessed by the
HVMOP_{get,set}\_param hypercall sub-ops.

     0     1     2     3     4     5     6     7 octet
    +------------------------+------------------------+
    | count (C)              | (reserved)             |
    +------------------------+------------------------+
    | param[0].index                                  |
    +-------------------------------------------------+
    | param[0].value                                  |
    +-------------------------------------------------+
    ...
    +-------------------------------------------------+
    | param[C-1].index                                |
    +-------------------------------------------------+
    | param[C-1].value                                |
    +-------------------------------------------------+

--------------------------------------------------------------------
Field            Description
-----------      ---------------------------------------------------
count            The number of parameters contained in this record.
                 Each parameter in the record contains an index and
                 value.

param index      Parameter index.

param value      Parameter value.
--------------------------------------------------------------------
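
A sketch of decoding the HVM_PARAMS body into (index, value) pairs is
given below.  The function name is an assumption for this example.

```python
import struct


def parse_hvm_params(body, byte_order="<"):
    """Hypothetical helper: return a list of (index, value) pairs."""
    # count (u32) followed by 4 reserved octets
    (count,) = struct.unpack_from(byte_order + "I4x", body)
    # Each parameter is a 64-bit index followed by a 64-bit value.
    flat = struct.unpack_from("%s%dQ" % (byte_order, 2 * count), body, 8)
    return list(zip(flat[0::2], flat[1::2]))
```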

\clearpage

TOOLSTACK (deprecated)
----------------------

> *This record was only present for transitionary purposes during
>  development.  It should not be used.*

An opaque blob provided by and supplied to the higher layers of the
toolstack (e.g., libxl) during save and restore.

     0     1     2     3     4     5     6     7 octet
    +------------------------+------------------------+
    | data                                            |
    ...
    +-------------------------------------------------+

--------------------------------------------------------------------
Field            Description
-----------      ---------------------------------------------------
data             Blob of toolstack-specific data.
--------------------------------------------------------------------

\clearpage

VERIFY
------

A verify record indicates that, while all memory has now been sent, the sender
shall send further memory records for debugging purposes.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+

The verify record contains no fields; its body_length is 0.

\clearpage

CHECKPOINT
----------

A checkpoint record indicates that all the preceding records in the stream
represent a consistent view of VM state.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+

The checkpoint record contains no fields; its body_length is 0.

If the stream is embedded in a higher level toolstack stream, the
CHECKPOINT record marks the end of the libxc portion of the stream
and the stream is handed back to the higher level for further
processing.

The higher level stream may then hand the stream back to libxc to
process another set of records for the next consistent VM state
snapshot.  This next set of records may be terminated by another
CHECKPOINT record or an END record.

\clearpage

CHECKPOINT_DIRTY_PFN_LIST
-------------------------

A checkpoint dirty pfn list record is used to convey information about
dirty memory in the VM.  It is an unordered list of PFNs.  It is currently
only applicable in the backchannel of a checkpointed stream, and is only
used by COLO; for more detail, refer to README.colo.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+
    | pfn[0]                                          |
    +-------------------------------------------------+
    ...
    +-------------------------------------------------+
    | pfn[C-1]                                        |
    +-------------------------------------------------+

The count of pfns is: body_length / sizeof(uint64_t).
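
Since the record carries no explicit count, a reader derives it from
the body length.  A minimal sketch, with a hypothetical function name:

```python
import struct


def parse_dirty_pfn_list(body, byte_order="<"):
    """Hypothetical helper: return the PFNs in the record body."""
    if len(body) % 8:
        raise ValueError("body length is not a multiple of 8")
    count = len(body) // 8  # one uint64_t per PFN
    return list(struct.unpack("%s%dQ" % (byte_order, count), body))
```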

\clearpage

STATIC_DATA_END
---------------

A static data end record marks the end of the static state, i.e. state which
is invariant of guest execution.


     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+

The static data end record contains no fields; its body_length is 0.

\clearpage

X86_CPUID_POLICY
----------------

CPUID policy content, as accessed by the XEN_DOMCTL_{get,set}_cpu_policy
hypercall sub-ops.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+
    | CPUID_policy                                    |
    ...
    +-------------------------------------------------+

--------------------------------------------------------------------
Field            Description
------------     ---------------------------------------------------
CPUID_policy     Array of xen_cpuid_leaf_t[]'s
--------------------------------------------------------------------

\clearpage

X86_MSR_POLICY
--------------

MSR policy content, as accessed by the XEN_DOMCTL_{get,set}_cpu_policy
hypercall sub-ops.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+
    | MSR_policy                                      |
    ...
    +-------------------------------------------------+

--------------------------------------------------------------------
Field            Description
----------       ---------------------------------------------------
MSR_policy       Array of xen_msr_entry_t[]'s
--------------------------------------------------------------------

\clearpage


Layout
======

The set of valid records depends on the guest architecture and type.  No
assumptions should be made about the ordering or interleaving of
independent records.  Record dependencies are noted below.

Some records are used for signalling, and explicitly have zero length.  All
other records contain data relevant to the migration.  Data records with no
content should be elided on the source side, as their presence serves no
purpose, but results in extra work for the restore side.

x86 PV Guest
------------

A typical save record for an x86 PV guest image would look like:

* Image header
* Domain header
* Static data records:
    * X86_PV_INFO record
    * X86_{CPUID,MSR}_POLICY
    * STATIC_DATA_END
* X86_PV_P2M_FRAMES record
* Many PAGE_DATA records
* X86_TSC_INFO
* SHARED_INFO record
* VCPU context records for each online VCPU
    * X86_PV_VCPU_BASIC record
    * X86_PV_VCPU_EXTENDED record
    * X86_PV_VCPU_XSAVE record
    * X86_PV_VCPU_MSRS record
* END record

There are some strict ordering requirements.  The following records must
be present in the following order as each of them depends on information
present in the preceding ones.

* X86_PV_INFO record
* X86_PV_P2M_FRAMES record
* PAGE_DATA records
* VCPU records
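
One possible reading of this requirement is that a record from a later
group must not appear before at least one record of the preceding group
has been seen.  The sketch below checks a sequence of record types
under that assumption; the function name and the rule as encoded here
are illustrative, not normative.

```python
REC_PAGE_DATA = 0x01
REC_X86_PV_INFO = 0x02
REC_X86_PV_P2M_FRAMES = 0x03
VCPU_RECORDS = {0x04, 0x05, 0x06, 0x0C}  # BASIC, EXTENDED, XSAVE, MSRS


def _stage(rec_type):
    """Map a record type to its position in the dependency chain."""
    if rec_type == REC_X86_PV_INFO:
        return 0
    if rec_type == REC_X86_PV_P2M_FRAMES:
        return 1
    if rec_type == REC_PAGE_DATA:
        return 2
    if rec_type in VCPU_RECORDS:
        return 3
    return None  # record is not part of the chain


def check_pv_order(record_types):
    """Raise ValueError if a record precedes the group it depends on."""
    seen = set()
    for rec_type in record_types:
        stage = _stage(rec_type)
        if stage is None:
            continue
        if stage > 0 and (stage - 1) not in seen:
            raise ValueError(
                "record type 0x%x seen before its prerequisite" % rec_type)
        seen.add(stage)
```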

x86 HVM Guest
-------------

A typical save record for an x86 HVM guest image would look like:

* Image header
* Domain header
* Static data records:
    * X86_{CPUID,MSR}_POLICY
    * STATIC_DATA_END
* Many PAGE_DATA records
* X86_TSC_INFO
* HVM_PARAMS
* HVM_CONTEXT
* END record

HVM_PARAMS must precede HVM_CONTEXT, as certain parameters can affect
the validity of architectural state in the context.

Compatibility with older versions
=================================

v3 compat with v2
-----------------

A v3 stream is compatible with a v2 stream, but mandates the presence of a
STATIC_DATA_END record ahead of any memory/register content.  This is to ease
the introduction of new static configuration records over time.

A v3-compatible receiver interpreting a v2 stream should infer the position of
STATIC_DATA_END based on finding the first X86_PV_P2M_FRAMES record (for PV
guests), or PAGE_DATA record (for HVM guests) and behave as if STATIC_DATA_END
had been sent.
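
The inference can be sketched as a filter over decoded records that
injects a synthetic STATIC_DATA_END when reading a v2 stream.  The
generator name and its `(type, body)` tuple convention are assumptions
for this example.

```python
REC_PAGE_DATA = 0x01
REC_X86_PV_P2M_FRAMES = 0x03
REC_STATIC_DATA_END = 0x10
DOM_X86_PV = 0x0001


def with_static_data_end(records, version, dom_type):
    """Yield records, inserting STATIC_DATA_END into pre-v3 streams."""
    # PV guests: first X86_PV_P2M_FRAMES; HVM guests: first PAGE_DATA.
    if dom_type == DOM_X86_PV:
        trigger = REC_X86_PV_P2M_FRAMES
    else:
        trigger = REC_PAGE_DATA
    inferred = version >= 3  # v3 streams carry the record explicitly
    for rec_type, body in records:
        if not inferred and rec_type == trigger:
            yield REC_STATIC_DATA_END, b""
            inferred = True
        yield rec_type, body
```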

Legacy Images (x86 only)
------------------------

Restoring legacy images from older tools shall be handled by
translating the legacy format image into this new format.

It shall not be possible to save in the legacy format.

There are two different legacy images depending on whether they were
generated by a 32-bit or a 64-bit toolstack. These shall be
distinguished by inspecting octets 4-7 in the image.  If these are
zero then it is a 64-bit image.

Toolstack  Field                            Value
---------  -----                            -----
64-bit     Bit 32-63 of the p2m_size field  0 (since p2m_size < 2^32^)
32-bit     extended-info chunk ID (PV)      0xFFFFFFFF
32-bit     Chunk type (HVM)                 < 0
32-bit     Page count (HVM)                 > 0

Table: Possible values for octet 4-7 in legacy images

This assumes the presence of the extended-info chunk which was
introduced in Xen 3.0.
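
Combining the marker field with the octet 4-7 test, a stream could be
classified from its first 8 octets as sketched below.  The function
name and the returned labels are hypothetical.

```python
def classify_image(first8):
    """Hypothetical helper: classify a stream from its first 8 octets."""
    if len(first8) < 8:
        raise ValueError("need at least 8 octets")
    head = bytes(first8[:8])
    # The marker of this specification is all ones; legacy images
    # have one or more zero bits here.
    if head == b"\xff" * 8:
        return "current"
    # In a legacy image, zero in octets 4-7 means a 64-bit toolstack.
    if head[4:8] == b"\x00" * 4:
        return "legacy-64"
    return "legacy-32"
```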


Future Extensions
=================

All changes to this specification should bump the revision number in
the title block.

All changes to the image or domain headers require the image version
to be increased.

The format may be extended by adding additional record types.

Extending an existing record type must be done by adding a new record
type.  This allows old images with the old record to still be
restored.

The image header may only be extended by _appending_ additional
fields.  In particular, the `marker`, `id` and `version` fields must
never change size or location.


Errata
======

1. For compatibility with older code, the receiving side of a stream should
   tolerate and ignore variable sized records with zero content.  Xen releases
   between 4.6 and 4.8 could end up generating valid HVM_PARAMS or
   X86_PV_VCPU_{EXTENDED,XSAVE,MSRS} records with zero-length content.