docs/man/xen-tscmode.7.pod


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284

=head1 NAME

xen-tscmode - Xen TSC (time stamp counter) and timekeeping discussion

=head1 OVERVIEW

As of Xen 4.0, a new config option called tsc_mode may be specified
for each domain.  The default for tsc_mode handles the vast majority
of hardware and software environments.  This document is targeted
for Xen users and administrators that may need to select a non-default
tsc_mode.

Proper selection of tsc_mode depends on an understanding not only of
the guest operating system (OS), but also of the application set that will
ever run on this guest OS.  This is because tsc_mode applies
equally to both the OS and ALL apps that are running on this
domain, now or in the future.

Key questions to be answered for the OS and/or each application are:

=over 4

=item *

Does the OS/app use the rdtsc instruction at all?
(We will explain below how to determine this.)

=item *

At what frequency is the rdtsc instruction executed by either the OS
or any running apps?  If the sum exceeds about 10,000 rdtsc instructions
per second per processor, we call this a "high-TSC-frequency"
OS/app/environment.  (This is relatively rare, and developers of OS's
and apps that are high-TSC-frequency are usually aware of it.)

=item *

If the OS/app does use rdtsc, will it behave incorrectly if "time goes
backwards" or if the frequency of the TSC suddenly changes?  If so,
we call this a "TSC-sensitive" app or OS; otherwise it is "TSC-resilient".

=back

This last is the US$64,000 question as it may be very difficult
(or, for legacy apps, even impossible) to predict all possible
failure cases.  As a result, unless proven otherwise, any app
that uses rdtsc must be assumed to be TSC-sensitive and, as we
will see, this is the default starting in Xen 4.0.

Xen's new tsc_mode parameter determines the circumstances under which
the family of rdtsc instructions are executed "natively" vs emulated.
Roughly speaking, native means rdtsc is fast but TSC-sensitive apps
may, under unpredictable circumstances, run incorrectly; emulated means
there is some performance degradation (unobservable in most cases),
but TSC-sensitive apps will always run correctly.  Prior to Xen 4.0,
all rdtsc instructions were native: "fast but potentially incorrect."
Starting at Xen 4.0, the default is that all rdtsc instructions are
"correct but potentially slow".  The tsc_mode parameter in 4.0 provides
an intelligent default but allows system administrator's to adjust
how rdtsc instructions are executed differently for different domains.

The non-default choices for tsc_mode are:

=over 4

=item * B<tsc_mode=1> (always emulate).

All rdtsc instructions are emulated; this is the best choice when
TSC-sensitive apps are running and it is necessary to understand
worst-case performance degradation for a specific hardware environment.

=item * B<tsc_mode=2> (never emulate).

This is the same as prior to Xen 4.0 and is the best choice if it
is certain that all apps running in this VM are TSC-resilient and
highest performance is required.

=item * B<tsc_mode=3> (PVRDTSCP).

This mode has been removed.

=back

If tsc_mode is left unspecified (or set to B<tsc_mode=0>), a hybrid
algorithm is utilized to ensure correctness while providing the
best performance possible given:

=over 4

=item *

the requirement of correctness,

=item *

the underlying hardware, and

=item *

whether or not the VM has been saved/restored/migrated

=back

To understand this in more detail, the rest of this document must
be read.

=head1 DETERMINING RDTSC FREQUENCY

To determine the frequency of rdtsc instructions that are emulated,
an "xl" command can be used by a privileged user of domain0.  The
command:

    # xl debug-key s; xl dmesg | tail

provides information about TSC usage in each domain where TSC
emulation is currently enabled.

=head1 TSC HISTORY

To understand tsc_mode completely, some background on TSC is required:

The x86 "timestamp counter", or TSC, is a 64-bit register on each
processor that increases monotonically.  Historically, TSC incremented
every processor cycle, but on recent processors, it increases
at a constant rate even if the processor changes frequency (for example,
to reduce processor power usage).  TSC is known by x86 programmers
as the fastest, highest-precision measurement of the passage of time
so it is often used as a foundation for performance monitoring.
And since it is guaranteed to be monotonically increasing and, at
64 bits, is guaranteed to not wraparound within 10 years, it is
sometimes used as a random number or a unique sequence identifier,
such as to stamp transactions so they can be replayed in a specific
order.

On most older SMP and early multi-core machines, TSC was not synchronized
between processors.  Thus if an application were to read the TSC on
one processor, then was moved by the OS to another processor, then read
TSC again, it might appear that "time went backwards".  This loss of
monotonicity resulted in many obscure application bugs when TSC-sensitive
apps were ported from a uniprocessor to an SMP environment; as a result,
many applications -- especially in the Windows world -- removed their
dependency on TSC and replaced their timestamp needs with OS-specific
functions, losing both performance and precision. On some more recent
generations of multi-core machines, especially multi-socket multi-core
machines, the TSC was synchronized but if one processor were to enter
certain low-power states, its TSC would stop, destroying the synchrony
and again causing obscure bugs.  This reinforced decisions to avoid use
of TSC altogether.  On the most recent generations of multi-core
machines, however, synchronization is provided across all processors
in all power states, even on multi-socket machines, and provide a
flag that indicates that TSC is synchronized and "invariant".  Thus
TSC is once again useful for applications, and even newer operating
systems are using and depending upon TSC for critical timekeeping
tasks when running on these recent machines.

We will refer to hardware that ensures TSC is both synchronized and
invariant as "TSC-safe" and any hardware on which TSC is not (or
may not remain) synchronized as "TSC-unsafe".

As a result of TSC's sordid history, two classes of applications use
TSC: old applications designed for single processors, and the most recent
enterprise applications which require high-frequency high-precision
timestamping.

We will refer to apps that might break if running on a TSC-unsafe
machine as "TSC-sensitive"; apps that don't use TSC, or do use
TSC but use it in a way that monotonicity and frequency invariance
are unimportant as "TSC-resilient".

The emergence of virtualization once again complicates the usage of
TSC.  When features such as save/restore or live migration are employed,
a guest OS and all its currently running applications may be invisibly
transported to an entirely different physical machine.  While TSC
may be "safe" on one machine, it is essentially impossible to precisely
synchronize TSC across a data center or even a pool of machines.  As
a result, when run in a virtualized environment, rare and obscure
"time going backwards" problems might once again occur for those
TSC-sensitive applications.  Worse, if a guest OS moves from, for
example, a 3GHz
machine to a 1.5GHz machine, attempts by an OS/app to measure time
intervals with TSC may without notice be incorrect by a factor of two.

The rdtsc (read timestamp counter) instruction is used to read the
TSC register.  The rdtscp instruction is a variant of rdtsc on recent
processors.  We refer to these together as the rdtsc family of instructions,
or just "rdtsc".  Instructions in the rdtsc family are non-privileged, but
privileged software may set a cpuid bit to cause all rdtsc family
instructions to trap.  This trap can be detected by Xen, which can
then transparently "emulate" the results of the rdtsc instruction and
return control to the code following the rdtsc instruction.

To provide a "safe" TSC, i.e. to ensure both TSC monotonicity and a
fixed rate, Xen provides rdtsc emulation whenever necessary or when
explicitly specified by a per-VM configuration option.  TSC emulation is
relatively slow -- roughly 15-20 times slower than the rdtsc instruction
when executed natively.  However, except when an OS or application uses
the rdtsc instruction at a high frequency (e.g. more than about 10,000 times
per second per processor), this performance degradation is not noticeable
(i.e. <0.3%).  And, TSC emulation is nearly always faster than
OS-provided alternatives (e.g. Linux's gettimeofday).  For environments
where it is certain that all apps are TSC-resilient (e.g.
"TSC-safeness" is not necessary) and highest performance is a
requirement, TSC emulation may be entirely disabled (tsc_mode==2).

The default mode (tsc_mode==0) checks TSC-safeness of the underlying
hardware on which the virtual machine is launched.  If it is
TSC-safe, rdtsc will execute at hardware speed; if it is not, rdtsc
will be emulated.  Once a virtual machine is save/restored or migrated,
however, there are two possibilities: TSC remains native IF the source
physical machine and target physical machine have the same TSC frequency
(or, for HVM/PVH guests, if TSC scaling support is available); else TSC
is emulated.  Note that, though emulated, the "apparent" TSC frequency
will be the TSC frequency of the initial physical machine, even after
migration.

Finally, tsc_mode==1 always enables TSC emulation, regardless of
the underlying physical hardware. The "apparent" TSC frequency will
be the TSC frequency of the initial physical machine, even after migration.
This mode is useful to measure any performance degradation that
might be encountered by a tsc_mode==0 domain after migration occurs,
or a tsc_mode==3 domain when it is running on TSC-unsafe hardware.

Note that while Xen ensures that an emulated TSC is "safe" across migration,
it does not ensure that it continues to tick at the same rate during
the actual migration.  As an oversimplified example, if TSC is ticking
once per second in a guest, and the guest is saved when the TSC is 1000,
then restored 30 seconds later, TSC is only guaranteed to be greater
than or equal to 1001, not precisely 1030.  This has some OS implications
as will be seen in the next section.

=head1 TSC INVARIANT BIT and NO_MIGRATE

Related to TSC emulation, the "TSC Invariant" bit is architecturally defined
in a cpuid bit on the most recent x86 processors.  If set, TSC invariance
ensures that the TSC is "safe", that is it will increment at a constant rate
regardless of power events, will be synchronized across all processors, and
was properly initialized to zero on all processors at boot-time
by system hardware/BIOS.  As long as system software never writes to TSC,
TSC will be safe and continuously incremented at a fixed rate and thus
can be used as a system "clocksource".

This bit is used by some OS's, and specifically by Linux starting with
version 2.6.30(?), to select TSC as a system clocksource.  Once selected,
TSC remains the Linux system clocksource unless manually overridden.  In
a virtualized environment, since it is not possible to synchronize TSC
across all the machines in a pool or data center, a migration may "break"
TSC as a usable clocksource; while time will not go backwards, it may
not track wallclock time well enough to avoid certain time-sensitive
consequences.  As a result, Xen can only expose the TSC Invariant bit
to a guest OS if it is certain that the domain will never migrate.
As of Xen 4.0, the "no_migrate=1" VM configuration option may be specified
to disable migration.  If no_migrate is selected and the VM is running
on a physical machine with "TSC Invariant", Linux 2.6.30+ will safely
use TSC as the system clocksource.  But, attempts to migrate or, once
saved, restore this domain will fail.

There is another cpuid-related complication: The x86 cpuid instruction is
non-privileged.  HVM domains are configured to always trap this instruction
to Xen, where Xen can "filter" the result.  In a PV OS, all cpuid instructions
have been replaced by a paravirtualized equivalent of the cpuid instruction
("pvcpuid") and also trap to Xen.  But apps in a PV guest that use a
cpuid instruction execute it directly, without a trap to Xen.  As a result,
an app may directly examine the physical TSC Invariant cpuid bit and make
decisions based on that bit.

=head1 HARDWARE TSC SCALING

Intel VMX TSC scaling and AMD SVM TSC ratio allow the guest TSC read
by guest rdtsc/p increasing in a different frequency than the host
TSC frequency.

If a HVM container in default TSC mode (tsc_mode=0) is created on a host
that provides constant TSC, its guest TSC frequency will be the same as
the host. If it is later migrated to another host that provides constant
TSC and supports Intel VMX TSC scaling/AMD SVM TSC ratio, its guest TSC
frequency will be the same before and after migration.

For above HVM container in default TSC mode (tsc_mode=0), if above
hosts support rdtscp, both guest rdtsc and rdtscp instructions will be
executed natively before and after migration.

=head1 AUTHORS

Dan Magenheimer <dan.magenheimer@oracle.com>