# Xen transport for 9pfs version 1
## Background
9pfs is a network filesystem protocol developed for Plan 9. 9pfs is very
simple and describes a series of commands and responses. It is
completely independent of the communication channel; in fact, many
clients and servers support multiple channels, usually called
"transports". For example, the Linux client supports tcp and unix
sockets, fds, virtio and rdma.
### 9pfs protocol
This document won't cover the full 9pfs specification. Please refer to
this [paper] and this [website] for a detailed description of it.
However, it is useful to know that each 9pfs request and response has
the following header:

    struct header {
        uint32_t size;
        uint8_t id;
        uint16_t tag;
    } __attribute__((packed));

    0         4  5    7
    +---------+--+----+
    |  size   |id| tag|
    +---------+--+----+

- *size*
  The size of the request or response.

- *id*
  The 9pfs request or response operation.

- *tag*
  Unique id that identifies a specific request/response pair. It is used
  to multiplex operations on a single channel.

It is possible to have multiple requests in-flight at any given time.
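
To make the header concrete, below is a small sketch of how these three
fields could be serialized at the front of a message buffer. 9pfs encodes
its integers little-endian; the helper name and the buffer handling are
illustrative only and are not part of the transport specification.

    #include <stdint.h>

    /* Illustrative only: write the common 9pfs header (size, id, tag) into
     * the first 7 bytes of a message buffer, little-endian as 9pfs requires. */
    static void put_9p_header(uint8_t *buf, uint32_t size, uint8_t id,
                              uint16_t tag)
    {
        buf[0] = size & 0xff;          /* size: total message length,     */
        buf[1] = (size >> 8) & 0xff;   /* including these 7 header bytes  */
        buf[2] = (size >> 16) & 0xff;
        buf[3] = (size >> 24) & 0xff;
        buf[4] = id;                   /* operation code                  */
        buf[5] = tag & 0xff;           /* tag, used to match the response */
        buf[6] = (tag >> 8) & 0xff;
    }
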
## Rationale
This document describes a Xen-based transport for 9pfs, in the
traditional PV frontend and backend format. The PV frontend is used by
the client to send commands to the server. The PV backend is used by the
9pfs server to receive commands from clients and send back responses.

The transport protocol supports multiple rings, up to the maximum
supported by the backend. The size of every ring is also configurable
and can span multiple pages, up to the maximum supported by the backend
(although it cannot be more than 2MB). The design is meant to exploit
parallelism at the vCPU level and to support multiple outstanding
requests simultaneously.
This document does not cover the 9pfs client/server design or
implementation, only the transport for it.
## Configuration
The frontend and backend are configured via Xenstore. See [header] for
the detailed Xenstore entries and the connection protocol.
## Ring Setup
The shared page has the following layout:

    typedef uint32_t XEN_9PFS_RING_IDX;

    struct xen_9pfs_intf {
        XEN_9PFS_RING_IDX in_cons, in_prod;
        uint8_t pad1[56];
        XEN_9PFS_RING_IDX out_cons, out_prod;
        uint8_t pad2[56];

        uint32_t ring_order;
        /* this is an array of (1 << ring_order) elements */
        grant_ref_t ref[1];
    };

    /* not actually C compliant (ring_order changes from ring to ring) */
    struct ring_data {
        char in[((1 << ring_order) << PAGE_SHIFT) / 2];
        char out[((1 << ring_order) << PAGE_SHIFT) / 2];
    };

- **ring_order**
  It represents the order of the data ring. The following list of grant
  references is of `(1 << ring_order)` elements. It cannot be greater than
  **max-ring-page-order**, as specified by the backend on XenBus.

- **ref[]**
  The list of grant references which will contain the actual data. They are
  mapped contiguously in virtual memory. The first half of the pages is the
  **in** array, the second half is the **out** array. The array must
  have a power of two number of elements.

- **out** is an array used as a circular buffer
  It contains client requests. The producer is the frontend, the
  consumer is the backend.

- **in** is an array used as a circular buffer
  It contains server responses. The producer is the backend, the
  consumer is the frontend.

- **out_cons**, **out_prod**
  Consumer and producer indices for client requests. They keep track of
  how much data has been written by the frontend to **out** and how much
  data has already been consumed by the backend. **out_prod** is
  increased by the frontend, after writing data to **out**. **out_cons**
  is increased by the backend, after reading data from **out**.

- **in_cons** and **in_prod**
  Consumer and producer indices for server responses. They keep track of
  how much data has been written by the backend to **in** and how much
  data has already been consumed by the frontend. **in_prod** is
  increased by the backend, after writing data to **in**. **in_cons** is
  increased by the frontend, after reading data from **in**.

The binary layout of `struct xen_9pfs_intf` follows:

    0         4         8            64        68        72          128       132
    +---------+---------+-----//-----+---------+---------+----//-----+---------+
    | in_cons | in_prod |  padding   |out_cons |out_prod |  padding  |ring_orde|
    +---------+---------+-----//-----+---------+---------+----//-----+---------+

    132       136       140       4092      4096
    +---------+---------+----//---+---------+
    |  ref[0] |  ref[1] |         |  ref[N] |
    +---------+---------+----//---+---------+

**N.B.** For one page, N can be at most 991 ((4096-132)/4); but given that
N needs to be a power of two, the actual maximum is 512. As 512 == (1 << 9),
the maximum possible max-ring-page-order value is 9. With 4KB pages, order 9
corresponds to 512 pages, i.e. the 2MB limit mentioned in the Rationale.

The binary layout of the ring buffers follows:

    0                  ((1<<ring_order)<<PAGE_SHIFT)/2    ((1<<ring_order)<<PAGE_SHIFT)
    +------------//-------------+------------//-------------+
    |            in             |            out            |
    +------------//-------------+------------//-------------+
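
The following is a rough sketch, not part of the specification, of how an
implementation might split the contiguously mapped data pages into the two
halves described above. The structure and function names are made up for
illustration, and 4KB pages (PAGE_SHIFT == 12) are assumed.

    /* Illustrative helper, not part of the protocol: split the contiguously
     * mapped data area into its "in" and "out" halves. The size of one half
     * is what the access functions below call XEN_9PFS_RING_SIZE. */
    struct xen_9pfs_ring {
        char *in;           /* backend -> frontend */
        char *out;          /* frontend -> backend */
        size_t ring_size;   /* bytes per direction */
    };

    static void xen_9pfs_map_ring(struct xen_9pfs_ring *ring, void *vaddr,
                                  uint32_t ring_order)
    {
        size_t bytes = (size_t)1 << (ring_order + PAGE_SHIFT);

        ring->ring_size = bytes / 2;
        ring->in  = (char *)vaddr;                      /* first half  */
        ring->out = (char *)vaddr + ring->ring_size;    /* second half */
    }
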
## Why ring.h is not needed
Many Xen PV protocols use the macros provided by [ring.h] to manage
their shared ring for communication. This protocol does not, because it
actually comes with two rings: the **in** ring and the **out** ring.
Each of them is unidirectional, and there is no static request size:
the producer writes opaque data to the ring. In [ring.h], on the other
hand, the two directions are combined in a single ring, and the request
size is static and well-known. In this protocol:

    in  -> backend to frontend only
    out -> frontend to backend only

In the case of the **in** ring, the frontend is the consumer, and the
backend is the producer. Everything is the same but mirrored for the
**out** ring.

The producer, the backend in this case, never reads from the **in**
ring. In fact, the producer doesn't need any notifications unless the
ring is full. This version of the protocol doesn't take advantage of
this, leaving room for optimizations.

On the other hand, the consumer always requires notifications, unless it
is already actively reading from the ring. The producer can figure this
out, without any additional fields in the protocol, by comparing the
indices at the beginning and at the end of its write function. This is
similar to what [ring.h] does.
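
As a purely illustrative sketch of that comparison (not mandated by the
protocol, and ignoring the wakeup races a production implementation would
have to consider), the producer could sample the consumer index right
before and right after its write and skip the event only if the consumer
made progress in the meantime; always notifying remains the safe default.

    /* Hypothetical helper: decide whether to send an event after producing
     * data. If the consumer index did not move while we were writing, the
     * consumer may be idle and needs a kick; if it moved, it is actively
     * reading and will see the new data on its own. Always notifying is
     * also correct, just less efficient. */
    static inline int xen_9pfs_notify_needed(XEN_9PFS_RING_IDX cons_before,
                                             XEN_9PFS_RING_IDX cons_after)
    {
        return cons_before == cons_after;
    }
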
## Ring Usage
The **in** and **out** arrays are used as circular buffers:

    0                                   sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
    +-----------------------------------+
    |to consume|    free    |to consume |
    +-----------------------------------+
               ^            ^
               prod         cons

    0                                   sizeof(array)
    +-----------------------------------+
    |  free   | to consume |   free     |
    +-----------------------------------+
              ^            ^
              cons         prod

The following functions are provided to read and write to an array:

    #define MASK_XEN_9PFS_IDX(idx) ((idx) & (XEN_9PFS_RING_SIZE - 1))

    /* Copy len bytes out of the ring at the (already masked) consumer index,
     * handling the wrap-around at the end of the circular buffer. */
    static inline void xen_9pfs_read(char *buf,
            XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
            uint8_t *h, size_t len) {
        if (*masked_cons < *masked_prod) {
            memcpy(h, buf + *masked_cons, len);
        } else {
            if (len > XEN_9PFS_RING_SIZE - *masked_cons) {
                memcpy(h, buf + *masked_cons, XEN_9PFS_RING_SIZE - *masked_cons);
                memcpy((char *)h + XEN_9PFS_RING_SIZE - *masked_cons, buf,
                       len - (XEN_9PFS_RING_SIZE - *masked_cons));
            } else {
                memcpy(h, buf + *masked_cons, len);
            }
        }
        *masked_cons = MASK_XEN_9PFS_IDX(*masked_cons + len);
    }

    /* Copy len bytes into the ring at the (already masked) producer index,
     * handling the wrap-around at the end of the circular buffer. */
    static inline void xen_9pfs_write(char *buf,
            XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
            uint8_t *opaque, size_t len) {
        if (*masked_prod < *masked_cons) {
            memcpy(buf + *masked_prod, opaque, len);
        } else {
            if (len > XEN_9PFS_RING_SIZE - *masked_prod) {
                memcpy(buf + *masked_prod, opaque, XEN_9PFS_RING_SIZE - *masked_prod);
                memcpy(buf, opaque + (XEN_9PFS_RING_SIZE - *masked_prod),
                       len - (XEN_9PFS_RING_SIZE - *masked_prod));
            } else {
                memcpy(buf + *masked_prod, opaque, len);
            }
        }
        *masked_prod = MASK_XEN_9PFS_IDX(*masked_prod + len);
    }

The producer (the backend for **in**, the frontend for **out**) writes to the
array in the following way:
- read *cons*, *prod* from shared memory
- general memory barrier
- verify *prod* against local copy (consumer shouldn't change it)
- write to array at position *prod* up to *cons*, wrapping around the circular
buffer when necessary
- write memory barrier
- increase *prod*
- notify the other end via event channel
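
Putting the steps above together, here is a minimal, non-normative sketch
of the frontend producing a request on the **out** ring. The function name
is made up; xen_mb()/xen_wmb() are the environment-provided barriers that
[ring.h] also relies on, and the event-channel notification is left as a
comment because it is environment specific.

    /* Sketch only: frontend produces "len" bytes of request data on "out".
     * Returns -1 if there is not enough free space (caller retries later). */
    static int xen_9pfs_produce(struct xen_9pfs_intf *intf, char *out,
                                uint8_t *req, size_t len)
    {
        XEN_9PFS_RING_IDX cons, prod, masked_cons, masked_prod;

        /* read cons, prod from shared memory */
        cons = intf->out_cons;
        prod = intf->out_prod;
        /* general memory barrier; a robust implementation also sanity
         * checks the indices here, as described above */
        xen_mb();

        /* write only as many bytes as are free up to cons */
        if (XEN_9PFS_RING_SIZE - (prod - cons) < len)
            return -1;

        masked_prod = MASK_XEN_9PFS_IDX(prod);
        masked_cons = MASK_XEN_9PFS_IDX(cons);

        /* write to the array at position prod, wrapping if necessary */
        xen_9pfs_write(out, &masked_prod, &masked_cons, req, len);

        /* write memory barrier, then publish the new producer index */
        xen_wmb();
        intf->out_prod = prod + len;

        /* finally: notify the other end via event channel
         * (environment specific, omitted here) */
        return 0;
    }
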
The consumer (the backend for **out**, the frontend for **in**) reads from the
array in the following way:
- read *prod*, *cons* from shared memory
- read memory barrier
- verify *cons* against local copy (producer shouldn't change it)
- read from array at position *cons* up to *prod*, wrapping around the circular
buffer when necessary
- general memory barrier
- increase *cons*
- notify the other end via event channel
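
Similarly, a minimal, non-normative sketch of the backend consuming request
bytes from the **out** ring; the function name is made up, and
xen_rmb()/xen_mb() are again the environment-provided barriers.

    /* Sketch only: backend consumes "len" bytes of request data from "out".
     * Returns -1 if the data is not fully available yet (caller retries
     * after the next notification). */
    static int xen_9pfs_consume(struct xen_9pfs_intf *intf, char *out,
                                uint8_t *buf, size_t len)
    {
        XEN_9PFS_RING_IDX cons, prod, masked_cons, masked_prod;

        /* read prod, cons from shared memory */
        cons = intf->out_cons;
        prod = intf->out_prod;
        /* read memory barrier before looking at the data */
        xen_rmb();

        /* read only as many bytes as are available up to prod */
        if (prod - cons < len)
            return -1;

        masked_prod = MASK_XEN_9PFS_IDX(prod);
        masked_cons = MASK_XEN_9PFS_IDX(cons);

        /* read from the array at position cons, wrapping if necessary */
        xen_9pfs_read(out, &masked_prod, &masked_cons, buf, len);

        /* general memory barrier, then publish the new consumer index */
        xen_mb();
        intf->out_cons = cons + len;

        /* finally: notify the other end via event channel
         * (environment specific, omitted here) */
        return 0;
    }
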
The producer takes care of writing only as many bytes as will fit in the
free space of the buffer, up to *cons*. The consumer takes care of reading
only as many bytes as are available in the buffer, up to *prod*.
## Request/Response Workflow
The client chooses one of the available rings, then sends a request
to the other end on the *out* array, following the producer workflow
described in [Ring Usage].
The server receives the notification and reads the request, following
the consumer workflow described in [Ring Usage]. The server knows how
much to read because it is specified in the *size* field of the 9pfs
header. The server processes the request and sends back a response on
the *in* array of the same ring, following the producer workflow as
usual. Thus, every request/response pair is on one ring.
The client receives a notification and reads the response from the *in*
array. The client knows how much data to read because it is specified in
the *size* field of the 9pfs header.
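
To illustrate how the server side can tie the two together, here is a
rough, non-normative sketch of reading one request: the backend first
consumes the fixed 7-byte 9pfs header to learn the total message size,
then consumes the rest. xen_9pfs_consume() is the hypothetical helper
sketched in [Ring Usage]; error handling and waiting for partially
written data are omitted.

    /* Sketch only: backend reads one complete 9pfs request from "out". */
    static int xen_9pfs_read_request(struct xen_9pfs_intf *intf, char *out,
                                     uint8_t *req, size_t max)
    {
        uint32_t size;

        /* the header (size, id, tag) is always 7 bytes */
        if (xen_9pfs_consume(intf, out, req, 7) < 0)
            return -1;

        /* size is little-endian and covers the whole message, header included */
        size = req[0] | (req[1] << 8) | (req[2] << 16) |
               ((uint32_t)req[3] << 24);
        if (size < 7 || size > max)
            return -1;

        /* consume the remaining size - 7 bytes of the request */
        return xen_9pfs_consume(intf, out, req + 7, size - 7);
    }
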
[paper]: https://www.usenix.org/legacy/event/usenix05/tech/freenix/full_papers/hensbergen/hensbergen.pdf
[website]: https://github.com/chaos/diod/blob/master/protocol.md
[header]: https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/9pfs.h;hb=HEAD
[ring.h]: https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/ring.h;hb=HEAD