summaryrefslogtreecommitdiff
path: root/buckets/doc_page_io.txt
diff options
context:
space:
mode:
Diffstat (limited to 'buckets/doc_page_io.txt')
-rw-r--r--buckets/doc_page_io.txt166
1 files changed, 0 insertions, 166 deletions
diff --git a/buckets/doc_page_io.txt b/buckets/doc_page_io.txt
deleted file mode 100644
index 7e8d885f1..000000000
--- a/buckets/doc_page_io.txt
+++ /dev/null
@@ -1,166 +0,0 @@
-
-From dgaudet@arctic.org Fri Feb 20 00:36:52 1998
-Date: Fri, 20 Feb 1998 00:35:37 -0800 (PST)
-From: Dean Gaudet <dgaudet@arctic.org>
-To: new-httpd@apache.org
-Subject: page-based i/o
-X-Comment: Visit http://www.arctic.org/~dgaudet/legal for information regarding copyright and disclaimer.
-Reply-To: new-httpd@apache.org
-
-Ed asked me for more details on what I mean when I talk about "paged based
-zero copy i/o".
-
-While writing mod_mmap_static I was thinking about the primitives that the
-core requires of the filesystem. What exactly is it that ties us into the
-filesystem? and how would we abstract it? The metadata (last modified
-time, file length) is actually pretty easy to abstract. It's also easy to
-define an "index" function so that MultiViews and such can be implemented.
-And with layered I/O we can hide the actual details of how you access
-these "virtual" files.
-
-But therein lies an inefficiency. If we had only bread() for reading
-virtual files, then we would enforce at least one copy of the data.
-bread() supplies the place that the caller wants to see the data, and so
-the bread() code has to copy it. But there's very little reason that
-bread() callers have to supply the buffer... bread() itself could supply
-the buffer. Call this new interface page_read(). It looks something like
-this:
-
- typedef struct {
- const void *data;
- size_t data_len; /* amt of data on page which is valid */
- ... other stuff necessary for managing the page pool ...
- } a_page_head;
-
- /* returns NULL if an error or EOF occurs, on EOF errno will be
- * set to 0
- */
- a_page_head *page_read(BUFF *fb);
-
- /* queues entire page for writing, returns 0 on success, -1 on
- * error
- */
- int page_write(BUFF *fb, a_page_head *);
-
-It's very important that a_page_head structures point to the data page
-rather than be part of the data page. This way we can build a_page_head
-structures which refer to parts of mmap()d memory.
-
-This stuff is a little more tricky to do, but is a big win for performance.
-With this integrated into our layered I/O it means that we can have
-zero-copy performance while still getting the advantages of layering.
-
-But note I'm glossing over a bunch of details... like the fact that we
-have to decide if a_page_heads are shared data, and hence need reference
-counting (i.e. I said "queues for writing" up there, which means some
-bit of the a_page_head data has to be kept until its actually written).
-Similarly for the page data.
-
-There are other tricks in this area that we can take advantage of --
-like interprocess communication on architectures that do page flipping.
-On these boxes if you write() something that's page-aligned and page-sized
-to a pipe or unix socket, and the other end read()s into a page-aligned
-page-sized buffer then the kernel can get away without copying any data.
-It just marks the two pages as shared copy-on-write, and only when
-they're written to will the copy be made. So to make this work, your
-writer uses a ring of 2+ page-aligned/sized buffers so that it's not
-writing on something the reader is still reading.
-
-Dean
-
-----
-
-For details on HPUX and avoiding extra data copies, see
-<ftp://ftp.cup.hp.com/dist/networking/briefs/copyavoid.pdf>.
-
-(note that if you get the postscript version instead, you have to
-manually edit it to remove the front page before any version of
-ghostscript that I have used will read it)
-
-----
-
-I've been told by an engineer in Sun's TCP/IP group that zero-copy TCP
-in Solaris 2.6 occurs when:
-
- - you've got the right interface card (OC-12 ATM card I think)
- - you use write()
- - your write buffer is 16k aligned and a multiple of 16k in size
-
-We currently get the 16k stuff for free by using mmap(). But sun's
-current code isn't smart enough to deal with our initial writev()
-of the headers and first part of the response.
-
-----
-
-Systems that have a system call to efficiently send the contents of a
-descriptor across the network. This is probably the single best way
-to do static content on systems that support it.
-
-HPUX: (10.30 and on)
-
- ssize_t sendfile(int s, int fd, off_t offset, size_t nbytes,
- const struct iovec *hdtrl, int flags);
-
- (allows you to add headers and trailers in the form of iovec
- structs) Marc has a man page; ask if you want a copy. Not included
- due to copyright issues. man page also available from
- http://docs.hp.com/ (in particular,
- http://docs.hp.com:80/dynaweb/hpux11/hpuxen1a/rvl3en1a/@Generic__BookTextView/59894;td=3 )
-
-Windows NT:
-
- BOOL TransmitFile( SOCKET hSocket,
- HANDLE hFile,
- DWORD nNumberOfBytesToWrite,
- DWORD nNumberOfBytesPerSend,
- LPOVERLAPPED lpOverlapped,
- LPTRANSMIT_FILE_BUFFERS lpTransmitBuffers,
- DWORD dwFlags
- );
-
- (does it start from the current position in the handle? I would
- hope so, or else it is pretty dumb.)
-
- lpTransmitBuffers allows for headers and trailers.
-
- Documentation at:
-
- http://premium.microsoft.com/msdn/library/sdkdoc/wsapiref_3pwy.htm
- http://premium.microsoft.com/msdn/library/conf/html/sa8ff.htm
-
- Even less related to page based IO: just context switching:
- AcceptEx does an accept(), and returns the start of the
- input data. see:
-
- http://premium.microsoft.com/msdn/library/sdkdoc/pdnds/sock2/wsapiref_17jm.htm
-
- What this means is you require one less syscall to do a
- typical request, especially if you have a cache of handles
- so you don't have to do an open or close. Hmm. Interesting
- question: then, if TransmitFile starts from the current
- position, you need a mutex around the seek and the
- TransmitFile. If not, you are just limited (eg. byte
- ranges) in what you can use it for.
-
- Also note that TransmitFile can specify TF_REUSE_SOCKET, so that
- after use the same socket handle can be passed to AcceptEx.
- Obviously only good where we don't have a persistent connection
- to worry about.
-
-----
-
-Note that all this is shot to bloody hell by HTTP-NG's multiplexing.
-If fragment sizes are big enough, it could still be worthwhile to
-do copy avoidence. It also causes performance issues because of
-its credit system that limits how much you can write in a single
-chunk.
-
-Don't tell me that if HTTP-NG becomes popular we will seen vendors
-embedding SMUX (or whatever multiplexing is used) in the kernel to
-get around this stuff. There we go, Apache with a loadable kernel
-module.
-
-----
-
-Larry McVoy's document for SGI regarding sendfile/TransmitFile:
-ftp://ftp.bitmover.com/pub/splice.ps.gz