diff options
Diffstat (limited to 'buckets/doc_page_io.txt')
-rw-r--r-- | buckets/doc_page_io.txt | 166 |
1 files changed, 0 insertions, 166 deletions
diff --git a/buckets/doc_page_io.txt b/buckets/doc_page_io.txt deleted file mode 100644 index 7e8d885f1..000000000 --- a/buckets/doc_page_io.txt +++ /dev/null @@ -1,166 +0,0 @@ - -From dgaudet@arctic.org Fri Feb 20 00:36:52 1998 -Date: Fri, 20 Feb 1998 00:35:37 -0800 (PST) -From: Dean Gaudet <dgaudet@arctic.org> -To: new-httpd@apache.org -Subject: page-based i/o -X-Comment: Visit http://www.arctic.org/~dgaudet/legal for information regarding copyright and disclaimer. -Reply-To: new-httpd@apache.org - -Ed asked me for more details on what I mean when I talk about "paged based -zero copy i/o". - -While writing mod_mmap_static I was thinking about the primitives that the -core requires of the filesystem. What exactly is it that ties us into the -filesystem? and how would we abstract it? The metadata (last modified -time, file length) is actually pretty easy to abstract. It's also easy to -define an "index" function so that MultiViews and such can be implemented. -And with layered I/O we can hide the actual details of how you access -these "virtual" files. - -But therein lies an inefficiency. If we had only bread() for reading -virtual files, then we would enforce at least one copy of the data. -bread() supplies the place that the caller wants to see the data, and so -the bread() code has to copy it. But there's very little reason that -bread() callers have to supply the buffer... bread() itself could supply -the buffer. Call this new interface page_read(). It looks something like -this: - - typedef struct { - const void *data; - size_t data_len; /* amt of data on page which is valid */ - ... other stuff necessary for managing the page pool ... - } a_page_head; - - /* returns NULL if an error or EOF occurs, on EOF errno will be - * set to 0 - */ - a_page_head *page_read(BUFF *fb); - - /* queues entire page for writing, returns 0 on success, -1 on - * error - */ - int page_write(BUFF *fb, a_page_head *); - -It's very important that a_page_head structures point to the data page -rather than be part of the data page. This way we can build a_page_head -structures which refer to parts of mmap()d memory. - -This stuff is a little more tricky to do, but is a big win for performance. -With this integrated into our layered I/O it means that we can have -zero-copy performance while still getting the advantages of layering. - -But note I'm glossing over a bunch of details... like the fact that we -have to decide if a_page_heads are shared data, and hence need reference -counting (i.e. I said "queues for writing" up there, which means some -bit of the a_page_head data has to be kept until its actually written). -Similarly for the page data. - -There are other tricks in this area that we can take advantage of -- -like interprocess communication on architectures that do page flipping. -On these boxes if you write() something that's page-aligned and page-sized -to a pipe or unix socket, and the other end read()s into a page-aligned -page-sized buffer then the kernel can get away without copying any data. -It just marks the two pages as shared copy-on-write, and only when -they're written to will the copy be made. So to make this work, your -writer uses a ring of 2+ page-aligned/sized buffers so that it's not -writing on something the reader is still reading. - -Dean - ----- - -For details on HPUX and avoiding extra data copies, see -<ftp://ftp.cup.hp.com/dist/networking/briefs/copyavoid.pdf>. - -(note that if you get the postscript version instead, you have to -manually edit it to remove the front page before any version of -ghostscript that I have used will read it) - ----- - -I've been told by an engineer in Sun's TCP/IP group that zero-copy TCP -in Solaris 2.6 occurs when: - - - you've got the right interface card (OC-12 ATM card I think) - - you use write() - - your write buffer is 16k aligned and a multiple of 16k in size - -We currently get the 16k stuff for free by using mmap(). But sun's -current code isn't smart enough to deal with our initial writev() -of the headers and first part of the response. - ----- - -Systems that have a system call to efficiently send the contents of a -descriptor across the network. This is probably the single best way -to do static content on systems that support it. - -HPUX: (10.30 and on) - - ssize_t sendfile(int s, int fd, off_t offset, size_t nbytes, - const struct iovec *hdtrl, int flags); - - (allows you to add headers and trailers in the form of iovec - structs) Marc has a man page; ask if you want a copy. Not included - due to copyright issues. man page also available from - http://docs.hp.com/ (in particular, - http://docs.hp.com:80/dynaweb/hpux11/hpuxen1a/rvl3en1a/@Generic__BookTextView/59894;td=3 ) - -Windows NT: - - BOOL TransmitFile( SOCKET hSocket, - HANDLE hFile, - DWORD nNumberOfBytesToWrite, - DWORD nNumberOfBytesPerSend, - LPOVERLAPPED lpOverlapped, - LPTRANSMIT_FILE_BUFFERS lpTransmitBuffers, - DWORD dwFlags - ); - - (does it start from the current position in the handle? I would - hope so, or else it is pretty dumb.) - - lpTransmitBuffers allows for headers and trailers. - - Documentation at: - - http://premium.microsoft.com/msdn/library/sdkdoc/wsapiref_3pwy.htm - http://premium.microsoft.com/msdn/library/conf/html/sa8ff.htm - - Even less related to page based IO: just context switching: - AcceptEx does an accept(), and returns the start of the - input data. see: - - http://premium.microsoft.com/msdn/library/sdkdoc/pdnds/sock2/wsapiref_17jm.htm - - What this means is you require one less syscall to do a - typical request, especially if you have a cache of handles - so you don't have to do an open or close. Hmm. Interesting - question: then, if TransmitFile starts from the current - position, you need a mutex around the seek and the - TransmitFile. If not, you are just limited (eg. byte - ranges) in what you can use it for. - - Also note that TransmitFile can specify TF_REUSE_SOCKET, so that - after use the same socket handle can be passed to AcceptEx. - Obviously only good where we don't have a persistent connection - to worry about. - ----- - -Note that all this is shot to bloody hell by HTTP-NG's multiplexing. -If fragment sizes are big enough, it could still be worthwhile to -do copy avoidence. It also causes performance issues because of -its credit system that limits how much you can write in a single -chunk. - -Don't tell me that if HTTP-NG becomes popular we will seen vendors -embedding SMUX (or whatever multiplexing is used) in the kernel to -get around this stuff. There we go, Apache with a loadable kernel -module. - ----- - -Larry McVoy's document for SGI regarding sendfile/TransmitFile: -ftp://ftp.bitmover.com/pub/splice.ps.gz |