-rw-r--r-- | docs/Makefile.am | 1 |
-rw-r--r-- | docs/URL-SYNTAX.md | 316 |
2 files changed, 317 insertions, 0 deletions
diff --git a/docs/Makefile.am b/docs/Makefile.am
index 9cf657748..6a79f82ec 100644
--- a/docs/Makefile.am
+++ b/docs/Makefile.am
@@ -86,6 +86,7 @@ EXTRA_DIST = \
  THANKS \
  TODO \
  TheArtOfHttpScripting.md \
+ URL-SYNTAX.md \
  VERSIONS.md
 
 MAN2HTML= roffit $< >$@
diff --git a/docs/URL-SYNTAX.md b/docs/URL-SYNTAX.md
new file mode 100644
index 000000000..5277ef4f4
--- /dev/null
+++ b/docs/URL-SYNTAX.md
@@ -0,0 +1,316 @@
+# URL syntax and its use in curl
+
+## Specifications
+
+The official "URL syntax" is primarily defined in these two different
+specifications:
+
+ - [RFC 3986](https://tools.ietf.org/html/rfc3986) (although URL is called "URI" in there)
+ - [The WHATWG URL Specification](https://url.spec.whatwg.org/)
+
+RFC 3986 is the earlier one, and curl has always tried to adhere to that one
+(since it shipped in January 2005).
+
+The WHATWG URL spec was written later, is incompatible with RFC 3986 and
+changes over time.
+
+## Variations
+
+URL parsers as implemented in browsers, libraries and tools usually opt to
+support one of the mentioned specifications. Bugs, differences in
+interpretations and the moving nature of the WHATWG spec do however make it
+very unlikely that multiple parsers treat URLs the exact same way!
+
+## Security
+
+Due to the inherent differences between URL parser implementations, it is
+considered a security risk to mix different implementations and assume the
+same behavior!
+
+For example, if you use one parser to check if a URL uses a good host name or
+the correct auth field, and then pass on that same URL to a *second* parser,
+there will always be a risk it treats the same URL differently. There is no
+right and wrong in URL land, only differences of opinion.
+
+libcurl offers a separate API to its URL parser for this reason, among others.
+
+Applications may at times find it convenient to allow users to specify URLs
+for various purposes and that string would then end up fed to curl. Getting a
+URL from an external untrusted party and using it with curl brings several
+security concerns:
+
+1. If you have an application that runs as or in a server application, getting
+   an unfiltered URL can trick your application into accessing a local resource
+   instead of a remote one. Protecting yourself against localhost accesses is
+   very hard when accepting user provided URLs.
+
+2. Such custom URLs can access other ports than you planned as port numbers
+   are part of the regular URL format. The combination of a local host and a
+   custom port number can allow external users to play tricks with your local
+   services.
+
+3. Such a URL might use other schemes than you thought of or planned for.
+
+## "RFC 3986 plus"
+
+curl recognizes a URL syntax that we call "RFC 3986 plus". It is grounded in
+the well established RFC 3986 to make sure previously written command lines and
+curl-using scripts will remain working.
+
+curl's URL parser allows a few deviations from the spec in order to
+inter-operate better with URLs that appear in the wild.
+
+### spaces
+
+In particular `Location:` headers, which indicate to the client where a
+resource has been redirected to, sometimes contain spaces. This is a violation
+of RFC 3986 but is fine in the WHATWG spec. curl handles these by re-encoding
+them to `%20`.
+
+### non-ASCII
+
+Byte values in a provided URL that are outside of the printable ASCII range
+are percent-encoded by curl.
+
+### multiple slashes
+
+An absolute URL always starts with a "scheme" followed by a colon. For all the
+schemes curl supports, the colon must be followed by two slashes according to
+RFC 3986 but not according to the WHATWG spec, which allows any number of
+slashes from one upward.
+
+curl allows one, two or three slashes after the colon to still be considered a
+valid URL.
+
+### "scheme-less"
+
+curl supports "URLs" that do not start with a scheme. This is not supported by
+any of the specifications. This is a shortcut to entering URLs that was
+supported by browsers early on and has been mimicked by curl.
+
+Based on what the host name starts with, curl will "guess" what protocol to
+use:
+
+ - `ftp.` means FTP
+ - `dict.` means DICT
+ - `ldap.` means LDAP
+ - `imap.` means IMAP
+ - `smtp.` means SMTP
+ - `pop3.` means POP3
+ - anything else means HTTP
+
+### globbing letters
+
+The curl command line tool supports "globbing" of URLs. It means that you can
+create ranges and lists using `[N-M]` and `{one,two,three}` sequences. The
+letters used for this (`[]{}`) are reserved in RFC 3986 and can therefore not
+legitimately be part of such a URL.
+
+They are however not reserved or special in the WHATWG specification, so
+globbing can mess up such URLs. Globbing can be turned off for such occasions
+(using `--globoff`).
+
+# URL syntax details
+
+A URL may consist of the following components - many of them are optional:
+
+    [scheme][divider][userinfo][hostname][port number][path][query][fragment]
+
+Each component is separated from the following component with a divider
+character or string.
+
+An example could look like this:
+
+    http://user:password@www.example.com:80/index.html?foo=bar#top
+
+## Scheme
+
+The scheme specifies the protocol to use. A curl build can support a few or
+many different schemes. You can limit what schemes curl should accept.
+
+## Userinfo
+
+The userinfo field can be used to set the user name and password for
+authentication purposes in this transfer. The use of this field is discouraged
+since it often means passing around the password in plain text and is thus a
+security risk.
+
+URLs for IMAP, POP3 and SMTP also support *login options* as part of the
+userinfo field. They are provided as a semicolon after the password, followed
+by the options.
+
+## Hostname
+
+The hostname part of the URL contains the address of the server that you want
+to connect to. This can be the fully qualified domain name of the server, the
+local network name of the machine on your network or the IP address of the
+server or machine represented by either an IPv4 or IPv6 address (within
+brackets). For example:
+
+    http://www.example.com/
+
+    http://hostname/
+
+    http://192.168.0.1/
+
+    http://[2001:1890:1112:1::20]/
+
+If curl was built with International Domain Name (IDN) support, it can also
+handle host names using non-ASCII characters.
+
+## Port number
+
+If there's a colon after the hostname, that should be followed by the port
+number to use, in the range 1 - 65535. curl also supports a blank port number
+field - but only if the URL starts with a scheme.
+
+# Scheme specific behaviors
+
+## FTP
+
+The path part of an FTP request specifies the file to retrieve and from what
+directory. If the file part is omitted then libcurl downloads the directory
+listing for the directory specified. If the directory is omitted then the
+directory listing for the root / home directory will be returned.
+
+FTP servers typically put the user in its "home directory" after login, which
+then differs between users. To explicitly specify the root directory of an FTP
+server start the path with double slash `//` or `/%2f` (2F is the hexadecimal
+value of the ASCII code for the slash).
+
+## FILE
+
+When a `FILE://` URL is accessed on Windows systems, it can be crafted in a
+way so that Windows attempts to connect to a (remote) machine when curl wants
+to read or write such a path.
+
+curl only allows the hostname part of a FILE URL to be one out of these three
+alternatives: `localhost`, `127.0.0.1` or blank ("", zero characters).
+Anything else will make curl fail to parse the URL.
+
+On Windows, curl accepts that the FILE URL's path starts with a "drive
+letter". That's a single letter `a` to `z` followed by a colon or a pipe
+character (`|`).
+
+## IMAP
+
+The path part of an IMAP request not only specifies the mailbox to list or
+select, but can also be used to check the `UIDVALIDITY` of the mailbox, to
+specify the `UID`, `SECTION` and `PARTIAL` octets of the message to fetch and
+to specify what messages to search for.
+
+A top level folder list:
+
+    imap://user:password@mail.example.com
+
+A folder list on the user's inbox:
+
+    imap://user:password@mail.example.com/INBOX
+
+Select the user's inbox and fetch message with uid = 1:
+
+    imap://user:password@mail.example.com/INBOX/;UID=1
+
+Select the user's inbox and fetch the first message in the mail box:
+
+    imap://user:password@mail.example.com/INBOX/;MAILINDEX=1
+
+Select the user's inbox, check the `UIDVALIDITY` of the mailbox is 50 and
+fetch message 2 if it is:
+
+    imap://user:password@mail.example.com/INBOX;UIDVALIDITY=50/;UID=2
+
+Select the user's inbox and fetch the text portion of message 3:
+
+    imap://user:password@mail.example.com/INBOX/;UID=3/;SECTION=TEXT
+
+Select the user's inbox and fetch the first 1024 octets of message 4:
+
+    imap://user:password@mail.example.com/INBOX/;UID=4/;PARTIAL=0.1024
+
+Select the user's inbox and check for NEW messages:
+
+    imap://user:password@mail.example.com/INBOX?NEW
+
+Select the user's inbox and search for messages containing "shadows" in the
+subject line:
+
+    imap://user:password@mail.example.com/INBOX?SUBJECT%20shadows
+
+For more information about the individual components of an IMAP URL please see
+RFC 5092.
+
+## LDAP
+
+The path part of an LDAP request can be used to specify the Distinguished
+Name, Attributes, Scope, Filter and Extension for an LDAP search. Each field is
+separated by a question mark and when that field is not required an empty
+string with the question mark separator should be included.
+
+Search for the DN as `My Organisation`:
+
+    ldap://ldap.example.com/o=My%20Organisation
+
+The same search, but returning only postalAddress attributes:
+
+    ldap://ldap.example.com/o=My%20Organisation?postalAddress
+
+Search for an empty DN and request information about the
+`rootDomainNamingContext` attribute for an Active Directory server:
+
+    ldap://ldap.example.com/?rootDomainNamingContext
+
+For more information about the individual components of an LDAP URL please
+see RFC 4516.
+
+## POP3
+
+The path part of a POP3 request specifies the message ID to retrieve. If the
+ID is not specified then a list of waiting messages is returned instead.
+
+## SCP
+
+The path part of an SCP URL specifies the path and file to retrieve or
+upload. The file is taken as an absolute path from the root directory on the
+server.
+
+To specify a path relative to the user's home directory on the server, prepend
+`~/` to the path portion.
+
+## SFTP
+
+The path part of an SFTP URL specifies the file to retrieve or upload. If the
+path ends with a slash (`/`) then a directory listing is returned instead of a
+file. If the path is omitted entirely then the directory listing for the root
+/ home directory will be returned.
+
+## SMB
+The path part of an SMB request specifies the file to retrieve and from what
+share and directory or the share to upload to and as such, may not be omitted.
+If the user name is embedded in the URL then it must contain the domain name
+and as such, the backslash must be URL encoded as %2f.
+
+curl supports SMB version 1 (only).
+
+## SMTP
+
+The path part of an SMTP request specifies the host name to present during
+communication with the mail server. If the path is omitted then libcurl will
+attempt to resolve the local computer's host name. However, this may not
+return the fully qualified domain name that is required by some mail servers
+and specifying this path allows you to set an alternative name, such as your
+machine's fully qualified domain name, which you might have obtained from an
+external function such as gethostname or getaddrinfo.
+
+## RTMP
+
+There's no official URL spec for RTMP so libcurl uses the URL syntax supported
+by the underlying librtmp library. It has a syntax where it wants a
+traditional URL, followed by a space and a series of space-separated
+`name=value` pairs.
+
+While space is not typically a "legal" letter, libcurl accepts them. When a
+user wants to pass in a `#` (hash) character it will be treated as a fragment
+and get cut off by libcurl if provided literally. You will instead have to
+escape it by providing it as backslash and its ASCII value in hexadecimal:
+`\23`.
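
The protocol guessing described under "scheme-less" in the added file can be sketched in a few lines of Python. This is only an illustration of the prefix table from the document, not curl's actual implementation (which lives in curl's C URL parser). The names `GUESSES` and `guess_scheme` are hypothetical, and two simplifications are assumptions made here: a scheme is detected only by looking for `://`, and the host name prefix is compared case-insensitively.

```python
# Hypothetical sketch of curl's "scheme-less" guessing; not curl's real code.
# The prefix table mirrors the list in the "scheme-less" section of the document.
GUESSES = [
    ("ftp.", "ftp"),
    ("dict.", "dict"),
    ("ldap.", "ldap"),
    ("imap.", "imap"),
    ("smtp.", "smtp"),
    ("pop3.", "pop3"),
]


def guess_scheme(url: str) -> str:
    """Return the URL's scheme, guessing from the host name prefix if none is given."""
    # Assumption: treat "://" as the only scheme divider. curl itself also
    # accepts one or three slashes after the colon.
    if "://" in url:
        return url.split("://", 1)[0].lower()

    # Assumption: compare the prefix case-insensitively.
    host = url.lower()
    for prefix, scheme in GUESSES:
        if host.startswith(prefix):
            return scheme

    # Anything else means HTTP.
    return "http"


if __name__ == "__main__":
    for candidate in ("ftp.example.com/file.txt", "www.example.com",
                      "https://ftp.example.com/"):
        print(candidate, "->", guess_scheme(candidate))
```

Note that an explicit scheme always wins: `https://ftp.example.com/` stays HTTPS even though the host name starts with `ftp.`. That is why a guess like this only matters for input that has no scheme at all.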