At the time of writing this document, in October 2020, there are two major standards concerning Uniform Resource Identifiers and Uniform Resource Locators: RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax) and the WHATWG URL Living Standard.
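The former is a classical standard with a proper formal syntax, using the so-called Augmented Backus-Naur Form (ABNF) to describe the grammar, while the latter is a living document describing current practice in Web browsers.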
What is the difference between them, if any? They provide an overlapping
definition for resource identifiers and they are not compatible.
The uri_string module implements RFC 3986, and the term URI is used throughout this document.
For a more complete problem
statement regarding the URIs check the
Let's start with what it is not. It is not the text that you type in the address bar of your Web browser. Web browsers apply all kinds of heuristics to convert that input into a valid URI that can be sent over the network.
A URI is an identifier consisting of a sequence of characters matching the syntax
rule named URI in RFC 3986.
It is crucial to clarify that a character is a symbol that is displayed on a terminal or written to paper and should not be confused with its internal representation.
More specifically, a URI is a sequence of characters from a
subset of the US-ASCII character set. The generic URI syntax consists of a
hierarchical sequence of components referred to as the scheme, authority,
path, query, and fragment. There is a formal description for
each of these components in ABNF notation in RFC 3986:
    URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
    hier-part   = "//" authority path-abempty
                / path-absolute
                / path-rootless
                / path-empty
    scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
    authority   = [ userinfo "@" ] host [ ":" port ]
    userinfo    = *( unreserved / pct-encoded / sub-delims / ":" )
    reserved    = gen-delims / sub-delims
    gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                / "*" / "+" / "," / ";" / "="
    unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
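To relate the grammar to the API, the following minimal sketch lists the components that uri_string:parse/1 finds in a URI, in the order in which the grammar names them. The module name uri_components and the function components/1 are hypothetical names chosen for this illustration; only uri_string:parse/1 is part of Erlang/OTP.

```erlang
-module(uri_components).
-export([components/1]).

%% Hypothetical helper: return the components of a URI as a list of
%% {Name, Value} pairs, ordered as in the generic syntax. Components
%% that are absent from the URI are left out of the result.
components(URI) ->
    case uri_string:parse(URI) of
        {error, _Reason, _Term} = Error ->
            %% The input does not match the URI grammar.
            Error;
        Map ->
            Order = [scheme, userinfo, host, port, path, query, fragment],
            [{Key, maps:get(Key, Map)} || Key <- Order, maps:is_key(Key, Map)]
    end.
```

For example, calling uri_components:components/1 on "foo://user@example.com:8042/over/there?name=ferret#nose" yields one pair for each of the seven components, with the port returned as an integer.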
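As producing and consuming standard URIs can get quite complex, Erlang/OTP
provides a module, uri_string, to handle the most difficult operations.
The API functions in uri_string work on two basic data types, uri_string() and uri_map().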
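Let's say that we would like to create the following URI and send it over the
network: http://cities/örebro?foo bar
This is not a valid URI, as it contains characters that are not allowed, such as "ö" and the space. Parsing it confirms this:
    1> uri_string:parse("http://cities/örebro?foo bar").
    {error,invalid_uri,":"}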
The URI parser tries all possible combinations to interpret the input and fails
at the last attempt, when it encounters the colon character (":").
The proper way to solve this problem is to use uri_string:recompose/1 with a uri_map() as input:
    2> uri_string:recompose(#{scheme => "http", host => "cities", path => "/örebro", query => "foo bar"}).
    "http://cities/%C3%B6rebro?foo%20bar"
The result is a valid URI where all the special characters are encoded as defined
by the standard. Applying uri_string:parse/1 and uri_string:percent_decode/1 to the URI returns the original input:
    3> uri_string:percent_decode(uri_string:parse("http://cities/%C3%B6rebro?foo%20bar")).
    #{host => "cities",path => "/örebro",query => "foo bar",
      scheme => "http"}
This symmetric property is heavily used in our property test suite.
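The symmetry can be written down as a simple check. The sketch below is only an illustration and is not taken from the actual property test suite; the module name uri_roundtrip and its functions are hypothetical. It assumes an Erlang/OTP release in which uri_string:recompose/1 percent-encodes and uri_string:percent_decode/1 is available (OTP 23.2 or later).

```erlang
-module(uri_roundtrip).
-export([roundtrip/1, example/0]).

%% Hypothetical check: recompose a uri_map() into a URI, parse the URI
%% back, percent-decode the result and compare it with the input map.
roundtrip(Map) ->
    case uri_string:recompose(Map) of
        {error, _Reason, _Term} ->
            false;
        URI ->
            case uri_string:parse(URI) of
                {error, _, _} ->
                    false;
                Parsed ->
                    uri_string:percent_decode(Parsed) =:= Map
            end
    end.

%% The map used in the shell session above.
example() ->
    roundtrip(#{scheme => "http", host => "cities",
                path => "/örebro", query => "foo bar"}).
```

A property-based test would generate many such maps instead of using a single fixed one.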
As you have seen in the previous chapter, a standard URI can only contain a strict
subset of the US-ASCII character set. Moreover, the allowed set of characters
differs between the URI components. Percent-encoding is a mechanism to
represent a data octet in a component when that octet's corresponding character
is outside of
the allowed set or is being used as a delimiter. This is what you see when "ö" is encoded as %C3%B6 and the space as %20. The utility function uri_string:allowed_characters/0 lists the characters allowed in each major URI component:
1> uri_string:allowed_characters().
If a URI component has a character that is not allowed, it will be percent-encoded when the URI is produced:
    2> uri_string:recompose(#{scheme => "https", host => "local#host", path => ""}).
    "https://local%23host"
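Whether a given character would be percent-encoded in a component can be checked against the output of uri_string:allowed_characters/0. The following is a minimal sketch; the module name uri_allowed and the function is_allowed/2 are hypothetical, and it assumes that allowed_characters/0 returns a list of {Component, Characters} pairs with keys such as host, path and query.

```erlang
-module(uri_allowed).
-export([is_allowed/2]).

%% Hypothetical helper: true if Char may appear unencoded in Component
%% according to uri_string:allowed_characters/0, false otherwise
%% (also false when Component is not a key in the returned list).
is_allowed(Char, Component) ->
    case lists:keyfind(Component, 1, uri_string:allowed_characters()) of
        {Component, Chars} ->
            lists:member(Char, Chars);
        false ->
            false
    end.
```

With this helper, is_allowed($#, host) is false, which is why "#" in the host above is emitted as %23.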
Consuming a URI containing percent-encoded triplets can take many steps. The
following example shows how to handle an input URI that is not normalized and
contains multiple percent-encoded triplets.
First, the input is parsed:
    3> uri_string:parse("http://%6C%6Fcal%23host/%F6re%26bro%20").
    #{host => "%6C%6Fcal%23host",path => "/%F6re%26bro%20",
      scheme => "http"}
The input is a valid URI, but how can you decode those
percent-encoded octets? You can try to normalize the input with uri_string:normalize/1:
    4> uri_string:normalize("http://%6C%6Fcal%23host/%F6re%26bro%20").
    "http://local%23host/%F6re%26bro%20"
    5> uri_string:normalize("http://%6C%6Fcal%23host/%F6re%26bro%20", [return_map]).
    #{host => "local%23host",path => "/%F6re%26bro%20",
      scheme => "http"}
There are still a few percent-encoded triplets left in the output. At this point,
when the URI is already parsed, it is safe to apply application-specific decoding to
the remaining character triplets. Erlang/OTP provides a function, uri_string:percent_decode/1, for raw percent-decoding that can be applied to individual components or to the whole map:
    6> uri_string:percent_decode("local%23host").
    "local#host"
    7> uri_string:percent_decode("/%F6re%26bro%20").
    {error,invalid_utf8,<<"/öre&bro ">>}
    8> uri_string:percent_decode(#{host => "local%23host",path => "/%F6re%26bro%20", scheme => "http"}).
    {error,{invalid,{path,{invalid_utf8,<<"/öre&bro ">>}}}}
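The host was decoded successfully, but the path contains at least one triplet that is not valid UTF-8. To decode it, you have to make an assumption about the encoding used in these triplets. The most obvious choice is latin-1, so you can use uri_string:transcode/2 to transcode the path to UTF-8 and then percent-decode the result:
    9> uri_string:transcode("/%F6re%26bro%20", [{in_encoding, latin1}]).
    "/%C3%B6re%26bro%20"
    10> uri_string:percent_decode("/%C3%B6re%26bro%20").
    "/öre&bro "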
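The two steps can be combined into one helper. This is a sketch under the assumption that the triplets in the input are known to be latin-1 encoded, which the application has to establish on its own; the module name uri_latin1 and the function decode_latin1/1 are hypothetical.

```erlang
-module(uri_latin1).
-export([decode_latin1/1]).

%% Hypothetical helper: percent-decode a component whose triplets are
%% assumed to be latin-1 encoded. The triplets are first transcoded to
%% UTF-8, then decoded with uri_string:percent_decode/1.
decode_latin1(Component) ->
    case uri_string:transcode(Component, [{in_encoding, latin1}]) of
        {error, _Reason, _Term} = Error ->
            Error;
        Utf8 ->
            uri_string:percent_decode(Utf8)
    end.
```

Applied to "/%F6re%26bro%20", the helper returns the decoded path instead of the invalid_utf8 error.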
It is important to emphasize that it is not safe to apply uri_string:percent_decode/1 directly to an input URI:
    11> uri_string:percent_decode("http://%6C%6Fcal%23host/%C3%B6re%26bro%20").
    "http://local#host/öre&bro "
    12> uri_string:parse("http://local#host/öre&bro ").
    {error,invalid_uri,":"}
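The safe order is to parse first and decode the components afterwards. A minimal sketch follows; the module name uri_safe_decode and the function parse_and_decode/1 are hypothetical, and the input is assumed to use UTF-8 in its percent-encoded triplets.

```erlang
-module(uri_safe_decode).
-export([parse_and_decode/1]).

%% Hypothetical helper: split the URI into components while the
%% delimiters are still unambiguous, then percent-decode the resulting
%% map component by component.
parse_and_decode(URI) ->
    case uri_string:parse(URI) of
        {error, _Reason, _Term} = Error ->
            Error;
        Map ->
            %% Returns the decoded map, or an error tuple if a
            %% component does not decode to valid UTF-8.
            uri_string:percent_decode(Map)
    end.
```

Because the "#" in the host is decoded only after parsing, it can no longer be mistaken for the fragment delimiter.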
Percent-encoding is implemented in uri_string:recompose/1 and happens when a uri_map() is converted into a uri_string().
Normalization is the operation of converting the input URI into a canonical form while keeping the reference to the same underlying resource. The most common application of normalization is determining whether two URIs are equivalent without accessing their referenced resources.
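Normalization has 6 distinct steps. First, the input URI is parsed into an
intermediate form that can handle Unicode characters, the uri_map(). Then a sequence of normalization algorithms is applied to the individual URI components:
Case normalization: Converts the scheme and host components to lower case, as they are not case sensitive.
Percent-encoding normalization: Decodes percent-encoded triplets that correspond to characters in the unreserved set.
Scheme-based normalization: Applies rules for the schemes http, https, ftp, ssh, sftp and tftp.
Path segment normalization: Converts the path into a canonical form.
After these steps, the intermediate data structure, a uri_map(), is fully normalized. The final step is applying uri_string:recompose/1, which converts the intermediate structure into a valid canonical URI string.
Notice the order: uri_string:normalize/2 with the return_map option, used several times in this guide, is a shortcut that returns the intermediate data structure, which lets you inspect it and apply further decoding to the remaining percent-encoded triplets.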
    13> uri_string:normalize("hTTp://LocalHost:80/%c3%B6rebro/a/../b").
    "http://localhost/%C3%B6rebro/b"
    14> uri_string:normalize("hTTp://LocalHost:80/%c3%B6rebro/a/../b", [return_map]).
    #{host => "localhost",path => "/%C3%B6rebro/b",
      scheme => "http"}
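The equivalence check mentioned at the start of this chapter can be sketched by comparing normalized forms. The module name uri_equiv and the function equivalent/2 are hypothetical; the sketch assumes both inputs have the same type (both lists or both binaries) and treats any input that fails to normalize as not equivalent.

```erlang
-module(uri_equiv).
-export([equivalent/2]).

%% Hypothetical helper: two URIs are considered equivalent if their
%% normalized forms are identical. No network access is involved.
equivalent(A, B) ->
    case {uri_string:normalize(A), uri_string:normalize(B)} of
        {{error, _, _}, _} -> false;
        {_, {error, _, _}} -> false;
        {NormA, NormB} -> NormA =:= NormB
    end.
```

This is a syntactic comparison only: it can report two URIs as different even if they happen to reference the same resource.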
The current URI implementation provides support for producing and consuming standard URIs. The API is not meant to be directly exposed in a Web browser's address bar where users can basically enter free text. Application designers shall implement proper heuristics to map the input into a parsable URI.