lib/stdlib/doc/src/uri_string_usage.xml


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE chapter SYSTEM "chapter.dtd">

<chapter>
  <header>
    <copyright>
      <year>2022</year>
      <holder>Ericsson AB. All Rights Reserved.</holder>
    </copyright>
    <legalnotice>
      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License.

    </legalnotice>

    <title>Uniform Resource Identifiers</title>
    <prepared>Péter Dimitrov</prepared>
    <responsible></responsible>
    <docno></docno>
    <approved></approved>
    <checked></checked>
    <date>2020-09-30</date>
    <rev>PA1</rev>
    <file>uri_string_usage.xml</file>
  </header>
  <section>
    <title>Basics</title>
    <p>At the time of writing this document, in October 2020, there are
    two major standards concerning Universal Resource Identifiers and
    Universal Resource Locators:</p>
    <list type="bulleted">
      <item><p>
	<url href="https://www.ietf.org/rfc/rfc3986.txt">RFC 3986 - Uniform Resource
      Identifier (URI): Generic Syntax</url></p></item>
      <item><p>
	<url href="https://url.spec.whatwg.org/">WHAT WG URL - Living standard</url>
      </p></item>
    </list>
    <p>
    The former is a classical standard with a proper formal syntax, using the so
    called <url href="https://www.ietf.org/rfc/rfc2234.txt">Augmented Backus-Naur Form
    (ABNF)</url> for describing
    the grammar, while the latter is a living document describing the current pratice,
    that is, how a majority of Web browsers work with URIs. WHAT WG URL is Web focused
    and it has no formal grammar but a plain english description of the algorithms
    that should be followed.</p>
    <p>What is the difference between them, if any? They provide an overlapping
    definition for resource identifiers and they are not compatible.
    The <seeerl marker="stdlib:uri_string"><c>uri_string</c></seeerl> module implements
    <url href="https://www.ietf.org/rfc/rfc3986.txt">RFC 3986</url> and the term URI will
    be used throughout this document. A URI is an identifier, a string of characters
    that identifies a particular resource.</p>
    <p>
    For a more complete problem
    statement regarding the URIs check the
    <url href="https://tools.ietf.org/html/draft-ruby-url-problem-01">URL Problem
    Statement and Directions</url>.</p>
  </section>

  <section>
    <title>What is a URI?</title>
    <p>Let's start with what it is not. It is not the text that you type in the address
    bar in your Web browser. Web browsers do all possible heuristics to convert the
    input into a valid URI that could be sent over the network.</p>
    <p>A URI is an identifier consisting of a sequence of characters matching the syntax
    rule named <c>URI</c> in
    <url href="https://www.ietf.org/rfc/rfc3986.txt">RFC 3986</url>.
    </p>
    <p>It is crucial to clarify that a <i>character</i> is a symbol that is displayed on
    a terminal or written to paper and should not be confused with its internal
    representation.</p>
    <p>A URI more specifically, is a sequence of characters from a
    subset of the US ASCII character set. The generic URI syntax consists of a
    hierarchical sequence of components referred to as the scheme, authority,
    path, query, and fragment. There is a formal description for
    each of these components in
    <url href="https://www.ietf.org/rfc/rfc2234.txt">ABNF</url> notation in
    <url href="https://www.ietf.org/rfc/rfc3986.txt">RFC 3986</url>:</p>
    <pre>
    URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
    hier-part   = "//" authority path-abempty
                   / path-absolute
                   / path-rootless
                   / path-empty
    scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
    authority   = [ userinfo "@" ] host [ ":" port ]
    userinfo    = *( unreserved / pct-encoded / sub-delims / ":" )

    reserved    = gen-delims / sub-delims
    gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims  = "!" / "$" / "&amp;" / "'" / "(" / ")"
                / "*" / "+" / "," / ";" / "="

    unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
    </pre>
  </section>

  <section>
    <title>The uri_string module</title>
    <p>As producing and consuming standard URIs can get quite complex, Erlang/OTP
    provides
    a module, <seeerl marker="stdlib:uri_string"><c>uri_string</c></seeerl>, to handle all the most difficult operations such as parsing,
    recomposing, normalizing and resolving URIs against a base URI.
    </p>
    <p>The API functions in <seeerl marker="stdlib:uri_string"><c>uri_string</c></seeerl>
    work on two basic data types
    <seetype marker="uri_string#uri_string"><c>uri_string()</c></seetype> and
    <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>.
    <seetype marker="uri_string#uri_string"><c>uri_string()</c></seetype> represents a
    standard URI, while
    <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype> is a wider datatype,
    that can represent URI components using
    <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide> characters.
    <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>
    is a convenient choice for enabling
    operations such as producing standard compliant URIs out of components that have
    special or <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide>
    characters. It is easier to explain this by an example.
    </p>
    <p>Let's say that we would like to create the following URI and send it over the
    network: <c>http://cities/örebro?foo bar</c>. This is not a valid URI as it contains
    characters that are not allowed in a URI such as "ö" and the space. We can verify
    this by parsing the URI:
  </p>
  <pre>
  1> uri_string:parse("http://cities/örebro?foo bar").
  {error,invalid_uri,":"}
  </pre>
  <p>The URI parser tries all possible combinations to interpret the input and fails
  at the last attempt when it encounters the colon character <c>":"</c>. Note, that
  the inital fault occurs when the parser attempts to interpret the character
  <c>"ö"</c> and after a failure back-tracks to the point where it has another
  possible parsing alternative.</p>
  <p>The proper way to solve this problem is to use
  <seemfa marker="uri_string#recompose/1"><c>uri_string:recompose/1</c></seemfa>
  with a <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype> as input:</p>
  <pre>
  2> uri_string:recompose(#{scheme => "http", host => "cities", path => "/örebro",
  query => "foo bar"}).
  "http://cities/%C3%B6rebro?foo%20bar"
  </pre>
  <p>The result is a valid URI where all the special characters are encoded as defined
  by the standard. Applying
  <seemfa marker="uri_string#parse/1"><c>uri_string:parse/1</c></seemfa> and
  <seemfa marker="uri_string#percent_decode/1"><c>uri_string:percent_decode/1</c></seemfa>
  on the URI returns the original input:
  </p>
  <pre>
  3> uri_string:percent_decode(uri_string:parse("http://cities/%C3%B6rebro?foo%20bar")).
  #{host => "cities",path => "/örebro",query => "foo bar",
  scheme => "http"}
  </pre>
  <p>This symmetric property is heavily used in our property test suite.
  </p>
  </section>

  <section>
    <title>Percent-encoding</title>
    <p>As you have seen in the previous chapter, a standard URI can only contain a strict
    subset of the US ASCII character set, moreover the allowed set of characters is not
    the same in the different URI components. Percent-encoding is a mechanism to
    represent a data octet in a component when that octet's corresponding character
    is outside of
    the allowed set or is being used as a delimiter. This is what you see when <c>"ö"</c>
    is encoded as <c>%C3%B6</c> and <c>space</c> as <c>%20</c>.
    Most of the API functions are
    expecting UTF-8 encoding when handling percent-encoded triplets. The UTF-8 encoding
    of the <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide>
    character <c>"ö"</c> is two octets: <c>OxC3 0xB6</c>.
    The character <c>space</c> is in the first 128 characters of
    <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide> and it is encoded
    using a single octet <c>0x20</c>.</p>
    <note><p><seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide>
    is backward compatible with ASCII, the encoding of the first 128
    characters is the same binary value as in ASCII.
    </p></note>
    <p><marker id="percent_encoding"></marker>
    It is a major source of confusion exactly which characters will be
    percent-encoded. In order to make it easier to answer this question the library
    provides a utility function,
    <seemfa marker="uri_string#allowed_characters/0"><c>uri_string:allowed_characters/0
    </c></seemfa>,
    that lists the allowed set of characters in each major
    URI component, and also in the most important standard character sets.
    </p>
    <pre>
    1> uri_string:allowed_characters().
    <![CDATA[{scheme,
     "+-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"},
    {userinfo,
     "!$%&'()*+,-.0123456789:;=ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"},
    {host,
     "!$&'()*+,-.0123456789:;=ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"},
    {ipv4,".0123456789"},
    {ipv6,".0123456789:ABCDEFabcdef"},
    {regname,
     "!$%&'()*+,-.0123456789;=ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"},
    {path,
     "!$%&'()*+,-./0123456789:;=@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"},
    {query,
     "!$%&'()*+,-./0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"},
    {fragment,
     "!$%&'()*+,-./0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"},
    {reserved,"!#$&'()*+,/:;=?@[]"},
    {unreserved,
     "-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~"}] ]]>
    </pre>
    <p>If a URI component has a character that is not allowed, it will be
    percent-encoded when the URI is produced:
    </p>
    <pre>
    2> uri_string:recompose(#{scheme => "https", host => "local#host", path => ""}).
    "https://local%23host"
    </pre>
    <p>Consuming a URI containing percent-encoded triplets can take many steps. The
    following example shows how to handle an input URI that is not normalized and
    contains multiple percent-encoded triplets.
    First, the input <seetype marker="uri_string#uri_string"><c>uri_string()</c></seetype>
    is to be parsed into a <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>.
    The parsing only splits the URI into its components without doing any decoding:
    </p>
    <pre>
    3> uri_string:parse("http://%6C%6Fcal%23host/%F6re%26bro%20").
    #{host => "%6C%6Fcal%23host",path => "/%F6re%26bro%20",
      scheme => "http"}}
    </pre>
    <p>The input is a valid URI but how can you decode those
    percent-encoded octets? You can try to normalize the input with
    <seemfa marker="uri_string#normalize/1"><c>uri_string:normalize/1</c></seemfa>. The
    normalize operation decodes those
    percent-encoded triplets that correspond to a character in the unreserved set.
    Normalization is a safe, idempotent operation that converts a URI into its
    canonical form:</p>
    <pre>
    4> uri_string:normalize("http://%6C%6Fcal%23host/%F6re%26bro%20").
    "http://local%23host/%F6re%26bro%20"
    5> uri_string:normalize("http://%6C%6Fcal%23host/%F6re%26bro%20", [return_map]).
    #{host => "local%23host",path => "/%F6re%26bro%20",
      scheme => "http"}
    </pre>
    <p>There are still a few percent-encoded triplets left in the output. At this point,
    when the URI is already parsed, it is safe to apply application specific decoding on
    the remaining character triplets. Erlang/OTP provides a function,
    <seemfa marker="uri_string#percent_decode/1"><c>uri_string:percent_decode/1</c></seemfa>
    for raw percent decoding
    that you can use on the host and path components, or on the whole map:
    </p>
    <pre>
    6> uri_string:percent_decode("local%23host").
    "local#host"
    7> uri_string:percent_decode("/%F6re%26bro%20").
    <![CDATA[{error,invalid_utf8,<<"/öre&bro ">>}]]>
    8> uri_string:percent_decode(#{host => "local%23host",path => "/%F6re%26bro%20",
    scheme => "http"}).
    <![CDATA[{error,{invalid,{path,{invalid_utf8,<<"/öre&bro ">>}}}}]]>
    </pre>
    <p>The <c>host</c> was successfully decoded but the path contains at least one
    character with
    non-UTF-8 encoding. In order to be able to decode this, you have to make assumptions
    about the encoding used in these triplets. The most obvious choice is
    <i>latin-1</i>, so you can try
    <seemfa marker="uri_string#transcode/2"><c>uri_string:transcode/2</c></seemfa>, to
    transcode the path to UTF-8 and run the percent-decode operation on the
    transcoded string:
    </p>
    <pre>
    9> uri_string:transcode("/%F6re%26bro%20", [{in_encoding, latin1}]).
    "/%C3%B6re%26bro%20"
    10> uri_string:percent_decode("/%C3%B6re%26bro%20").
    <![CDATA["/öre&bro "]]>
    </pre>
    <p>It is important to emphasize that it is not safe to apply
    <seemfa marker="uri_string#percent_decode/1"><c>uri_string:percent_decode/1</c></seemfa>
    directly on an input URI:
    </p>
    <pre>
    11> uri_string:percent_decode("http://%6C%6Fcal%23host/%C3%B6re%26bro%20").
    <![CDATA["http://local#host/öre&bro "
    12> uri_string:parse("http://local#host/öre&bro ").]]>
    {error,invalid_uri,":"}
    </pre>
    <note><p>Percent-encoding is implemented in
    <seemfa marker="uri_string#recompose/1"><c>uri_string:recompose/1</c></seemfa>
    and it happens when converting a
    <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>
    into a <seetype marker="uri_string#uri_string"><c>uri_string()</c></seetype>.
    Applying any percent-encoding directly on an input URI would not be safe just as in
    the case of
    <seemfa marker="uri_string#percent_decode/1"><c>uri_string:percent_decode/1</c></seemfa>,
    the output could be an invalid URI.
    Quoting functions allow users to perform raw percent encoding and decoding
    on application data which cannot be handled automatically by
    <c>uri_string:recompose/1</c>. For example in scenario when user would
    need to use '/' or sub-delimeter as data rather than delimeter in a path component.
    </p>
    </note>
  </section>

  <section>
    <title>Normalization</title>
    <p>Normalization is the operation of converting the input URI into a <i>canonical</i>
    form and keeping the reference to the same underlying resource. The most common
    application of normalization is determining whether two URIs are equivalent
    without accessing their referenced resources.</p>
    <p>Normalization has 6 distinct steps. First the input URI is parsed into an
    intermediate form that can handle
    <seeguide marker="unicode_usage#what-unicode-is">Unicode</seeguide> characters.
    This datatype is the
    <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>, that can hold the
    components of the URI in map elements of type
    <seetype marker="unicode#chardata"><c>unicode:chardata()</c></seetype>.
    After having the intermediate form, a sequence of
    normalization algorithms are applied to the individual URI components:</p>
    <taglist>
      <tag>Case normalization</tag>
      <item>
	<p>Converts the <c>scheme</c> and <c>host</c> components
	to lower case as they are not case sensitive.</p>
      </item>
      <tag>Percent-encoding normalization</tag>
      <item>
	<p>Decodes percent-encoded triplets that
	correspond to characters in the unreserved set.</p>
      </item>
      <tag>Scheme-based normalization</tag>
      <item>
	<p>Applying rules for the schemes http, https,
	ftp, ssh, sftp and tftp.</p>
      </item>
      <tag>Path segment normalization</tag>
      <item>
	<p>Converts the path into a canonical form.</p>
      </item>
    </taglist>
    <p>After these steps, the intermediate data structure, an
    <seetype marker="uri_string#uri_map"><c>uri_map()</c></seetype>,
    is fully normalized. The last step is applying
    <seemfa marker="uri_string#recompose/1"><c>uri_string:recompose/1</c></seemfa>
    that converts the intermediate structure into a valid canonical URI string.</p>
    <p>Notice the order, the
    <seemfa marker="uri_string#normalize/2"><c>uri_string:normalize(URIMap, [return_map])</c></seemfa> that we
    used many times in this user guide is a shortcut in the normalization process
    returning the intermediate datastructure, and allowing us to inspect and apply
    further decoding on the remaining percent-encoded triplets.</p>
    <pre>
    13> uri_string:normalize("hTTp://LocalHost:80/%c3%B6rebro/a/../b").
    "http://localhost/%C3%B6rebro/b"
    14> uri_string:normalize("hTTp://LocalHost:80/%c3%B6rebro/a/../b", [return_map]).
    #{host => "localhost",path => "/%C3%B6rebro/b",
      scheme => "http"}
    </pre>
  </section>

 <section>
   <title>Special considerations</title>
   <p>The current URI implementation provides support for producing and consuming
   standard URIs. The API is not meant to be directly exposed in a Web
   browser's address bar where users can basically enter free text. Application
   designers shall implement proper heuristics to map the input into a parsable URI.</p>
 </section>

</chapter>