URL Encoding Internals: RFC 3986, Unicode, and the Base64 Connection
A deep dive into percent-encoding. Understand the history of URI characters, how UTF-8 bytes are encoded, why encodeURIComponent differs from encodeURI, and how XSS attacks exploit decoding layers.
The Uniform Resource Locator (URL) was designed in the early 1990s around a subset of ASCII. To ensure safe transit across disparate proxies, gateways, and other intermediaries, URLs are limited to a small pool of safe characters.
When developers need to pass arbitrary data—such as spaces, ampersands, or Unicode emojis—through a URL query parameter, that data must be serialized into the safe ASCII character set. This mechanism is known formally as percent-encoding.
RFC 3986: Reserved vs. Unreserved
According to the definitive standard RFC 3986, characters are divided into two primary buckets:
- Unreserved Characters: Alphanumerics (A-Z, a-z, 0-9), hyphens (-), periods (.), underscores (_), and tildes (~). These are always safe and never need encoding.
- Reserved Characters: Characters that possess semantic meaning within the URL structure itself. For example, ? denotes the query string, & separates parameters, and / denotes path boundaries.
If you intend to send the literal character & as data, you must encode it to %26 so the server does not misinterpret it as a structural delimiter.
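As a rough illustration of the two buckets, the sketch below checks a character against the unreserved set and encodes a value containing a literal ampersand. The helper name isUnreserved and the sample value are hypothetical, chosen only for this example.

```javascript
// Unreserved set from RFC 3986: letters, digits, "-", ".", "_", "~".
const UNRESERVED = /^[A-Za-z0-9\-._~]$/;

// Hypothetical helper: true if a single character never needs percent-encoding.
function isUnreserved(ch) {
  return UNRESERVED.test(ch);
}

// A parameter value that contains a literal ampersand as data.
const title = "Tom & Jerry";
const query = "title=" + encodeURIComponent(title);

console.log(isUnreserved("~")); // true
console.log(isUnreserved("&")); // false
console.log(query);             // title=Tom%20%26%20Jerry
```

Because the ampersand arrives as %26, a server splitting the query string on & still sees a single title parameter.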
The Math of Percent-Encoding
The encoding process is mathematically straightforward. The system takes the byte value of the character, converts it to a two-digit hexadecimal representation, and prefixes it with a percent sign (%).
For example, the ASCII value for a Space is 32 (decimal). In hexadecimal, 32 is 20. Thus, a space becomes %20. The ampersand (&) is decimal 38, hex 26, becoming %26.
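The arithmetic above can be sketched in a few lines. The function name percentEncodeAscii is hypothetical, and the sketch deliberately handles single ASCII characters only.

```javascript
// Hypothetical helper: percent-encode one ASCII character by hand.
function percentEncodeAscii(ch) {
  const code = ch.charCodeAt(0); // " " -> 32, "&" -> 38
  if (code > 0x7f) {
    throw new RangeError("This sketch handles ASCII characters only");
  }
  // Convert to hex and pad to two uppercase digits: 32 -> "20".
  const hex = code.toString(16).toUpperCase().padStart(2, "0");
  return "%" + hex;
}

console.log(percentEncodeAscii(" ")); // %20
console.log(percentEncodeAscii("&")); // %26
```

The padStart call matters for byte values below 16, which would otherwise produce a single hex digit.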
The UTF-8 Complexity
ASCII only defines 128 characters. What happens when you need to send the Japanese character "本" or a fire emoji "🔥"? Modern percent-encoding dictates that the string must first be converted into a UTF-8 byte array, and then each byte is percent-encoded sequentially.
The fire emoji "🔥" (U+1F525) requires 4 bytes in UTF-8: F0 9F 94 A5. Therefore, the properly URL-encoded representation is %F0%9F%94%A5. Server-side frameworks like Express or Spring Boot automatically decode this byte array back into the native string representation during request parsing.
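To make the two-step process concrete, here is a minimal sketch that converts a string to UTF-8 bytes with the standard TextEncoder API and percent-encodes each byte. The helper name percentEncodeUtf8 is hypothetical. Unlike encodeURIComponent, it encodes every byte, including unreserved characters, to keep the logic visible.

```javascript
// Hypothetical helper: UTF-8 encode the string, then percent-encode every byte.
function percentEncodeUtf8(str) {
  const bytes = new TextEncoder().encode(str); // Uint8Array of UTF-8 bytes
  let out = "";
  for (const b of bytes) {
    out += "%" + b.toString(16).toUpperCase().padStart(2, "0");
  }
  return out;
}

console.log(percentEncodeUtf8("🔥"));              // %F0%9F%94%A5
console.log(percentEncodeUtf8("本"));              // %E6%9C%AC
console.log(encodeURIComponent("🔥"));             // %F0%9F%94%A5
console.log(decodeURIComponent("%F0%9F%94%A5"));   // 🔥
```

For non-ASCII input the hand-rolled result matches the built-in encodeURIComponent, which performs the same UTF-8 conversion internally.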
encodeURI vs. encodeURIComponent
In JavaScript, a widespread source of bugs is the confusion between the two native encoding APIs:
- encodeURI(): Used to encode a complete URL. It assumes structural characters like /, ?, and = are intentional and leaves them unencoded.
- encodeURIComponent(): Used to encode a single parameter value. It aggressively encodes almost everything, including / and ?.
If you use encodeURI on a query parameter value that contains an ampersand, the ampersand is left unencoded. The server then treats it as a parameter separator, which breaks the URL structure and truncates the value.
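The difference is easiest to see side by side. In this sketch the base URL, the parameter names, and the sample value are made up for illustration. The standard URL class stands in for a server-side parser.

```javascript
const base = "https://example.com/search"; // hypothetical endpoint
const genre = "rock&roll";                 // value containing a literal ampersand

// encodeURI leaves "&" untouched, so the value is split in two.
const broken = encodeURI(base + "?genre=" + genre + "&page=2");

// encodeURIComponent on the value alone turns "&" into %26.
const correct = base + "?genre=" + encodeURIComponent(genre) + "&page=2";

console.log(broken);  // https://example.com/search?genre=rock&roll&page=2
console.log(correct); // https://example.com/search?genre=rock%26roll&page=2

// Parse both the way a receiver would.
console.log(new URL(broken).searchParams.get("genre"));  // rock
console.log(new URL(correct).searchParams.get("genre")); // rock&roll
```

A reasonable rule of thumb is to call encodeURIComponent on each value and assemble the delimiters yourself, or to let URLSearchParams build the query string.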
The + vs. %20 Space Debate
Why do spaces sometimes encode as %20 and sometimes as +? It depends on the context of the HTTP request.
In standard URL paths (e.g., /my%20folder/file), spaces must be %20. However, when an HTML form is submitted using the application/x-www-form-urlencoded content type, the form-encoding rules inherited from early HTML dictate that spaces are encoded as the plus sign (+), which in turn means a literal plus must be sent as %2B. Form decoders also accept %20 as a space, whereas + means a space only in form-encoded data, so many APIs use %20 everywhere to avoid ambiguity.
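The two conventions can be compared directly using built-in APIs. In this sketch the parameter name q and the sample text are arbitrary. URLSearchParams follows the form-encoding rules, while encodeURIComponent and decodeURIComponent follow plain percent-encoding.

```javascript
const text = "a b+c"; // contains both a space and a literal plus

// Plain percent-encoding: space -> %20, plus -> %2B.
const percentStyle = encodeURIComponent(text);

// Form encoding: space -> +, plus -> %2B.
const formStyle = new URLSearchParams({ q: text }).toString();

console.log(percentStyle); // a%20b%2Bc
console.log(formStyle);    // q=a+b%2Bc

// Decoding: only the form decoder treats "+" as a space.
console.log(decodeURIComponent("a+b"));                    // a+b
console.log(new URLSearchParams("q=a+b").get("q"));        // a b
console.log(new URLSearchParams("q=a%20b").get("q"));      // a b
```

The last three lines show why mixing decoders is risky: a value encoded with + and decoded with decodeURIComponent keeps its plus signs.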
Double Decoding and XSS Attacks
Encoding is intimately tied to web security. A common Cross-Site Scripting (XSS) vector involves passing a malicious payload like <script> through the URL. Web Application Firewalls (WAFs) typically block the literal string, along with its singly encoded form %3Cscript%3E.
However, attackers can bypass basic WAFs using double-encoding. In the payload %253Cscript%253E, %25 is the encoding of the percent sign itself, so one decoding pass yields only %3Cscript%3E and the firewall sees no angle brackets. If the backend application naïvely runs a decode function twice (e.g., the framework decodes it once, and then the developer manually calls decode again), the original <script> is restored, and it executes if the value is reflected into the page unescaped. Never decode data more than once per layer.
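The decoding layers can be traced step by step. The looksMalicious function below is a deliberately simplistic stand-in for a filter, not a model of how any real WAF works.

```javascript
// Hypothetical, overly simple filter: flags a literal "<script" only.
function looksMalicious(s) {
  return /<script/i.test(s);
}

const payload = "%253Cscript%253E";

// First decode (e.g. done by the framework): %25 -> "%".
const once = decodeURIComponent(payload);

// Second decode (e.g. a redundant manual call): %3C -> "<", %3E -> ">".
const twice = decodeURIComponent(once);

console.log(once);                  // %3Cscript%3E
console.log(looksMalicious(once));  // false
console.log(twice);                 // <script>
console.log(looksMalicious(twice)); // true
```

A filter that inspects the value after one decode sees nothing suspicious, so validation has to run on the same form of the data that is eventually used.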
Debugging Encoded URLs
When troubleshooting OAuth callback URIs or complex deep links, visually parsing deeply nested percent-encoding is incredibly error-prone. Our URL Encoder / Decoder tool safely handles the conversion process.
It correctly parses multi-byte UTF-8 sequences and ensures that query parameters containing JSON objects or Base64 tokens are serialized flawlessly before making HTTP requests.
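Base64 tokens are a common case where this matters, because the standard Base64 alphabet includes +, /, and =, all of which have meaning in a query string. The sketch below uses a made-up two-byte token (the bytes FB FF, which encode to +/8=) and a hypothetical parameter named token.

```javascript
// Base64 of the two bytes 0xFB 0xFF; contains "+", "/" and "=".
const token = "+/8=";

// Unencoded: a form-style parser turns the "+" into a space.
const raw = new URLSearchParams("token=" + token).get("token");

// Encoded: every special character survives the round trip.
const encoded = encodeURIComponent(token);
const roundTrip = new URLSearchParams("token=" + encoded).get("token");

console.log(JSON.stringify(raw)); // " /8="
console.log(encoded);             // %2B%2F8%3D
console.log(roundTrip);           // +/8=
```

An alternative some systems use is the URL-safe Base64 alphabet, which swaps + and / for - and _, though the token producer and consumer both have to agree on it.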
Karuvigal Team
Building developer tools that save time and improve productivity.