Complete Set of Characters Allowed in URLs: From RFC Specifications to Internationalized Domain Names

Keywords: URL characters | RFC 3986 | percent-encoding | Internationalized Domain Names | IPv6 addresses

Abstract: This article provides an in-depth analysis of the complete set of characters allowed in URLs, based on the RFC 3986 specification. It details unreserved characters, reserved characters, and percent-encoding rules, with code examples for IPv6 addresses, hostnames, and query parameters. The discussion includes support for Internationalized Domain Names (IDN) with Chinese and Arabic characters, comparing outdated RFC 1738 with modern standards to offer a comprehensive guide for developers on URL character encoding.

Basic Concepts of URL Character Encoding

A URL (Uniform Resource Locator) is a standard format for addressing resources on the internet. RFC 3986 sorts the characters that may appear in a URL into unreserved characters, reserved characters (delimiters such as ":", "/", "?", "#", "&", and "="), and everything else, which must be percent-encoded. Unreserved characters can be used anywhere in a URL without encoding: the letters A-Z and a-z, the digits 0-9, and the symbols "-", ".", "_", and "~".
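As a minimal sketch of the unreserved set, the snippet below lists the characters of a string that fall outside it. The helper name non_unreserved is a hypothetical name chosen for illustration; it is not part of any library.

```python
import string

# Unreserved set from RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")

def non_unreserved(text):
    """Return the characters of `text` that are not unreserved (hypothetical helper)."""
    return [ch for ch in text if ch not in UNRESERVED]

print(non_unreserved("report_v1.2~draft"))  # []
print(non_unreserved("a b&c"))              # [' ', '&']
```

A character reported here is not necessarily illegal; reserved characters such as "&" are allowed when they act as delimiters, and only need encoding when they are meant as data.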

Detailed Explanation of RFC 3986 Specification

RFC 3986 is the current authoritative specification of generic URI syntax; it obsoletes RFC 2396 and updates the older RFC 1738. It defines character rules for components such as the host, path, and query. The host component can be an IP-literal (a bracketed IPv6 or IPvFuture address), an IPv4 address, or a reg-name (registered name). A reg-name may contain unreserved characters, percent-encoded characters, and sub-delimiters. Unreserved characters include ALPHA (letters), DIGIT (digits), "-", ".", "_", and "~". Sub-delimiters comprise "!", "$", "&", "'", "(", ")", "*", "+", ",", ";", and "=". These characters can appear in a reg-name without encoding, although DNS hostnames in practice use a much narrower set (letters, digits, and hyphens).
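The reg-name rule can be sketched as a regular expression built from the three character classes named above. This is an illustrative syntax check only, and the names REG_NAME and is_reg_name are assumptions made for this example.

```python
import re

UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# reg-name = *( unreserved / pct-encoded / sub-delims )
REG_NAME = re.compile(rf"(?:[{UNRESERVED}{SUB_DELIMS}]|{PCT_ENCODED})*")

def is_reg_name(host):
    """Check only the RFC 3986 reg-name syntax, not DNS validity (hypothetical helper)."""
    return REG_NAME.fullmatch(host) is not None

print(is_reg_name("example.com"))    # True
print(is_reg_name("ex%41mple.com"))  # True: "%41" is a valid percent-encoded octet
print(is_reg_name("bad host"))       # False: space is not allowed
print(is_reg_name("a%zz"))           # False: "%" must be followed by two hex digits
```

Passing this check does not mean the name resolves or is a valid DNS hostname; it only confirms that every character is permitted by the grammar.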

Character Handling in IPv6 and IPvFuture Addresses

IPv6 addresses are written as groups of hexadecimal digits separated by colons, for example, 2001:0db8:85a3:0000:0000:8a2e:0370:7334. Inside a URL the address is an IP-literal and must be enclosed in square brackets, as in http://[2001:db8::1]/. The following simplified Python example validates the characters and grouping of a bare IPv6 address (without the brackets):

import re

ipv6_pattern = re.compile(
    r'^(([0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}|'  # Standard format
    r'::([0-9a-fA-F]{1,4}:){0,6}[0-9a-fA-F]{1,4}|'  # Compressed format
    r'([0-9a-fA-F]{1,4}:){1,7}:|'  # Other variants
    r'([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|'
    r'([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|'
    r'([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|'
    r'([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|'
    r'([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|'
    r'[0-9a-fA-F]{1,4}:(:[0-9a-fA-F]{1,4}){1,6}|'
    r':((:[0-9a-fA-F]{1,4}){1,7}|:))$'
)

def is_valid_ipv6(address):
    # fullmatch rejects a trailing newline, which match() with "$" would accept
    return bool(ipv6_pattern.fullmatch(address))

# Example usage
print(is_valid_ipv6('2001:0db8:85a3:0000:0000:8a2e:0370:7334'))  # Output: True
print(is_valid_ipv6('::1'))  # Output: True

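This code uses a regular expression to check the IPv6 address format, so that only hexadecimal digits and colons appear and the groups are arranged correctly. As a simplification, it does not accept the IPv4-embedded form such as ::ffff:192.0.2.1. IPvFuture literals use a different syntax: the letter "v", one or more hexadecimal digits, a single dot, and then one or more unreserved characters, sub-delimiters, or colons. Like IPv6 addresses, they must be enclosed in square brackets inside a URL.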
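An alternative to a hand-written pattern, sketched here with the Python standard library only, is to let urllib.parse strip the brackets from a URL and let the ipaddress module validate the address. The URL and the helper name is_ipv6 are made up for illustration.

```python
import ipaddress
from urllib.parse import urlsplit

def is_ipv6(text):
    """Return True if `text` is a valid bare IPv6 address (hypothetical helper)."""
    try:
        ipaddress.IPv6Address(text)
        return True
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False

# In a URL, the IPv6 literal sits inside square brackets
parts = urlsplit("http://[2001:db8::1]:8080/index.html")
print(parts.hostname)           # 2001:db8::1
print(parts.port)               # 8080
print(is_ipv6(parts.hostname))  # True
print(is_ipv6("2001:db8::g"))   # False: "g" is not a hexadecimal digit
```

Unlike the simplified regular expression, ipaddress.IPv6Address also accepts the IPv4-embedded notation.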

Necessity of Percent-Encoding

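Characters that are neither unreserved nor reserved, such as the space, must always be percent-encoded; a space becomes "%20". Reserved characters must also be encoded whenever they appear as data in a component where they would otherwise act as delimiters. For example, RFC 3986 permits a literal "@" in a query string, but "&" and "=" inside a parameter value must be encoded when the query uses them as separators. Encoding every character outside the unreserved set is always safe for a single parameter value, which is what the following Python example does: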

from urllib.parse import quote

# Encode a query parameter
def encode_query_param(param):
    # Unreserved characters are not encoded; others are
    return quote(param, safe='')

# Example
original_param = 'user name@example'
encoded_param = encode_query_param(original_param)
print(encoded_param)  # Output: user%20name%40example

This code uses the quote function, where the safe parameter lists additional characters to leave unencoded. Its default is "/"; passing an empty string means every character outside the unreserved set is encoded, so the value can never be mistaken for a URL delimiter.
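The effect of the safe parameter depends on which component is being encoded. The short sketch below compares a few standard-library calls from urllib.parse; the sample path and values are made up for illustration.

```python
from urllib.parse import quote, quote_plus, unquote

path = "/docs/annual report.pdf"

# Default safe="/" keeps path separators intact
print(quote(path))           # /docs/annual%20report.pdf

# safe="" also encodes "/", which suits a single path segment or parameter value
print(quote(path, safe=""))  # %2Fdocs%2Fannual%20report.pdf

# quote_plus follows HTML form encoding: space becomes "+"
print(quote_plus("a b&c"))   # a+b%26c

# unquote reverses percent-encoding
print(unquote("user%20name%40example"))  # user name@example
```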

Internationalized Domain Names (IDN) and Chinese Character Support

With the globalization of the internet, Internationalized Domain Names (IDN) allow non-ASCII characters such as Chinese or Arabic in hostnames. The RFC 3986 syntax itself is ASCII-only, so an IDN is converted with IDNA, which uses Punycode to map each Unicode label to an ASCII-compatible form prefixed with "xn--". For instance, the Chinese domain "例子.中国" is encoded as "xn--fsqu00a.xn--fiqs8s". In code, IDN libraries can handle these conversions:

import idna

# Encode a Chinese domain to ASCII
def encode_idn(domain):
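    try:
        return idna.encode(domain).decode('ascii')
    except idna.IDNAError as e:
        # Raise instead of returning the message, so error text is never mistaken for a hostname
        raise ValueError(f'Invalid internationalized domain: {domain!r}') from e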

# Example
chinese_domain = '例子.中国'
encoded_domain = encode_idn(chinese_domain)
print(encoded_domain)  # Output: xn--fsqu00a.xn--fiqs8s

This code uses the third-party idna library (installable with pip install idna) to convert a Unicode domain to its Punycode form for use in URLs. Outside the hostname, for example in query parameters, non-ASCII characters are encoded as UTF-8 and each resulting byte is percent-encoded; the Chinese character "中" becomes "%E4%B8%AD".
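Both conversions can also be tried with the Python standard library alone, as in the sketch below. Note the assumption: the built-in "idna" codec implements the older IDNA 2003 rules, so for some characters it can differ from the idna package; for this example domain the ASCII form matches the Punycode value quoted above.

```python
from urllib.parse import quote, unquote

# Percent-encoding: UTF-8 bytes of the character, one %XX per byte
print(quote("中"))           # %E4%B8%AD
print(unquote("%E4%B8%AD"))  # 中

# Built-in "idna" codec (IDNA 2003): Unicode labels to "xn--" labels and back
ascii_host = "例子.中国".encode("idna").decode("ascii")
print(ascii_host)            # xn--fsqu00a.xn--fiqs8s
print(ascii_host.encode("ascii").decode("idna"))  # 例子.中国
```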

Comparison with Obsolete RFC 1738

RFC 1738 was an early URL standard now considered obsolete. It allowed alphanumerics and the special characters "$-_.+!*'(),", but its scope was limited and did not cover modern needs such as IPv6 literals or IDN. RFC 3986 revises the character classes (for example, "~" is unreserved in RFC 3986 but was listed as unsafe in RFC 1738) and clarifies the encoding rules for each component. Developers should follow RFC 3986 to avoid compatibility issues.

Practical Application Recommendations

In web development, correctly using URL characters is crucial. For GET request query parameters, it is advisable to use only unreserved characters or to percent-encode everything else, so that data is never confused with delimiters. When constructing URLs, use library functions that encode each component automatically rather than concatenating strings by hand. This reduces errors and improves code maintainability. In summary, understanding URL character rules aids in building robust, internationalized web applications.
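A minimal sketch of that recommendation, using urlencode from the Python standard library, is shown here. The base URL and parameter names are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base = "https://example.com/search"  # hypothetical endpoint
params = {"q": "url encoding & idn", "lang": "中文", "page": 2}

# urlencode percent-encodes each key and value (spaces become "+")
url = f"{base}?{urlencode(params)}"
print(url)
# https://example.com/search?q=url+encoding+%26+idn&lang=%E4%B8%AD%E6%96%87&page=2

# Decoding recovers the original values as lists of strings
print(parse_qs(urlsplit(url).query))
# {'q': ['url encoding & idn'], 'lang': ['中文'], 'page': ['2']}
```

Because the "&" inside the search term is encoded as "%26", it cannot be mistaken for the separator between parameters.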

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.