Keywords: Email Address | RFC 5322 | Character Validation
Abstract: This article provides an in-depth analysis of the allowed characters in the local-part and domain parts of email addresses, based on core standards such as RFC 5322 and RFC 5321, combined with internationalization and practical application scenarios. It covers ASCII character specifications, special character restrictions, internationalization extensions, and practical validation considerations, with code examples and detailed explanations to help developers correctly understand and implement email address validation.
Basic Structure of Email Addresses
Email addresses follow the standard format local-part@domain, where the character rules for the local-part and the domain are defined by several RFCs. The primary references are RFC 5322 (Internet Message Format), which defines address syntax in message headers, and RFC 5321 (Simple Mail Transfer Protocol), which defines the addresses used during SMTP delivery; together they provide interoperability between mail systems. The local-part typically identifies the mailbox, while the domain identifies the mail system responsible for it. The two parts follow distinct character rules and are analyzed separately below.
Allowed Characters in the Local-Part
In its unquoted form, the local-part can use the following ASCII characters: uppercase and lowercase Latin letters (A-Z, a-z), digits (0-9), and the special characters !#$%&'*+-/=?^_`{|}~. The dot . is also allowed, but it cannot be the first or last character and cannot appear twice in a row (e.g., John..Doe@example.com is invalid). In a quoted string, the local-part can additionally include spaces, horizontal tabs, and the characters "(),:;<>@[\], with each backslash or double quote preceded by a backslash. RFC 5322 also permits comments in parentheses in message headers, e.g., john.smith(comment)@example.com is equivalent to john.smith@example.com; the SMTP mailbox syntax in RFC 5321 has no comments, so they should not be relied on in envelope addresses. The maximum length of the local-part is 64 octets, and validators should enforce this limit.
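The unquoted rules above can be checked one at a time, which makes it easier to tell a user why an address was rejected. The following is a minimal sketch, not a complete RFC implementation; the names ATEXT and dot_atom_problems are hypothetical and chosen only for illustration.

```python
import string

# Letters, digits, and the special characters listed above (RFC 5322 "atext").
ATEXT = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~")

def dot_atom_problems(local_part: str) -> list[str]:
    """Return the reasons an unquoted local-part is invalid (empty list if none)."""
    if not local_part:
        return ["empty local-part"]
    problems = []
    # The limit is in octets, so measure the encoded length.
    if len(local_part.encode("utf-8")) > 64:
        problems.append("longer than 64 octets")
    if local_part.startswith(".") or local_part.endswith("."):
        problems.append("dot at start or end")
    if ".." in local_part:
        problems.append("consecutive dots")
    # Anything outside atext (other than the dot) would need a quoted string.
    bad = sorted({ch for ch in local_part if ch not in ATEXT and ch != "."})
    if bad:
        problems.append("characters needing quotes: " + "".join(bad))
    return problems
```

Calling dot_atom_problems("John..Doe") reports the consecutive dots, while a valid name such as "john.smith" yields an empty list.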
Allowed Characters in the Domain Part
The domain part must conform to hostname rules, following the LDH (letters, digits, hyphen) specification: uppercase and lowercase Latin letters (A-Z, a-z, compared case-insensitively), digits (0-9), and hyphens -, which cannot be the first or last character of a label. The domain consists of one or more dot-separated labels, each up to 63 characters long. Alternatively, the domain can be an IP address literal in square brackets, such as jsmith@[192.168.1.2], though this is rarely used in practice. Internationalized Domain Names (IDNs) support non-ASCII characters, but each non-ASCII label is converted with Punycode to an ASCII-compatible form (prefixed with xn--) for transmission.
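The two special cases above, address literals and IDNs, can be handled with the Python standard library. The sketch below is illustrative only: normalize_domain is a hypothetical helper, the "IPv6:" tag follows the RFC 5321 address-literal syntax, and Python's built-in idna codec implements the older IDNA 2003 rules, so a production system may prefer a dedicated IDNA library.

```python
import ipaddress

def normalize_domain(domain: str) -> str:
    """Return an ASCII form of the domain, or raise ValueError if it is malformed."""
    if domain.startswith("[") and domain.endswith("]"):
        literal = domain[1:-1]
        # RFC 5321 tags IPv6 address literals with an "IPv6:" prefix.
        if literal.lower().startswith("ipv6:"):
            ipaddress.IPv6Address(literal[5:])
        else:
            ipaddress.IPv4Address(literal)
        return domain
    # Non-ASCII labels become Punycode "xn--" labels; the codec raises
    # UnicodeError (a ValueError subclass) for empty or over-long labels.
    return domain.encode("idna").decode("ascii").lower()
```

For example, normalize_domain("bücher.example") returns "xn--bcher-kva.example", and an address literal such as "[192.168.1.2]" is returned unchanged once the address inside the brackets parses.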
Special Characters and Quoting Rules
Certain characters, such as spaces, @, and parentheses, are only allowed in the local-part within quoted strings. For example, the address "John Doe"@example.com is valid, whereas John Doe@example.com is not. In quoted strings, backslashes are used to escape special characters, e.g., "very\"unusual"@example.com. Developers should note that many mail systems restrict special characters in practice to avoid compatibility issues.
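Because a quoted local-part may contain @, splitting an address on every @ can misread it. A small sketch of one way to handle this is shown here, assuming the domain itself never contains @; split_address and unquote_local_part are hypothetical helper names, and unquote_local_part assumes its input has already passed syntax validation.

```python
def split_address(address: str) -> tuple[str, str]:
    """Split at the last '@', so an '@' inside a quoted local-part survives."""
    local_part, sep, domain = address.rpartition("@")
    if not sep or not local_part or not domain:
        raise ValueError("expected local-part@domain")
    return local_part, domain

def unquote_local_part(local_part: str) -> str:
    """Strip the surrounding quotes and resolve backslash escapes."""
    is_quoted = (
        len(local_part) >= 2 and local_part[0] == '"' and local_part[-1] == '"'
    )
    if not is_quoted:
        return local_part
    out = []
    chars = iter(local_part[1:-1])
    for ch in chars:
        if ch == "\\":
            ch = next(chars, "")  # keep the escaped character literally
        out.append(ch)
    return "".join(out)
```

With these helpers, the address "a@b"@example.com splits into the quoted local-part "a@b" and the domain example.com, and "very\"unusual" unquotes to very"unusual.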
Internationalization Extensions (EAI)
RFC 6531 and RFC 6532 support email address internationalization, allowing non-ASCII characters (e.g., emojis, non-Latin scripts) encoded in UTF-8 in both the local-part and domain. For instance, the address 我買@屋企.香港 is valid in systems supporting the SMTPUTF8 extension. However, these RFCs hold Proposed Standard status and are not universally implemented; practical applications should handle internationalized addresses cautiously and consider fallback mechanisms.
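Before sending to an internationalized address, it helps to know which parts are non-ASCII, since a non-ASCII domain can usually be converted to its Punycode form while a non-ASCII local-part has no ASCII equivalent. The following sketch only inspects the string; describe_address is a hypothetical name, and the function does not negotiate SMTPUTF8 or validate syntax.

```python
def describe_address(address: str) -> dict:
    """Report which parts are ASCII and the local-part length in UTF-8 octets."""
    local_part, _, domain = address.rpartition("@")
    return {
        "local_is_ascii": local_part.isascii(),
        "domain_is_ascii": domain.isascii(),
        # Length limits are expressed in octets, not characters.
        "local_octets": len(local_part.encode("utf-8")),
    }
```

For 我買@屋企.香港 the local-part is two characters but six octets in UTF-8, which matters when enforcing the 64-octet limit.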
Practical Validation and Code Examples
Email address validation must combine character rules and length constraints. The following Python code example demonstrates basic validation logic based on RFCs, focusing on allowed characters in the local-part and domain:
import re

def validate_email_local_part(local_part):
    # Check local-part length (64 octets; equal to the character count for ASCII input)
    if len(local_part) > 64:
        return False
    # Unquoted form: allowed characters, with dots neither consecutive nor at the ends.
    # fullmatch is used so that a trailing newline is not silently accepted.
    pattern = r'[a-zA-Z0-9!#$%&\'*+/=?^_`{|}~-]+(\.[a-zA-Z0-9!#$%&\'*+/=?^_`{|}~-]+)*'
    if not re.fullmatch(pattern, local_part):
        # Fall back to the quoted-string form
        quoted_pattern = r'"([^"\\]|\\.)*"'
        if not re.fullmatch(quoted_pattern, local_part):
            return False
    return True

def validate_email_domain(domain):
    # Check overall domain length, then each dot-separated label
    if len(domain) > 255:
        return False
    labels = domain.split('.')
    for label in labels:
        # Labels are 1-63 LDH characters and may not start or end with a hyphen
        if len(label) > 63 or not re.fullmatch(r'[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?', label):
            return False
    return True

def validate_email(email):
    # Split at the last '@' so that a quoted local-part may itself contain '@'
    local_part, sep, domain = email.rpartition('@')
    if not sep:
        return False
    return validate_email_local_part(local_part) and validate_email_domain(domain)

# Test examples
print(validate_email("user.name@example.com"))   # True
print(validate_email("user..name@example.com"))  # False
print(validate_email("user@sub.example.com"))    # True

This code covers basic character validation only; it does not handle IP address literals or internationalized addresses. Real-world systems should integrate more comprehensive checks, such as internationalization support and server-side validation.
Common Pitfalls and Best Practices
Many systems incorrectly reject valid addresses, such as those containing the + character (used for sub-addressing). It is advisable to adhere to RFC standards and avoid validation based on outdated knowledge. Best practices include using standard libraries for validation, considering internationalization needs, and testing edge cases (e.g., long addresses, special characters). Additionally, mail systems may impose further restrictions on the local-part, so developers should consult relevant documentation.
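As an illustration of leaning on the standard library, Python's email.utils.parseaddr extracts the bare address from a header-style string and leaves a + in the local-part untouched. The helpers below are a hypothetical sketch: parseaddr is a parser rather than a validator, and treating the text after + as a tag is a provider convention, not something the RFCs guarantee.

```python
from email.utils import parseaddr

def extract_address(header_value: str) -> str:
    """Pull the bare address out of a header-style string such as 'Name <addr>'."""
    _display_name, address = parseaddr(header_value)
    return address

def split_subaddress(local_part: str) -> tuple[str, str]:
    """Split 'user+tag' into ('user', 'tag'); the tag is '' when no '+' is present."""
    user, _, tag = local_part.partition("+")
    return user, tag
```

For instance, extract_address("Jane Doe <jane+news@example.com>") yields jane+news@example.com, and split_subaddress("jane+news") separates the mailbox name from the tag without rejecting the address.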
Conclusion
The character rules for email addresses are complex but well-defined, based on RFC standards to ensure global compatibility. The local-part supports a wide range of characters with quoting and length constraints, while the domain part strictly follows hostname rules. Internationalization extensions are gradually being adopted, but compatibility must be considered. Developers should implement validation logic according to the latest standards, avoid common errors, and support diverse email address formats.