Natural Sorting Algorithm: Correctly Sorting Strings with Numbers in Python

Keywords: Python | natural sorting | regex

Abstract: This article delves into the method of natural sorting (human sorting) for strings containing numbers in Python. By analyzing the core mechanisms of regex splitting and type conversion, it explains in detail how to achieve sorting by numerical value rather than lexicographical order. Complete code implementations for integers and floats are provided, along with discussions on performance optimization and practical applications.

Problem Background and Challenges

In programming practice, sorting strings that contain numbers is a common yet often overlooked issue. When using Python's built-in sort() method, strings are sorted lexicographically, which can lead to counterintuitive results. For example, given a list of strings ["something1", "something12", "something17", "something2", "something25", "something29"], calling sort() directly yields ["something1", "something12", "something17", "something2", "something25", "something29"], where "something12" comes before "something2" because the character '1' is lexicographically smaller than '2', ignoring that the number 12 is greater than 2.

Core Principles of Natural Sorting

Natural sorting, also known as human sorting, aims to mimic how humans sort mixed text and numbers. The core idea is to split the string into text segments and numeric segments, converting numbers to numerical types for comparison rather than treating them as strings. This ensures numbers are sorted by their numerical value, while text parts are still handled lexicographically.

Key steps in implementing natural sorting include:

Splitting the String: Use regular expressions to split the string into an alternating sequence of text and numbers. For example, "something12" is split into ["something", "12"].
Type Conversion: Convert numeric segments to integers or floats for numerical comparison.
Custom Sorting Key: Utilize the key parameter of Python's sort() method to return a list as the sorting basis.

Implementation for Integer Sorting

Below is an efficient implementation for integers, based on Toothy's optimized version:

import re

def atoi(text):
    return int(text) if text.isdigit() else text

def natural_keys(text):
    return [atoi(c) for c in re.split(r'(\d+)', text)]

alist = [
    "something1",
    "something12",
    "something17",
    "something2",
    "something25",
    "something29"
]
alist.sort(key=natural_keys)
print(alist)  # Output: ['something1', 'something2', 'something12', 'something17', 'something25', 'something29']

Code Analysis:

re.split(r'(\d+)', text) uses the regex (\d+) to match one or more digits, keeping the matched numbers in the result list as split points. For example, re.split(r'(\d+)', "something12") returns ["something", "12", ""] (note the trailing empty string).
The atoi function checks if the text is numeric: if yes, it converts to an integer; otherwise, it returns the original text. This ensures numbers are treated as numerical values during sorting.
The natural_keys function returns a list where each element is the converted text or number. Python's list comparison is element-wise, so numeric parts are compared by value.

Extension for Floating-Point Sorting

For strings containing floating-point numbers, adjust the regex to match float characters and modify the type conversion function accordingly:

import re

def atof(text):
    try:
        retval = float(text)
    except ValueError:
        retval = text
    return retval

def natural_keys(text):
    return [atof(c) for c in re.split(r'[+-]?([0-9]+(?:[.][0-9]*)?|[.][0-9]+)', text)]

alist = [
    "something1",
    "something2",
    "something1.0",
    "something1.25",
    "something1.105"
]
alist.sort(key=natural_keys)
print(alist)  # Output: ['something1', 'something1.0', 'something1.105', 'something1.25', 'something2']

Key Improvements:

The regex [+-]?([0-9]+(?:[.][0-9]*)?|[.][0-9]+) matches optional signs, integer parts with optional decimal parts, or pure decimals (e.g., ".5"). This covers common float formats.
The atof function uses a try-except block to attempt converting text to a float, returning the original text on failure, enhancing robustness.

Performance Analysis and Optimization

The performance of natural sorting primarily depends on the overhead of regex splitting and type conversion. Toothy's implementation significantly improves speed by simplifying logic and using isdigit() checks compared to the original version. In practical applications, for large datasets, consider precompiling regex patterns or using caching mechanisms for further optimization.

For example, precompiling regex patterns:

import re
int_pattern = re.compile(r'(\d+)')
float_pattern = re.compile(r'[+-]?([0-9]+(?:[.][0-9]*)?|[.][0-9]+)')
# Use pattern.split(text) in functions

Application Scenarios and Considerations

Natural sorting is widely used in file naming (e.g., "file1.txt", "file10.txt"), version number sorting (e.g., "v1.2", "v1.10"), and data cleaning scenarios. When implementing, note:

Ensure the regex matches all possible numeric formats, including negatives, scientific notation, etc., adjusting as needed.
Handle edge cases like empty strings or pure numeric strings.
Consider localization issues, such as different decimal separators in various regions.

Through this in-depth analysis, readers should master the core techniques of natural sorting and apply them flexibly in real-world projects to improve data processing quality and user experience.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.