Keywords: Pandas | Numeric Conversion | DataFrame | to_numeric | errors Parameter
Abstract: This article explores the replacement for the deprecated convert_objects(convert_numeric=True) function in Pandas 0.17.0, using df.apply(pd.to_numeric) with the errors parameter to handle non-numeric columns in a DataFrame. Through code examples and step-by-step explanations, it demonstrates how to perform numeric conversion while preserving non-numeric columns, providing an elegant method to replicate the functionality of the deprecated function.
Background and Problem Description
In Pandas 0.17.0 and later versions, the convert_objects(convert_numeric=True) method is deprecated, so a different approach is needed to convert every convertible value in a DataFrame to a numeric type. Its replacement, pd.to_numeric, works on one-dimensional input such as a single Series rather than on a whole DataFrame, so it has to be applied column by column. This article shows how df.apply(pd.to_numeric) achieves this and how to handle DataFrames that contain non-numeric columns.
Basic Method: Using df.apply(pd.to_numeric)
When all columns in a DataFrame can be converted to numeric types, df.apply(pd.to_numeric) can be used directly. Below is an example demonstrating how to convert string numbers and floats to their respective numeric types.
>>> import pandas as pd
>>> df = pd.DataFrame({'a': ['1', '2'],
...                    'b': ['45.8', '73.9'],
...                    'c': [10.5, 3.7]})
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 3 columns):
a 2 non-null object
b 2 non-null object
c 2 non-null float64
dtypes: float64(1), object(2)
memory usage: 64.0+ bytes
>>> df.apply(pd.to_numeric).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 3 columns):
a 2 non-null int64
b 2 non-null float64
c 2 non-null float64
dtypes: float64(2), int64(1)
memory usage: 64.0 bytes
In this example, columns 'a' and 'b' are converted from object to int64 and float64, respectively, while column 'c' remains float64. This method works when every column is convertible, but pd.to_numeric raises an exception by default, so the call fails as soon as the DataFrame contains a column that cannot be parsed as numeric.
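The failure mode described above can be reproduced with a short script. This is a minimal sketch, assuming a DataFrame with one numeric-looking column and one text column; the variable name raised is introduced here only for illustration.

```python
import pandas as pd

# One column of numeric strings and one column of plain text.
df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})

try:
    # errors='raise' is the default, so the 'Words' column aborts the call.
    df.apply(pd.to_numeric)
    raised = False
except ValueError as exc:
    raised = True
    print("Conversion failed:", exc)
```

Because the exception propagates out of apply, no column is converted at all, including the ones that could have been parsed.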
Handling Non-Numeric Columns: Using the errors Parameter
When a DataFrame contains columns that cannot be converted to numeric types, the errors parameter of pd.to_numeric must be used. This parameter has three options: 'raise' (default, raises an exception), 'coerce' (sets invalid values to NaN), and 'ignore' (returns the original input). In this context, using 'ignore' allows non-numeric columns to remain unchanged. Below is an example demonstrating how to perform numeric conversion in a DataFrame with string columns.
>>> df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})
>>> df.apply(pd.to_numeric, errors='ignore').info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
Words 2 non-null object
ints 2 non-null int64
dtypes: int64(1), object(1)
memory usage: 48.0+ bytes
In this example, errors='ignore' makes pd.to_numeric return a column unchanged when it cannot be converted, so the 'ints' column becomes int64 while the 'Words' column stays object. This gives a compact way to handle mixed-type DataFrames.
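For comparison, the other two errors options can be tried on a single Series. This is a small sketch on made-up data (the names mixed, coerced, and strict_failed are illustrative, not from the original example).

```python
import pandas as pd

# A Series in which the middle value cannot be parsed as a number.
mixed = pd.Series(['1', 'x', '3.5'])

# 'coerce' replaces unparseable values with NaN, giving a float column.
coerced = pd.to_numeric(mixed, errors='coerce')
print(coerced)

# 'raise' (the default) stops at the first unparseable value.
try:
    pd.to_numeric(mixed)
    strict_failed = False
except ValueError:
    strict_failed = True
```

Applied across a DataFrame, 'coerce' would turn a text column such as 'Words' entirely into NaN, which is why it does not preserve non-numeric columns the way 'ignore' does.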
Alternative Solutions Comparison
In addition to directly passing the errors='ignore' parameter, functools.partial can be used to create a partially applied function. This approach is more verbose but may be useful in certain scenarios. Below is an example using partial.
>>> from functools import partial
>>> df = pd.DataFrame({'ints': ['3', '5'],
...                    'Words': ['Kobe', 'Bryant']})
>>> df.apply(partial(pd.to_numeric, errors='ignore')).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
Words 2 non-null object
ints 2 non-null int64
dtypes: int64(1), object(1)
memory usage: 48.0+ bytes
This method yields the same result as passing the parameter directly. It is better suited to cases where a function with fixed parameters is reused in several places; otherwise, passing errors='ignore' directly is more concise and readable.
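A third option is to catch the exception explicitly instead of relying on errors='ignore'. This may matter on newer releases, since pandas 2.2 deprecated errors='ignore' in pd.to_numeric in favor of explicit exception handling. The sketch below is one way to do that; to_numeric_or_keep is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

def to_numeric_or_keep(column):
    """Hypothetical helper: return the column as numeric, or unchanged if parsing fails."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        # The column holds values that cannot be parsed, so leave it as it is.
        return column

df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})
converted = df.apply(to_numeric_or_keep)
print(converted.dtypes)
```

As with errors='ignore', the decision is made per column: a column is either converted as a whole or returned untouched.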
Conclusion and Recommendations
In Pandas 0.17.0, the deprecated convert_objects(convert_numeric=True) function can be replaced with df.apply(pd.to_numeric, errors='ignore'). This call converts every column that can be parsed as numeric and leaves the remaining columns unchanged, so mixed-type DataFrames are handled without raising exceptions. It is the recommended replacement in code that previously relied on convert_objects.