Flattening Multilevel Nested JSON: From pandas json_normalize to Custom Recursive Functions

Keywords: JSON flattening | Python | pandas | recursive function | data conversion

Abstract: This paper examines methods for flattening multilevel nested JSON data in Python. It covers the limitations of the pandas library's json_normalize function and walks through custom recursive functions drawn from high-scoring Stack Overflow answers. By comparing these solutions, it helps readers choose an appropriate way to convert complex JSON structures into a flat format suitable for CSV output and further data analysis.

In data science and engineering, JSON is widely used as a lightweight data interchange format in API responses, configuration files, and data storage. However, when JSON structures contain multilevel nested dictionaries and lists, directly converting them into tabular forms (e.g., CSV) for data analysis becomes complex. Based on a typical question from Stack Overflow, this paper explores how to effectively flatten multilevel nested JSON, with particular attention to the pandas library's json_normalize() function and its alternatives.

Problem Background and Challenges

A user ran into this problem while processing virtual machine data returned by the CloudStack API. Each virtual machine record nests lists of objects, such as the securitygroup and nic fields. By default, json_normalize() expands nested dictionaries into dotted column names but leaves lists untouched, so each list ends up in the CSV as a single JSON-like cell rather than as separate columns. This fails the requirement of one fully flat row per virtual machine. An example JSON structure is as follows:

{
    "count": 13,
    "virtualmachine": [
        {
            "id": "1082e2ed-ff66-40b1-a41b-26061afd4a0b",
            "name": "test-2",
            "displayname": "test-2",
            "securitygroup": [
                {
                    "id": "9e649fbc-3e64-4395-9629-5e1215b34e58",
                    "name": "test",
                    "tags": []
                }
            ],
            "nic": [
                {
                    "id": "79568b14-b377-4d4f-b024-87dc22492b8e",
                    "networkid": "05c0e278-7ab4-4a6d-aa9c-3158620b6471"
                },
                {
                    "id": "3d7f2818-1f19-46e7-aa98-956526c5b1ad",
                    "networkid": "b4648cfd-0795-43fc-9e50-6ee9ddefc5bd",
                    "traffictype": "Guest"
                }
            ],
            "hypervisor": "KVM",
            "affinitygroup": [],
            "isdynamicallyscalable": false
        }
    ]
}

The code using json_normalize() is shown below; it leaves list-valued fields such as securitygroup and nic unexpanded:

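import pandas as pd

# virtual_machines is the parsed API response (a dict), e.g. the result of json.load().
# Since pandas 1.0, json_normalize is exposed as pd.json_normalize;
# importing it from pandas.io.json is deprecated.
test = pd.json_normalize(virtual_machines["virtualmachine"])
test.to_csv("test.csv", sep="|", index=False)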
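
To make the limitation concrete, here is a minimal sketch using a small hypothetical record. The details field and the shortened identifiers are not part of the CloudStack response above; they are assumptions added only to show how nested dictionaries and lists are treated differently. The sketch assumes pandas 1.0 or later.

```python
import pandas as pd

# Hypothetical record: one nested dict ("details") and one list of dicts ("nic").
records = [
    {
        "id": "vm-1",
        "details": {"cpu": 2},
        "nic": [{"id": "nic-1"}, {"id": "nic-2"}],
    }
]

df = pd.json_normalize(records)

# The nested dict is expanded into a dotted column name ...
columns = set(df.columns)

# ... while the list is kept as a single Python list in one cell.
nic_cell = df.loc[0, "nic"]
print(sorted(columns))
print(type(nic_cell).__name__, len(nic_cell))
```

Written to CSV, that list cell is serialized as one string, which is what prevents a clean single-row layout.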

Core Solution: Custom Recursive Flattening Function

To address the limitations of json_normalize(), the high-scoring answer proposes a custom recursive function that can completely flatten the entire JSON structure. This function uses depth-first traversal to concatenate nested keys, generating a flat dictionary. The implementation is as follows:

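def flatten_data(y):
    """Flatten nested dicts and lists into one dict with '_'-joined keys."""
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            # Descend into each value, extending the key prefix.
            for key in x:
                flatten(x[key], name + str(key) + '_')
        elif isinstance(x, list):
            # List elements are keyed by their position.
            for i, item in enumerate(x):
                flatten(item, name + str(i) + '_')
        else:
            # Leaf value: store it, dropping the trailing '_'.
            out[name[:-1]] = x

    flatten(y)
    return out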

The function works as follows: for a dictionary, it recurses into each value and appends the key to the accumulated name; for a list, it recurses into each element and appends the element's index; any other value (string, number, boolean, null) is stored in the output dictionary under the accumulated name, with the trailing underscore removed. Empty dictionaries and lists therefore produce no key at all. For example, applying this function to the JSON above generates flat keys such as virtualmachine_0_securitygroup_0_id.
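
As a usage sketch, the function can be applied to each virtual machine record separately, so that every machine becomes one CSV row. The data below is a shortened, hypothetical version of the response (placeholder identifiers, plus a second machine added as an assumption to show differing column sets), and the output is written with the standard csv module to an in-memory buffer rather than a file.

```python
import csv
import io


def flatten_data(y):
    """Flatten nested dicts and lists into one dict with '_'-joined keys."""
    out = {}

    def flatten(x, name=""):
        if isinstance(x, dict):
            for key, value in x.items():
                flatten(value, f"{name}{key}_")
        elif isinstance(x, list):
            for i, item in enumerate(x):
                flatten(item, f"{name}{i}_")
        else:
            out[name[:-1]] = x

    flatten(y)
    return out


# Hypothetical, shortened response with placeholder identifiers.
response = {
    "count": 2,
    "virtualmachine": [
        {
            "id": "vm-1",
            "name": "test-2",
            "securitygroup": [{"id": "sg-1", "name": "test", "tags": []}],
            "nic": [{"id": "nic-1"}, {"id": "nic-2", "traffictype": "Guest"}],
            "affinitygroup": [],
        },
        {"id": "vm-2", "name": "test-3", "nic": [{"id": "nic-3"}]},
    ],
}

# One flat dict per virtual machine, so each machine maps to one row.
rows = [flatten_data(vm) for vm in response["virtualmachine"]]

# Records can yield different keys; take the union in first-seen order.
fieldnames = list(dict.fromkeys(key for row in rows for key in row))

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter="|", restval="")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Note that the empty tags and affinitygroup lists leave no column behind, and that the number of nic_N_ columns grows with the longest list in the data.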

Application and Considerations

Although this function flattens the JSON completely, it can produce an excessive number of columns, especially for deeply nested structures or long lists. In practice, the degree of flattening has to be balanced against the readability of the result. The user ultimately chose to call json_normalize() with an explicitly specified structure, controlling which fields are expanded through its parameters, as described in the pandas documentation. For instance, record_path selects the nested list whose elements become rows, and meta names the parent fields to repeat on each of those rows.
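
A minimal sketch of this more controlled approach is shown below. record_path, meta, and record_prefix are existing json_normalize parameters; the data is a shortened, hypothetical version of the response with placeholder identifiers, and the choice of nic as the list to expand is an assumption made for illustration, not necessarily what the user selected.

```python
import pandas as pd

# Hypothetical, shortened response with placeholder identifiers.
virtual_machines = {
    "virtualmachine": [
        {
            "id": "vm-1",
            "name": "test-2",
            "hypervisor": "KVM",
            "nic": [
                {"id": "nic-1", "networkid": "net-1"},
                {"id": "nic-2", "networkid": "net-2", "traffictype": "Guest"},
            ],
        }
    ]
}

nics = pd.json_normalize(
    virtual_machines["virtualmachine"],
    record_path="nic",                   # one output row per nic element
    meta=["id", "name", "hypervisor"],   # parent fields repeated on each row
    record_prefix="nic_",                # avoids a clash between nic id and VM id
)
print(nics)
```

This yields one row per network interface rather than one row per virtual machine, so it trades the single-row layout for a fixed, predictable set of columns. Without record_prefix, the id field of the nic records would collide with the id listed in meta.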

Comparison of Supplementary Solutions

Other answers provide different implementations. An improved version also handles empty lists and dictionaries, recording them as None instead of dropping them, and lets the caller choose the key separator:

from collections.abc import MutableMapping

def flatten(dictionary, parent_key=False, separator='.'):
    items = []
    for key, value in dictionary.items():
        new_key = str(parent_key) + separator + key if parent_key else key
        if isinstance(value, MutableMapping):
            if not value.items():
                items.append((new_key, None))
            else:
                items.extend(flatten(value, new_key, separator).items())
        elif isinstance(value, list):
            if len(value):
                for k, v in enumerate(value):
                    items.extend(flatten({str(k): v}, new_key, separator).items())
            else:
                items.append((new_key, None))
        else:
            items.append((new_key, value))
    return dict(items)
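
The effect of the empty-container handling can be checked with a short sketch. The function below is a condensed restatement of the one above, written for illustration only (it treats a list as a mapping from index to element, which is an equivalent formulation, not the answer's exact code), and the sample record is a hypothetical excerpt.

```python
from collections.abc import MutableMapping


def flatten(dictionary, parent_key="", separator="."):
    """Condensed restatement of the dot-separated flattener, for illustration."""
    items = {}
    for key, value in dictionary.items():
        new_key = f"{parent_key}{separator}{key}" if parent_key else str(key)
        if isinstance(value, list):
            # Treat a list as a mapping from index to element.
            value = {str(i): v for i, v in enumerate(value)}
        if isinstance(value, MutableMapping):
            if value:
                items.update(flatten(value, new_key, separator))
            else:
                # An empty dict or list keeps its column, with a None value.
                items[new_key] = None
        else:
            items[new_key] = value
    return items


# Hypothetical excerpt of one virtual machine record.
vm = {
    "name": "test-2",
    "affinitygroup": [],
    "securitygroup": [{"name": "test", "tags": []}],
}
flat = flatten(vm)
print(flat)
```

Unlike the underscore-based function, this version keeps affinitygroup and securitygroup.0.tags as columns, which helps keep the column set stable across records. It expects a dictionary at the top level, so a response like the one above would be flattened one virtual machine at a time.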

Another solution emphasizes that arrays should be flattened into lists of objects rather than adding indices to keys, but the implementation is more complex and suitable for specific scenarios.

Conclusion and Best Practices

Flattening multilevel nested JSON is a common requirement, and the choice of method should consider data structure, performance requirements, and output format. For simple use cases, json_normalize() with parameter adjustments may suffice; for complex nesting, custom recursive functions offer flexibility but may result in overly flat outputs. It is recommended to analyze JSON samples beforehand, identify key fields, and test different solutions to ensure data integrity. Through this discussion, readers can more effectively handle JSON data, laying a foundation for subsequent analysis.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.
