Efficiently Retrieving Subfolder Names in AWS S3 Buckets Using Boto3

Nov 20, 2025 · Programming

Keywords: AWS S3 | Boto3 | Subfolder Retrieval | Python | Object Storage

Abstract: This technical article provides an in-depth analysis of efficiently retrieving subfolder names in AWS S3 buckets, focusing on S3's flat object storage architecture and simulated directory structures. By comparing boto3.client and boto3.resource, it details the correct implementation using list_objects_v2 with Delimiter parameter, complete with code examples and performance optimization strategies to help developers avoid common pitfalls and enhance data processing efficiency.

Fundamental Characteristics of S3 Storage Structure

AWS S3 employs a flat object storage architecture rather than a traditional hierarchical file system. In S3, so-called "folders" are actually simulated by including "/" separators in object key names. For example, the key name first-level/1456753904534/part-00014 appears as file part-00014 within subfolder 1456753904534 under folder first-level in user interfaces, but S3 internally does not maintain actual directory tree structures.
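Since keys are flat strings, the apparent hierarchy is nothing more than the substrings between "/" characters. A quick sketch of how a UI derives "folders" from the example key:

```python
key = "first-level/1456753904534/part-00014"

# The "directories" are just segments of the key string; S3 stores no
# separate objects for "first-level/" or "1456753904534/".
segments = key.split("/")
print(segments)  # ['first-level', '1456753904534', 'part-00014']

# The rendered "parent folder" is simply the key minus its last segment
parent = key.rsplit("/", 1)[0]
print(parent)    # first-level/1456753904534
```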

Choosing Between Boto3 Client and Resource

When working with S3 directory structures, it's recommended to prioritize boto3.client over boto3.resource. While boto3.resource provides a higher-level abstraction, its objects collection iterates only objects (the Contents of each response) and does not expose the CommonPrefixes produced by the Delimiter parameter. boto3.client returns the full S3 API response, allowing more precise control over query behavior.

The standard way to initialize the client is as follows:

import boto3
s3_client = boto3.client("s3")

Retrieving Subfolders Using list_objects_v2 Method

list_objects_v2 is AWS's recommended modern object listing method, offering better performance and feature support compared to the traditional list_objects. By properly setting the Prefix and Delimiter parameters, you can efficiently retrieve subfolders at specific hierarchy levels.

The following code demonstrates how to retrieve all subfolders under the first-level folder:

import boto3

s3_client = boto3.client("s3")
bucket_name = "my-bucket-name"
prefix = "first-level/"

response = s3_client.list_objects_v2(
    Bucket=bucket_name,
    Prefix=prefix,
    Delimiter="/"
)

if "CommonPrefixes" in response:
    for common_prefix in response["CommonPrefixes"]:
        folder_name = common_prefix["Prefix"]
        # Strip the parent prefix and the trailing "/" to get the bare name.
        # Slicing by len(prefix) is safer than str.replace, which would also
        # rewrite any later occurrence of the prefix inside the folder name.
        pure_folder_name = folder_name[len(prefix):].rstrip("/")
        print(f"Subfolder: {pure_folder_name}")

Parameter Details and Best Practices

Prefix Parameter: Specifies the path prefix to query; it should include the complete parent path and end with "/". For example, to query subfolders under first-level, set Prefix="first-level/". Without the trailing slash, Prefix="first-level" would also match sibling keys such as first-level-backup/ and would return first-level/ itself as a single common prefix.

Delimiter Parameter: When set to "/", S3 groups objects with the same prefix up to the next "/" as "CommonPrefixes", which is the key mechanism for obtaining subfolder names.
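Conceptually, the grouping S3 performs can be sketched in plain Python over a list of keys (the sample keys below are illustrative):

```python
# Sample object keys in a flat namespace (illustrative)
keys = [
    "first-level/1456753904534/part-00014",
    "first-level/1456753904534/part-00015",
    "first-level/1456753905000/part-00001",
    "first-level/summary.txt",
]

prefix = "first-level/"
delimiter = "/"

common_prefixes = set()
contents = []

for key in keys:
    rest = key[len(prefix):]      # portion of the key after the queried prefix
    cut = rest.find(delimiter)
    if cut == -1:
        contents.append(key)      # a direct object, returned in Contents
    else:
        # Everything up to and including the next delimiter is grouped
        common_prefixes.add(prefix + rest[:cut + 1])

print(sorted(common_prefixes))
# ['first-level/1456753904534/', 'first-level/1456753905000/']
print(contents)
# ['first-level/summary.txt']
```

Each distinct group appears exactly once in CommonPrefixes, regardless of how many objects share it, which is why this is far cheaper than enumerating every key.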

Performance Optimization: list_objects_v2 returns at most 1,000 results per call, so for buckets containing large numbers of objects it's recommended to use a paginator to handle results:

paginator = s3_client.get_paginator("list_objects_v2")
page_iterator = paginator.paginate(
    Bucket=bucket_name,
    Prefix=prefix,
    Delimiter="/"
)

for page in page_iterator:
    if "CommonPrefixes" in page:
        for common_prefix in page["CommonPrefixes"]:
            folder_name = common_prefix["Prefix"][len(prefix):].rstrip("/")
            print(f"Subfolder: {folder_name}")

Common Mistakes to Avoid

String Processing Approach: Extracting directory paths after listing all objects is extremely inefficient, especially in buckets containing massive numbers of objects:

# Not recommended
import os

# Lists every object in the bucket (and, without pagination, only the
# first 1,000 of them)
all_objects = s3_client.list_objects_v2(Bucket=bucket_name)
folders = set()

for obj in all_objects.get("Contents", []):
    folder_path = os.path.dirname(obj["Key"])
    if folder_path.startswith(prefix):
        folders.add(folder_path)

# This approach transfers and processes every key, with huge performance overhead

Resource Object Filtering: The boto3.resource objects collection iterates only objects, so the CommonPrefixes produced by a Delimiter are silently discarded:

# Does not yield subfolder names
s3_resource = boto3.resource("s3")
bucket = s3_resource.Bucket(bucket_name)

for obj in bucket.objects.filter(Delimiter="/", Prefix=prefix):
    print(obj.key)  # Yields only objects directly under prefix, never CommonPrefixes

Practical Application Scenarios and Extensions

In practical applications, retrieving subfolder names is typically used for: dynamically discovering new data directories, building directory tree navigation, batch processing data from specific time periods, and other scenarios. Combined with other S3 operations, complete data processing pipelines can be constructed.
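For instance, when subfolders are named with millisecond timestamps as in the earlier example, a batch job might select only those falling in a given window. A sketch with hypothetical folder names and bounds:

```python
# Hypothetical subfolder names, as returned by the listing code earlier
subfolders = ["1456753904534", "1456753905000", "not-a-timestamp"]

# Hypothetical processing window in epoch milliseconds, [start, end)
start_ms = 1456753904000
end_ms = 1456753905000

selected = []
for name in subfolders:
    if not name.isdigit():
        continue  # skip folders that are not timestamp-named
    ts = int(name)
    if start_ms <= ts < end_ms:
        selected.append(name)

print(selected)  # ['1456753904534']
```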

A complete production-level implementation should consider error handling, retry mechanisms, and logging:

import time
import logging

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)

def get_subfolders(bucket_name, prefix, max_retries=3):
    """
    Safely retrieve all subfolder names under the specified prefix in an S3 bucket
    """
    s3_client = boto3.client("s3")

    for attempt in range(max_retries):
        try:
            paginator = s3_client.get_paginator("list_objects_v2")
            page_iterator = paginator.paginate(
                Bucket=bucket_name,
                Prefix=prefix,
                Delimiter="/"
            )

            subfolders = []
            for page in page_iterator:
                for common_prefix in page.get("CommonPrefixes", []):
                    folder_name = common_prefix["Prefix"][len(prefix):].rstrip("/")
                    subfolders.append(folder_name)

            return subfolders

        except ClientError:
            logger.warning("list_objects_v2 failed (attempt %d/%d)",
                           attempt + 1, max_retries, exc_info=True)
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff between retries

    return []

Summary and Best Practice Recommendations

The core of efficiently retrieving S3 subfolders lies in correctly understanding S3's storage model and using the API parameters properly. Key points include: prefer boto3.client over boto3.resource for delimiter-based listing; use list_objects_v2 rather than the legacy list_objects; set the Prefix and Delimiter parameters correctly; paginate when listing large buckets; and avoid extracting directory structure by string-processing a full object listing.

By following these best practices, developers can build both efficient and reliable file system navigation functionality, fully leveraging S3's powerful storage capabilities.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.