Keywords: Python | Dictionary Reference | List Storage | Object Reference | Data Structures
Abstract: This article analyzes a common dictionary reference issue in Python programming. Through a practical case of extracting iframe attributes from web pages, it explains why reusing the same dictionary object in a loop results in a list that stores identical references. The article explains Python's object reference mechanism, offers several solutions (creating a new dictionary within the loop, using a list comprehension, and calling the copy() method), and discusses performance considerations and best practices to help developers avoid this pitfall.
Problem Phenomenon and Analysis
A common pitfall when building data structures in Python is that a list meant to hold several different dictionaries ends up with every element pointing to the same dictionary object. Beginners hit this often, and the cause lies in Python's core object reference mechanism.
Consider the following practical scenario: extracting attribute information from all iframe tags on a webpage. The original code implementation is as follows:
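import urllib2                 # Python 2 standard library
from bs4 import BeautifulSoup
from pprint import pprint

# url is assumed to be defined earlier (host name without the scheme)
site = "http://" + url
f = urllib2.urlopen(site)
web_content = f.read()
soup = BeautifulSoup(web_content)
info = {}       # created once, outside the loop (this is the bug)
content = []
for iframe in soup.find_all('iframe'):
    info['src'] = iframe.get('src')
    info['height'] = iframe.get('height')
    info['width'] = iframe.get('width')
    content.append(info)   # appends a reference to the same dict each time
    print(info)
pprint(content)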
During debugging, individual print(info) statements output correct results:
{'src': u'abc.com', 'width': u'0', 'height': u'0'}
{'src': u'xyz.com', 'width': u'0', 'height': u'0'}
{'src': u'http://www.detik.com', 'width': u'1000', 'height': u'600'}
However, the final pprint(content) output shows that all dictionaries in the list are identical:
[{'height': u'600', 'src': u'http://www.detik.com', 'width': u'1000'},
{'height': u'600', 'src': u'http://www.detik.com', 'width': u'1000'},
{'height': u'600', 'src': u'http://www.detik.com', 'width': u'1000'}]
Root Cause: Object Reference Mechanism
The core of this issue lies in Python's object reference model. When executing content.append(info), it's not adding a copy of the dictionary to the list, but rather adding a reference pointing to the same dictionary object. In each iteration of the loop, we're modifying the same dictionary object and then adding another reference to it.
This mechanism can be verified with a simplified example:
>>> d = {}
>>> dlist = []
>>> for i in range(3):
...     d['data'] = i
...     dlist.append(d)
...     print(d)
...
{'data': 0}
{'data': 1}
{'data': 2}
>>> print(dlist)
[{'data': 2}, {'data': 2}, {'data': 2}]
Using the id() function makes it clearer that all list items point to the same object:
>>> for item in dlist:
...     print("List item points to object ID:", id(item))
...
List item points to object ID: 47472232
List item points to object ID: 47472232
List item points to object ID: 47472232
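The same point can be checked without reading raw IDs. The following is a minimal sketch that reuses the d and dlist names from the example above; it tests identity with the is operator and shows that a change made through one list element is visible through all of them.

```python
# Reproduce the aliasing bug: one dict object appended three times.
d = {}
dlist = []
for i in range(3):
    d["data"] = i
    dlist.append(d)  # appends a reference to the same dict, not a copy

# Every element is the very same object as d.
all_same = all(item is d for item in dlist)

# Mutating through one element is visible through all the others.
dlist[0]["data"] = 99
values = [item["data"] for item in dlist]

print(all_same)  # True
print(values)    # [99, 99, 99]
```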
Solutions
Method 1: Create New Dictionary Within Loop
The most direct and effective solution is to create a new dictionary object in each loop iteration:
for iframe in soup.find_all('iframe'):
    info = {}
    info['src'] = iframe.get('src')
    info['height'] = iframe.get('height')
    info['width'] = iframe.get('width')
    content.append(info)
A more elegant implementation creates the complete dictionary directly within the loop:
for iframe in soup.find_all('iframe'):
    info = {
        "src": iframe.get('src'),
        "height": iframe.get('height'),
        "width": iframe.get('width')
    }
    content.append(info)
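Method 2: Using a List Comprehension
When each dictionary can be built in a single expression, a list comprehension containing a dictionary literal creates the whole list at once. The literal is evaluated anew for every element, so each item is a distinct dictionary: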
content = [
    {
        "src": iframe.get('src'),
        "height": iframe.get('height'),
        "width": iframe.get('width')
    }
    for iframe in soup.find_all('iframe')
]
Method 3: Using copy() Method
Another solution is to use the dictionary's copy() method to create copies:
info = {}
for iframe in soup.find_all('iframe'):
    info['src'] = iframe.get('src')
    info['height'] = iframe.get('height')
    info['width'] = iframe.get('width')
    content.append(info.copy())
Verifying the effectiveness of this method:
>>> d = {}
>>> dlist = []
>>> for i in range(3):
...     d['data'] = i
...     dlist.append(d.copy())
...     print(d)
...
{'data': 0}
{'data': 1}
{'data': 2}
>>> print(dlist)
[{'data': 0}, {'data': 1}, {'data': 2}]
Checking object IDs confirms that different objects were created:
>>> for item in dlist:
...     print("List item points to object ID:", id(item))
...
List item points to object ID: 33861576
List item points to object ID: 47472520
List item points to object ID: 47458120
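One caveat is worth keeping in mind: copy() makes a shallow copy. That is enough for the iframe case, where every value is a plain string, but if a dictionary holds other mutable objects, the copies still share them. The sketch below assumes a hypothetical nested "size" key (not part of the original example) to show the difference between copy() and copy.deepcopy().

```python
import copy

# Hypothetical record with a nested dict; the original example has flat values only.
info = {"src": "abc.com", "size": {"width": "0", "height": "0"}}

shallow = info.copy()        # new outer dict, but the inner "size" dict is shared
deep = copy.deepcopy(info)   # new outer dict and a new inner dict

# Mutate the nested dict through the original.
info["size"]["width"] = "1000"

print(shallow["size"]["width"])         # 1000 (shared inner dict changed)
print(deep["size"]["width"])            # 0    (independent inner dict)
print(shallow["size"] is info["size"])  # True
```

For flat dictionaries, copy() is sufficient; deepcopy() is only needed when nested mutable values must be independent as well.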
Performance Considerations and Best Practices
Performance can also matter in data-processing code. As a point of reference from a related task (converting a tabular dataset to a list of dictionaries), one test on a dataset of 100 columns and 10,000 rows reported an average of 876.99 milliseconds when looping over the dataset directly, versus 492.04 milliseconds when first converting it to a PyDataset and then processing it. These figures do not measure the three methods above, but they illustrate that the choice of data structure and iteration method can noticeably affect performance on large datasets.
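To compare the three methods from this article directly, a small timeit harness can be used. This is only a sketch: the rows list is hypothetical stand-in data replacing the parsed iframe tags, the function names are made up for illustration, and no timings are quoted here because they depend on the machine and the Python version.

```python
import timeit

# Hypothetical stand-in data: (src, height, width) tuples instead of iframe tags.
rows = [("site%d.com" % i, "0", "0") for i in range(1000)]

def new_dict_in_loop():
    # Method 1: build a fresh dict in every iteration.
    content = []
    for src, height, width in rows:
        info = {"src": src, "height": height, "width": width}
        content.append(info)
    return content

def list_comprehension():
    # Method 2: list comprehension with a dict literal.
    return [{"src": s, "height": h, "width": w} for s, h, w in rows]

def copy_in_loop():
    # Method 3: reuse one dict, append a shallow copy each time.
    info = {}
    content = []
    for src, height, width in rows:
        info["src"] = src
        info["height"] = height
        info["width"] = width
        content.append(info.copy())
    return content

for fn in (new_dict_in_loop, list_comprehension, copy_in_loop):
    seconds = timeit.timeit(fn, number=100)
    print("%-20s %.4f s for 100 runs" % (fn.__name__, seconds))
```

All three functions return equal lists of distinct dictionaries, so any measured difference reflects only how the dictionaries are built.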
Best practice recommendations:
- Creating new dictionaries within loops is the clearest and least error-prone method
- For simple data structure conversions, prefer a list comprehension that builds a new dictionary per element
- When processing large datasets, consider performance-optimized methods
- Always verify that generated data structures meet expectations
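The last recommendation can be turned into a quick check. The helper below is a minimal sketch with a hypothetical name, all_distinct; it relies on the fact that id() values are unique among objects that are alive at the same time, which holds for the elements of a list.

```python
def all_distinct(items):
    """Return True if no two elements of items are the same object."""
    return len({id(item) for item in items}) == len(items)

# The buggy pattern: one dict referenced three times.
shared = {}
buggy = []
for i in range(3):
    shared["data"] = i
    buggy.append(shared)

# The fixed pattern: a new dict per element.
fixed = [{"data": i} for i in range(3)]

print(all_distinct(buggy))  # False
print(all_distinct(fixed))  # True
```

A check like this fits naturally into an assertion or a unit test while debugging code that builds lists of dictionaries.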
Conclusion
Python's object reference model is powerful, but it has to be understood to be used safely. When working with mutable objects such as dictionaries and lists, it is essential to distinguish between storing a reference and storing a copy. With the right programming patterns, this pitfall can be avoided, resulting in more robust code.
In practical development, it's recommended to adopt the method of creating new dictionaries within loops. This not only solves the reference issue but also makes the code's intent clearer, facilitating maintenance and debugging.