Keywords: XPath | following-sibling | data extraction | HTML parsing | lxml
Abstract: This article provides an in-depth exploration of the XPath following-sibling axis usage, using Newegg website product specification table data extraction as a case study. By analyzing HTML document structure, it details how to use the following-sibling::td axis to locate adjacent sibling elements and compares it with the more concise tr[td[@class='name']='Brand']/td[@class='desc'] expression. The article also covers basic XPath axis concepts, practical application scenarios, and implementation code in Python lxml library, offering a comprehensive technical solution for web data scraping.
XPath Axis Fundamentals
XPath (XML Path Language) is an expression language for navigating and selecting nodes in XML and HTML documents. CSS selectors traditionally move only forward through a document, to descendants and following siblings; XPath can also move upward and backward, to parents, ancestors, and preceding siblings. It does this through axes, which define the relationship between the context node and the nodes to select, so elements can be chosen by their position relative to other elements.
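As a rough illustration of axes pointing in different directions, the sketch below starts from one label cell and walks forward, upward, and backward. The SAMPLE markup is a hypothetical stand-in for a specification table, not the real Newegg page.

```python
from lxml import html

# Hypothetical two-row table, used only for illustration.
SAMPLE = """
<html><body><table>
  <tr><td class="name">Brand</td><td class="desc">AMD</td></tr>
  <tr><td class="name">Series</td><td class="desc">Ryzen 5</td></tr>
</table></body></html>
"""

doc = html.fromstring(SAMPLE)
series_cell = doc.xpath("//td[.='Series']")[0]

# Forward: the sibling cell that follows the label.
forward = series_cell.xpath("following-sibling::td")[0].text

# Upward: the row that contains the label.
parent_tag = series_cell.xpath("parent::*")[0].tag

# Backward: the label cell of the row just before this one.
previous_label = series_cell.xpath(
    "parent::tr/preceding-sibling::tr[1]/td[@class='name']"
)[0].text

print(forward, parent_tag, previous_label)
```

Each expression is evaluated relative to series_cell, which is why none of them starts with a slash.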
Detailed Explanation of following-sibling Axis
The following-sibling axis is used to select all sibling elements that appear after the current element. In the Newegg product specification table case, each <tr> element contains two <td> child elements: one with class="name" for the title cell and one with class="desc" for the data cell. To extract data corresponding to specific titles, the following-sibling axis can be used.
Basic syntax: /parent/current-element/following-sibling::target
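A small sketch of what the axis returns may help here. It assumes a made-up row of four cells and shows that following-sibling yields every later sibling, not just the adjacent one, and never the earlier ones.

```python
from lxml import html

# Hypothetical row with four cells; the second one is the context node.
table = html.fromstring(
    "<table><tr><td>a</td><td id='ctx'>b</td><td>c</td><td>d</td></tr></table>"
)
ctx = table.xpath("//td[@id='ctx']")[0]

# Every <td> sibling after the context node, in document order.
all_after = [td.text for td in ctx.xpath("following-sibling::td")]

# Adding [1] keeps only the nearest following sibling.
first_after = [td.text for td in ctx.xpath("following-sibling::td[1]")]

print(all_after)    # ['c', 'd']
print(first_after)  # ['c']
```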
Newegg Data Extraction Solutions
There are two main ways to extract data from the Newegg product specification table:
Solution 1: Using following-sibling Axis
Directly use the following-sibling axis to locate adjacent data cells:
tr/td[@class='name']/following-sibling::td
Implementation code in Python lxml library:
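from lxml import html

# Parse the HTML document (html_content holds the page source as a string)
parsed_document = html.fromstring(html_content)

# Restrict the label cell to "Brand" first; without this filter, index [0]
# would return the data cell of whichever row happens to come first.
brand_cells = parsed_document.xpath(
    "//tr/td[@class='name'][.='Brand']/following-sibling::td[1]"
)
brand = brand_cells[0].text if brand_cells else None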
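One way to reuse the following-sibling approach for any label is a small lookup helper. The sketch below is an assumption-laden example rather than the article's method: spec_value and SAMPLE are hypothetical names, and the label is passed as an XPath variable ($label), which lxml fills in from a keyword argument.

```python
from lxml import html

# Hypothetical markup that mimics the name/desc layout described above.
SAMPLE = """
<html><body><table>
  <tr><td class="name">Brand</td><td class="desc">AMD</td></tr>
  <tr><td class="name">Series</td><td class="desc"> Ryzen 5 </td></tr>
</table></body></html>
"""

def spec_value(doc, label):
    """Return the text of the cell right after the given label cell, or None."""
    cells = doc.xpath(
        "//td[@class='name'][normalize-space(.)=$label]/following-sibling::td[1]",
        label=label,
    )
    # text_content() also collects text from nested tags such as <a> or <span>.
    return cells[0].text_content().strip() if cells else None

doc = html.fromstring(SAMPLE)
print(spec_value(doc, "Brand"))
print(spec_value(doc, "Series"))
print(spec_value(doc, "Socket"))  # label absent from SAMPLE
```

Passing the label as a variable avoids building the expression with string formatting, so labels containing quotes do not break the query.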
Solution 2: Using Predicate Expressions
A more concise approach uses predicates to directly locate rows containing specific titles:
tr[td[@class='name']='Brand']/td[@class='desc']
Complete implementation in Python:
class CPU:
    def __init__(self):
        self.brand = None
        self.series = None
        self.cores = None
        self.socket = None

def extract_cpu_specs(parsed_document):
    cpu = CPU()
    # Extract brand
    brand_elements = parsed_document.xpath("//tr[td[@class='name']='Brand']/td[@class='desc']")
    if brand_elements:
        cpu.brand = brand_elements[0].text
    # Extract series
    series_elements = parsed_document.xpath("//tr[td[@class='name']='Series']/td[@class='desc']")
    if series_elements:
        cpu.series = series_elements[0].text
    # Extract cores
    cores_elements = parsed_document.xpath("//tr[td[@class='name']='Cores']/td[@class='desc']")
    if cores_elements:
        cpu.cores = cores_elements[0].text
    # Extract socket type
    socket_elements = parsed_document.xpath("//tr[td[@class='name']='Socket']/td[@class='desc']")
    if socket_elements:
        cpu.socket = socket_elements[0].text
    return cpu
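The four near-identical lookups could also be driven by a mapping of field names to labels. The following is a minimal sketch under stated assumptions: FIELDS, extract_specs, and SAMPLE are hypothetical, the result is a plain dict rather than a CPU object, and the sample table deliberately omits the Socket row to show the missing-value case.

```python
from lxml import html

# Hypothetical markup; the real page may differ.
SAMPLE = """
<html><body><table>
  <tr><td class="name">Brand</td><td class="desc">AMD</td></tr>
  <tr><td class="name">Series</td><td class="desc">Ryzen 5</td></tr>
  <tr><td class="name">Cores</td><td class="desc"> 6-Core </td></tr>
</table></body></html>
"""

# Field name in the result -> label text in the table (assumed labels).
FIELDS = {"brand": "Brand", "series": "Series", "cores": "Cores", "socket": "Socket"}

def extract_specs(doc, fields=FIELDS):
    """Look up each label with the predicate expression and return a dict."""
    specs = {}
    for key, label in fields.items():
        cells = doc.xpath(
            "//tr[td[@class='name']=$label]/td[@class='desc']", label=label
        )
        specs[key] = cells[0].text_content().strip() if cells else None
    return specs

specs = extract_specs(html.fromstring(SAMPLE))
print(specs)
```

Adding a new field then means adding one entry to FIELDS instead of another copy of the lookup.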
Technical Analysis
Advantages of using the second solution include:
- More concise code that directly targets elements
- Avoidance of complex axis traversal
- Better readability and maintainability
- Possibly less node traversal, though any performance difference is likely negligible for a table of this size
It's important to note that this approach relies on two key assumptions:
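- For the relative form tr[...], the context node of the XPath expression is the parent of the <tr> elements (the Python code above avoids this requirement by starting the expression with //)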
- Each <tr> element has exactly one <td> with class="name" and one <td> with class="desc"
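Both assumptions can be checked on a small example. The sketch below uses hypothetical markup in which one row breaks the second assumption by carrying two desc cells; it evaluates the relative expression from two different context nodes and then counts the matches for the irregular row.

```python
from lxml import html

# Hypothetical markup; the "Cores" row has two desc cells on purpose.
SAMPLE = """
<html><body><table id="specs">
  <tr><td class="name">Brand</td><td class="desc">AMD</td></tr>
  <tr><td class="name">Cores</td><td class="desc">6</td><td class="desc">12 threads</td></tr>
</table></body></html>
"""

doc = html.fromstring(SAMPLE)               # root element is <html>
table = doc.xpath("//table[@id='specs']")[0]

RELATIVE = "tr[td[@class='name']='Brand']/td[@class='desc']"

from_root = doc.xpath(RELATIVE)        # <html> has no <tr> children: no match
from_table = table.xpath(RELATIVE)     # context is the parent of the rows
anywhere = doc.xpath("//" + RELATIVE)  # leading // searches the whole document

# Assumption 2 violated: the expression returns every desc cell in the row.
cores = [td.text for td in doc.xpath("//tr[td[@class='name']='Cores']/td[@class='desc']")]

print(len(from_root), len(from_table), len(anywhere), cores)
```

When a row may hold several desc cells, taking index [0] silently drops the rest, so it can be worth checking the length of the result first.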
Practical Application Extensions
The following-sibling axis is not limited to simple adjacent sibling positioning but can be combined with other XPath features for more complex selections:
Select only the first following sibling:
tr/td[@class='name']/following-sibling::td[1]
Filter the following siblings by attribute:
tr/td[@class='name']/following-sibling::td[@class='desc']
Use the position() function (equivalent to the [1] shorthand):
tr/td[@class='name']/following-sibling::td[position()=1]
These advanced usages are particularly valuable when dealing with more complex document structures, providing more precise element positioning capabilities.
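The difference between the positional and attribute-based variants shows up once a row contains an extra cell. The sketch below assumes a hypothetical row with an icon cell between the label and the data cell.

```python
from lxml import html

# Hypothetical row with an extra cell between the label and the data.
SAMPLE = """
<html><body><table>
  <tr>
    <td class="name">Brand</td>
    <td class="icon">*</td>
    <td class="desc">AMD</td>
  </tr>
</table></body></html>
"""

doc = html.fromstring(SAMPLE)
BASE = "//tr/td[@class='name']/following-sibling::"

by_position = [td.text for td in doc.xpath(BASE + "td[1]")]
by_function = [td.text for td in doc.xpath(BASE + "td[position()=1]")]
by_class = [td.text for td in doc.xpath(BASE + "td[@class='desc']")]

print(by_position)  # ['*']
print(by_function)  # ['*']
print(by_class)     # ['AMD']
```

The positional forms pick whatever cell comes next, while the attribute filter keeps finding the data cell even if the layout gains extra columns.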