Applying XPath following-sibling Axis: Extracting Data from Newegg Product Specification Tables

Keywords: XPath | following-sibling | data extraction | HTML parsing | lxml

Abstract: This article examines the XPath following-sibling axis, using the extraction of product specification data from Newegg tables as a case study. Working from the HTML document structure, it shows how the following-sibling::td step locates the cell that follows a title cell, and compares this with the more concise predicate expression tr[td[@class='name']='Brand']/td[@class='desc']. It also covers basic XPath axis concepts, practical applications, and implementation code using the Python lxml library.

XPath Axis Fundamentals

XPath (XML Path Language) is an expression language for navigating and selecting nodes in XML and HTML documents. CSS selectors have traditionally matched only downward and toward later siblings, whereas XPath can traverse in every direction: up to parents and ancestors, down to children and descendants, and sideways to preceding or following siblings. Each of these directions is an axis. An axis defines the relationship between the context node and the nodes that a step selects, so elements can be chosen by their position relative to other elements.
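
As a rough illustration of what "axis" means in practice, the sketch below evaluates three axes with lxml. The markup is a hypothetical two-cell row modeled on the structure discussed later in this article, not a copy of the Newegg page.

```python
from lxml import html

# Hypothetical fragment for illustration; not taken from the Newegg page.
doc = html.fromstring(
    "<table>"
    "<tr><td class='name'>Brand</td><td class='desc'>AMD</td></tr>"
    "</table>"
)

desc = doc.xpath("//td[@class='desc']")[0]

# child axis (the default for a plain step): from the row down to its cells
cells = doc.xpath("//tr/child::td")
# parent axis: from a cell up to its row
row = desc.xpath("parent::tr")[0]
# preceding-sibling axis: from the data cell sideways to the title cell
title = desc.xpath("preceding-sibling::td")[0]

print(len(cells), row.tag, title.text)
```

The same pattern applies to the other axes: the axis name goes before the :: separator, and the node test after it.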

Detailed Explanation of following-sibling Axis

The following-sibling axis selects every sibling that comes after the context node in document order, that is, every later node that shares its parent. In the Newegg product specification table, each <tr> element contains two <td> child elements: one with class="name" for the title cell and one with class="desc" for the data cell. To extract the data that belongs to a given title, step from the title cell along the following-sibling axis to the data cell.

Basic syntax: /parent/current-element/following-sibling::target
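
The syntax above can be tried out on a small example. This is a minimal sketch using a hypothetical three-cell row; it shows that the axis returns all later siblings of the context node, and none of the earlier ones.

```python
from lxml import html

# Hypothetical row with three cells, used only to show the axis semantics.
row = html.fromstring(
    "<table><tr><td>A</td><td>B</td><td>C</td></tr></table>"
).xpath("//tr")[0]

first, second, third = row.xpath("td")

# Evaluate the axis with each cell in turn as the context node.
after_first = [td.text for td in first.xpath("following-sibling::td")]
after_second = [td.text for td in second.xpath("following-sibling::td")]
after_third = [td.text for td in third.xpath("following-sibling::td")]

print(after_first, after_second, after_third)
```

The last cell has no following siblings, so the result there is an empty list rather than an error.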

Newegg Data Extraction Solutions

There are two main ways to extract data from the Newegg product specification table:

Solution 1: Using following-sibling Axis

Directly use the following-sibling axis to locate adjacent data cells:

tr/td[@class='name']/following-sibling::td

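Implementation with the Python lxml library:

from lxml import html

# Parse the HTML document (html_content is assumed to hold the page source as a string)
parsed_document = html.fromstring(html_content)

# Find the title cell whose text is 'Brand', then step to the data cell after it.
# Without the [.='Brand'] filter, the expression matches the data cell of every row.
brand_cells = parsed_document.xpath("//tr/td[@class='name'][.='Brand']/following-sibling::td")
brand = brand_cells[0].text if brand_cells else None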
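
One point worth checking with this approach: evaluated without a filter on the title text, tr/td[@class='name']/following-sibling::td matches the data cell of every row, so taking index 0 returns whichever row comes first. The sketch below compares the unfiltered and filtered forms on hypothetical markup modeled on the structure described above; the real Newegg page may differ.

```python
from lxml import html

# Hypothetical markup modeled on the structure described in this article.
html_content = """
<table>
  <tr><td class="name">Brand</td><td class="desc">Intel</td></tr>
  <tr><td class="name">Series</td><td class="desc">Core i7</td></tr>
</table>
"""
parsed_document = html.fromstring(html_content)

# Unfiltered: one data cell per row.
all_values = parsed_document.xpath(
    "//tr/td[@class='name']/following-sibling::td"
)
# Filtered on the title text: only the requested row.
series_only = parsed_document.xpath(
    "//tr/td[@class='name'][.='Series']/following-sibling::td"
)

print([td.text for td in all_values])
print([td.text for td in series_only])
```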

Solution 2: Using Predicate Expressions

A more concise approach uses predicates to directly locate rows containing specific titles:

tr[td[@class='name']='Brand']/td[@class='desc']

Complete implementation in Python:

class CPU:
    def __init__(self):
        self.brand = None
        self.series = None
        self.cores = None
        self.socket = None

def extract_cpu_specs(parsed_document):
    cpu = CPU()
    
    # Extract brand
    brand_elements = parsed_document.xpath("//tr[td[@class='name']='Brand']/td[@class='desc']")
    if brand_elements:
        cpu.brand = brand_elements[0].text
    
    # Extract series
    series_elements = parsed_document.xpath("//tr[td[@class='name']='Series']/td[@class='desc']")
    if series_elements:
        cpu.series = series_elements[0].text
    
    # Extract cores
    cores_elements = parsed_document.xpath("//tr[td[@class='name']='Cores']/td[@class='desc']")
    if cores_elements:
        cpu.cores = cores_elements[0].text
    
    # Extract socket type
    socket_elements = parsed_document.xpath("//tr[td[@class='name']='Socket']/td[@class='desc']")
    if socket_elements:
        cpu.socket = socket_elements[0].text
    
    return cpu
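
Because the four lookups differ only in the title text, the same idea can be written as a loop. The following is one possible sketch rather than the original author's code: FIELD_LABELS and extract_specs are hypothetical names, the title is passed as an XPath variable through lxml's keyword arguments, and text_content() is used so that cells containing nested markup still yield their text.

```python
from lxml import html

# Hypothetical mapping from field names to the row titles used in this article.
FIELD_LABELS = {
    "brand": "Brand",
    "series": "Series",
    "cores": "Cores",
    "socket": "Socket",
}

def extract_specs(parsed_document):
    """Return a dict of spec values; rows that are missing map to None."""
    specs = {}
    for field, label in FIELD_LABELS.items():
        # $label is an XPath variable bound by the keyword argument.
        cells = parsed_document.xpath(
            "//tr[td[@class='name']=$label]/td[@class='desc']", label=label
        )
        specs[field] = cells[0].text_content().strip() if cells else None
    return specs

# Sample markup modeled on the structure described above; the live page may differ.
sample = html.fromstring(
    "<table>"
    "<tr><td class='name'>Brand</td><td class='desc'>AMD</td></tr>"
    "<tr><td class='name'>Cores</td><td class='desc'>8</td></tr>"
    "</table>"
)
print(extract_specs(sample))
```

Adding a new field then only requires a new entry in the mapping.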

Technical Analysis

Advantages of using the second solution include:

More concise code that directly targets elements
Avoidance of complex axis traversal
Better readability and maintainability
Potentially less node traversal, although any performance difference is likely to be negligible for a table of this size

It's important to note that this approach relies on two key assumptions:

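When the expression is used in its relative form (tr[...]), the context node must be the parent of the <tr> elements; the //tr form used in the code above searches the whole document instead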
Each <tr> element has exactly one <td> with class="name" and one <td> with class="desc"
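
The effect of the context node can be sketched as follows. The example also illustrates a further assumption that is not stated above: the = comparison requires the title text to match exactly, so stray whitespace in the cell defeats it unless normalize-space() is applied. The markup is hypothetical.

```python
from lxml import html

# Hypothetical markup: the title cell carries surrounding whitespace.
wrapper = html.fromstring(
    "<div><table>"
    "<tr><td class='name'> Brand </td><td class='desc'>AMD</td></tr>"
    "</table></div>"
)
table = wrapper.xpath("//table")[0]

exact = "tr[td[@class='name']='Brand']/td[@class='desc']"
tolerant = "tr[normalize-space(td[@class='name'])='Brand']/td[@class='desc']"

# Context node is the parent of the rows, but the exact comparison fails
# because the cell text is ' Brand ' rather than 'Brand'.
exact_from_table = table.xpath(exact)
# normalize-space() trims the text, so the row is found.
tolerant_from_table = table.xpath(tolerant)
# Wrong context node: the <div> has no <tr> children, so nothing matches.
tolerant_from_div = wrapper.xpath(tolerant)
# Prefixing // searches the whole document, whatever the context node is.
tolerant_anywhere = wrapper.xpath("//" + tolerant)

print(len(exact_from_table), len(tolerant_from_table),
      len(tolerant_from_div), len(tolerant_anywhere))
```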

Practical Application Extensions

The following-sibling axis is not limited to simple adjacent sibling positioning but can be combined with other XPath features for more complex selections:

Select the nearest following sibling:
tr/td[@class='name']/following-sibling::td[1]

Combine with attribute filtering:
tr/td[@class='name']/following-sibling::td[@class='desc']

Use the position() function (equivalent to td[1]):
tr/td[@class='name']/following-sibling::td[position()=1]
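
The difference between these variants only shows up when the row has more than two cells. The sketch below assumes a hypothetical row with an extra cell between the title and the data cell: the positional forms pick the immediate neighbor, while the attribute filter picks the cell by its class.

```python
from lxml import html

# Hypothetical row with an extra cell between the title and the data cell.
row = html.fromstring(
    "<table><tr>"
    "<td class='name'>Brand</td><td class='icon'>*</td><td class='desc'>AMD</td>"
    "</tr></table>"
).xpath("//tr")[0]

nearest = row.xpath("td[@class='name']/following-sibling::td[1]")
by_position = row.xpath("td[@class='name']/following-sibling::td[position()=1]")
by_class = row.xpath("td[@class='name']/following-sibling::td[@class='desc']")

print(nearest[0].text, by_position[0].text, by_class[0].text)
```

For tables whose layout may change, filtering on the class is usually the more robust choice, since it does not depend on the cell order.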

These advanced usages are particularly valuable when dealing with more complex document structures, providing more precise element positioning capabilities.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.