Applying XPath following-sibling Axis: Extracting Data from Newegg Product Specification Tables

Nov 30, 2025 · Programming

Keywords: XPath | following-sibling | data extraction | HTML parsing | lxml

Abstract: This article explores the XPath following-sibling axis, using data extraction from Newegg product specification tables as a case study. Starting from the HTML document structure, it shows how the following-sibling::td axis locates the cell next to a title cell and compares that approach with the more concise predicate expression tr[td[@class='name']='Brand']/td[@class='desc']. The article also covers basic XPath axis concepts, practical application scenarios, and implementation code using the Python lxml library.

XPath Axis Fundamentals

XPath (XML Path Language) is an expression language for navigating and selecting nodes in XML and HTML documents. CSS selectors are largely one-directional, matching from an element toward its descendants and later siblings, whereas XPath can move in any direction: up to parents and ancestors, and sideways to both preceding and following siblings. This is done with axes, which define the relationship between the context node and the nodes to be selected, so elements can be chosen by their position relative to other elements.
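As a small illustration of axes, the sketch below parses a made-up three-cell row with lxml and moves from the middle cell to its parent, to the cell before it, and to the cell after it. The markup and variable names are invented for this example.

```python
from lxml import html

# Hypothetical fragment, invented for illustration only.
doc = html.fromstring(
    "<table><tr><td>a</td><td id='mid'>b</td><td>c</td></tr></table>"
)

# Start from the middle cell and walk along three different axes.
mid = doc.xpath("//td[@id='mid']")[0]
parent_tag = mid.xpath("parent::*")[0].tag            # element one level up
before = mid.xpath("preceding-sibling::td/text()")    # cells to the left
after = mid.xpath("following-sibling::td/text()")     # cells to the right

print(parent_tag, before, after)
```

Each call is evaluated relative to the middle cell, which is what "context node" means in the paragraph above.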

Detailed Explanation of following-sibling Axis

The following-sibling axis selects all siblings that come after the context node in document order, that is, the later children of the same parent. In the Newegg product specification table, each <tr> element contains two <td> child elements: one with class="name" for the title cell and one with class="desc" for the data cell. To extract the data that belongs to a specific title, the following-sibling axis can be applied to the title cell.

Basic syntax: /parent/current-element/following-sibling::target
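The axis returns every later sibling, not only the adjacent one. The following minimal sketch, using an invented row with an extra third cell, shows the difference between the plain axis and the same step restricted to the nearest sibling with [1].

```python
from lxml import html

# Hypothetical row: a title cell followed by two more cells.
doc = html.fromstring(
    "<table><tr>"
    "<td class='name'>Brand</td>"
    "<td class='desc'>Intel</td>"
    "<td class='note'>boxed</td>"
    "</tr></table>"
)

# All siblings after the title cell, in document order.
all_later = doc.xpath("//td[@class='name']/following-sibling::td/text()")

# Only the nearest following sibling.
nearest = doc.xpath("//td[@class='name']/following-sibling::td[1]/text()")

print(all_later)
print(nearest)
```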

Newegg Data Extraction Solutions

There are two main ways to extract data from the Newegg product specification table:

Solution 1: Using following-sibling Axis

Directly use the following-sibling axis to locate adjacent data cells:

tr/td[@class='name']/following-sibling::td

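Implementation with the Python lxml library. Because the expression above matches the data cell of every row, the title cell is additionally tested against 'Brand':

from lxml import html

# Parse the HTML document (html_content holds the page source)
parsed_document = html.fromstring(html_content)

# Select the data cell that follows the 'Brand' title cell
brand_cells = parsed_document.xpath(
    "//tr/td[@class='name'][.='Brand']/following-sibling::td"
)

# Store the value on a CPU instance (the CPU class is defined under Solution 2)
cpu = CPU()
if brand_cells:
    cpu.brand = brand_cells[0].text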
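To see how the axis behaves on a whole table, here is a self-contained sketch. The SAMPLE markup only imitates the name/desc layout described in this article and is not copied from a real Newegg page. Without a condition on the title text, the expression matches the data cell of every row, so taking the first result returns whatever row happens to come first.

```python
from lxml import html

# Hypothetical markup imitating the name/desc layout; not a real Newegg page.
SAMPLE = """
<table>
  <tr><td class="name">Brand</td><td class="desc">Intel</td></tr>
  <tr><td class="name">Series</td><td class="desc">Core i7</td></tr>
</table>
"""
doc = html.fromstring(SAMPLE)

# No test on the title text: one data cell per row is returned.
every_value = doc.xpath(
    "//tr/td[@class='name']/following-sibling::td/text()"
)

# Restricting the title cell to 'Brand' narrows the result to a single cell.
brand_only = doc.xpath(
    "//tr/td[@class='name'][.='Brand']/following-sibling::td/text()"
)

print(every_value)
print(brand_only)
```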

Solution 2: Using Predicate Expressions

A more concise approach uses predicates to directly locate rows containing specific titles:

tr[td[@class='name']='Brand']/td[@class='desc']

Complete implementation in Python:

class CPU:
    def __init__(self):
        self.brand = None
        self.series = None
        self.cores = None
        self.socket = None


def extract_cpu_specs(parsed_document):
    cpu = CPU()

    # Extract brand
    brand_elements = parsed_document.xpath("//tr[td[@class='name']='Brand']/td[@class='desc']")
    if brand_elements:
        cpu.brand = brand_elements[0].text

    # Extract series
    series_elements = parsed_document.xpath("//tr[td[@class='name']='Series']/td[@class='desc']")
    if series_elements:
        cpu.series = series_elements[0].text

    # Extract cores
    cores_elements = parsed_document.xpath("//tr[td[@class='name']='Cores']/td[@class='desc']")
    if cores_elements:
        cpu.cores = cores_elements[0].text

    # Extract socket type
    socket_elements = parsed_document.xpath("//tr[td[@class='name']='Socket']/td[@class='desc']")
    if socket_elements:
        cpu.socket = socket_elements[0].text

    return cpu
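Since the four lookups differ only in the title text, the same predicate expression can be driven by a list of labels. The sketch below is one possible variation, not the article's original code: the extract_specs helper, the SAMPLE markup, and the dictionary result are all assumptions made for illustration. It passes the label as an XPath variable, which lxml binds from a keyword argument.

```python
from lxml import html

# Hypothetical markup imitating the name/desc layout; not a real Newegg page.
SAMPLE = """
<table>
  <tr><td class="name">Brand</td><td class="desc">Intel</td></tr>
  <tr><td class="name">Series</td><td class="desc">Core i7</td></tr>
  <tr><td class="name">Cores</td><td class="desc">4</td></tr>
</table>
"""


def extract_specs(doc, labels):
    """Return {label: text or None}, one predicate lookup per label."""
    specs = {}
    for label in labels:
        # $label is an XPath variable, bound from the keyword argument below.
        cells = doc.xpath(
            "//tr[td[@class='name']=$label]/td[@class='desc']", label=label
        )
        specs[label] = cells[0].text_content().strip() if cells else None
    return specs


doc = html.fromstring(SAMPLE)
specs = extract_specs(doc, ["Brand", "Series", "Cores", "Socket"])
print(specs)
```

A label with no matching row, such as Socket in this sample, maps to None, mirroring the "if elements:" guards in the code above.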

Technical Analysis

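The second solution is more concise, and because it selects the data cell by its class rather than by its position relative to the title cell, it does not depend on the order of the cells within a row.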

It's important to note that this approach relies on two key assumptions:

  1. The context node for the XPath expression is the parent of all <tr> elements
  2. Each <tr> element has exactly one <td> with class="name" and one <td> with class="desc"
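Both assumptions can be checked on a small example. The sketch below uses invented markup with a padded title cell. It shows that the relative expression finds nothing when evaluated from the wrong context node, and that an exact string comparison fails on stray whitespace unless normalize-space() is applied. Wrapping the title cell in normalize-space() is a suggestion of this sketch, not part of the original expressions.

```python
from lxml import html

# Hypothetical markup; note the spaces around the title text.
root = html.document_fromstring(
    "<html><body><table>"
    "<tr><td class='name'> Brand </td><td class='desc'>Intel</td></tr>"
    "</table></body></html>"
)
table = root.xpath("//table")[0]

TOLERANT = "tr[normalize-space(td[@class='name'])='Brand']/td[@class='desc']/text()"
EXACT = "tr[td[@class='name']='Brand']/td[@class='desc']/text()"

# Context node is <html>, which has no <tr> children: nothing is selected.
from_root = root.xpath(TOLERANT)

# Context node is the parent of the rows, but ' Brand ' != 'Brand'.
exact = table.xpath(EXACT)

# Same context, with whitespace trimmed before the comparison.
tolerant = table.xpath(TOLERANT)

print(from_root, exact, tolerant)
```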

Practical Application Extensions

The following-sibling axis is not limited to simple adjacent sibling positioning but can be combined with other XPath features for more complex selections:

Select a following sibling by position:

tr/td[@class='name']/following-sibling::td[1]

Combine with attribute filtering:

tr/td[@class='name']/following-sibling::td[@class='desc']

Use the position() function (equivalent to the [1] shorthand):

tr/td[@class='name']/following-sibling::td[position()=1]

These advanced usages are particularly valuable when dealing with more complex document structures, providing more precise element positioning capabilities.
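The three variants can be compared side by side. In the sketch below, the row is invented and deliberately places an extra cell between the title and the data cell, so the positional forms and the attribute filter give different answers.

```python
from lxml import html

# Hypothetical row with an extra cell between the title and the data cell.
doc = html.fromstring(
    "<table><tr>"
    "<td class='name'>Brand</td>"
    "<td class='unit'>n/a</td>"
    "<td class='desc'>Intel</td>"
    "</tr></table>"
)

base = "//tr/td[@class='name']/following-sibling::"

by_index = doc.xpath(base + "td[1]/text()")                # nearest sibling
by_position = doc.xpath(base + "td[position()=1]/text()")  # same as [1]
by_class = doc.xpath(base + "td[@class='desc']/text()")    # filtered by class

print(by_index, by_position, by_class)
```

The positional forms always pick the nearest following cell, whatever it contains, while the class filter keeps working if extra cells appear in the row.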
