Keywords: Python | BeautifulSoup | HTML Parsing | Table Extraction | Web Scraping
Abstract: This article demonstrates how to use Python's BeautifulSoup library to parse HTML tables, using the NYC parking ticket website as an example. It covers the core method of extracting table data, handling edge cases, and provides alternative approaches with pandas. The content is structured for clarity and includes code examples with explanations.
In this article, we explore how to parse HTML tables using the BeautifulSoup library in Python, with a practical example from the NYC parking ticket website.
Introduction
Web scraping is a common task in data extraction, and HTML tables are a frequent source of structured data. The NYC parking ticket website provides a table of line items that can be parsed using BeautifulSoup.
Core Parsing Method
The key steps involve fetching the HTML content, parsing it with BeautifulSoup, locating the target table by its class attribute, and then iterating through rows and cells to extract text data.
import requests
from bs4 import BeautifulSoup

url = "https://paydirect.link2gov.com/NYCParking-Plate/ItemSearch"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Locate the line-items table by its class attribute
table = soup.find('table', {'class': 'lineItemsTable'})
table_body = table.find('tbody')
rows = table_body.find_all('tr')

data = []
for row in rows:
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    cols = [col for col in cols if col]  # Remove empty strings
    data.append(cols)
print(data)

This code snippet demonstrates the core method. It starts by importing the necessary libraries, then fetches and parses the HTML. The table is found by its class, and the tbody is accessed to get all rows. Each row's cells are extracted, stripped of whitespace, and filtered for non-empty values before being appended to the data list.
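If you also want column names rather than bare lists, the header row can be read from the table's thead and zipped with each body row. The sketch below uses a small inline HTML sample standing in for the live page (whose markup may change); the column names and values are illustrative, not taken from the actual site.

```python
from bs4 import BeautifulSoup

# Inline sample standing in for the fetched page (hypothetical columns/values)
html = """
<table class="lineItemsTable">
  <thead><tr><th>Plate</th><th>Date</th><th>Amount</th></tr></thead>
  <tbody>
    <tr><td>ABC1234</td><td>01/15/2024</td><td>$65.00</td></tr>
    <tr><td>ABC1234</td><td>02/02/2024</td><td>$115.00</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'class': 'lineItemsTable'})

# Read the column names from the thead, then zip each body row into a dict
headers = [th.text.strip() for th in table.find('thead').find_all('th')]
records = []
for row in table.find('tbody').find_all('tr'):
    cells = [td.text.strip() for td in row.find_all('td')]
    records.append(dict(zip(headers, cells)))

print(records)
```

Working with dicts keyed by header name makes downstream code more readable than indexing positional lists.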
Handling Edge Cases
In the parsed data, you may encounter rows with fewer columns or additional elements like payment amounts. For instance, the last row in the example output contains a payment amount that is not part of the standard table data. To handle this, you can check the length of the cols list and skip rows that do not meet the expected column count.
# Example of filtering rows based on column count
filtered_data = [row for row in data if len(row) >= 7]  # Assuming at least 7 columns for valid data

Additionally, some cells might contain input elements or other HTML structures, so using .text.strip() ensures we extract only the textual content.
Alternative Approaches
For simpler scenarios, the pandas library offers a convenient function, read_html, that can parse tables directly from URLs or HTML strings. This method is useful when you need quick table extraction without detailed control.
import pandas as pd
url = "https://paydirect.link2gov.com/NYCParking-Plate/ItemSearch"
df_list = pd.read_html(url)
df = df_list[0] # Access the first table
print(df.head())

However, BeautifulSoup provides more flexibility for complex parsing tasks, such as when tables have dynamic content or require specific handling.
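A middle ground worth knowing: read_html accepts an attrs argument that filters tables by their HTML attributes, mirroring the class-based lookup used with BeautifulSoup above. A sketch on an inline HTML string (the columns and values are illustrative; parsing requires lxml or html5lib to be installed):

```python
from io import StringIO
import pandas as pd

# Inline HTML standing in for the fetched page; attrs targets only tables
# whose class matches, so df_list[0] is the table we actually want
html = """
<table class="lineItemsTable">
  <thead><tr><th>Plate</th><th>Amount</th></tr></thead>
  <tbody>
    <tr><td>ABC1234</td><td>65.00</td></tr>
  </tbody>
</table>
"""

df_list = pd.read_html(StringIO(html), attrs={'class': 'lineItemsTable'})
df = df_list[0]
print(df)
```

Wrapping the string in StringIO avoids the deprecation of passing literal HTML directly in recent pandas versions.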
Conclusion
Using BeautifulSoup for HTML table parsing is a powerful approach for web scraping. It allows precise control over the extraction process and can handle various edge cases. Always test your code with different inputs and consider using libraries like pandas for simpler tasks.