JavaScript Regular Expressions: A Comprehensive Guide to Extracting Text Between HTML Tags

Dec 07, 2025 · Programming

Keywords: JavaScript | Regular Expressions | HTML Text Extraction

Abstract: This article explains how to use regular expressions in JavaScript to extract text between HTML tags. It covers the global flag (g), the differences between the match() and exec() methods, and extended patterns for tags with attributes. Working from code examples reconstructed from the original Q&A, it explains non-greedy matching (.*?) and the text-cleaning step with map() and replace(), moving from a basic solution to more advanced variants.

Regular Expression Basics and Global Matching

When processing HTML strings in JavaScript, regular expressions are a convenient tool for extracting text between specific tags. In the original problem, the user attempted /<b>(.*?)<\/b>/.exec(str), but without the global flag this call only ever returns the first match. The key improvement is adding the global flag g, forming /<b>(.*?)<\/b>/g. With g set, str.match() returns every match in the string at once, and repeated exec() calls step through the string one match at a time.
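The effect of the g flag on exec() can be seen in a minimal sketch. The sample string `str` is an assumption made for illustration (the original question's input is not reproduced here), as are the variable names `once` and `every`.

```javascript
// Hypothetical sample input, chosen to contain three <b> elements.
var str = "My name is <b>Bob</b>, I'm <b>20</b> years old, I like <b>programming</b>.";

// Without g, every exec() call searches from index 0 again.
var once = /<b>(.*?)<\/b>/;
console.log(once.exec(str)[1]); // "Bob"
console.log(once.exec(str)[1]); // "Bob" again

// With g, the regex stores its position in lastIndex and resumes from there.
var every = /<b>(.*?)<\/b>/g;
console.log(every.exec(str)[1]); // "Bob"
console.log(every.exec(str)[1]); // "20"
```

Because a global regex carries state in lastIndex, reusing the same regex object on a different string may start the search mid-string unless lastIndex is reset to 0 first.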

match() Method and Text Processing Flow

Using str.match() with a global regular expression returns an array of all matches in one call. With the g flag, however, the array holds only the full matches and the capture groups are dropped. For example, str.match(/<b>(.*?)<\/b>/g) yields ["<b>Bob</b>", "<b>20</b>", "<b>programming</b>"]. To extract the pure text, post-processing via map() and replace() is required:

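// match() returns null when nothing matches, so fall back to an empty array.
var result = (str.match(/<b>(.*?)<\/b>/g) || []).map(function (val) {
    // Strip the opening and closing <b> tags, leaving only the inner text.
    return val.replace(/<\/?b>/g, '');
});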

This code first matches all <b> tag pairs, then removes the tags themselves, ultimately outputting ["Bob", "20", "programming"].

Non-Greedy Matching and Handling Tags with Attributes

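The pattern (.*?) uses non-greedy (lazy) matching: it stops at the first occurrence of </b> rather than the last, which prevents one match from spanning several tag pairs. Note that . does not match line breaks by default, so text that wraps across lines inside a tag is not matched. When tags include attributes, the expression can be extended to /<b [^>]+>(.*?)<\/b>/g, where [^>]+ matches one or more characters other than >, which accommodates tags like <b class="bold">. Because that pattern requires a space and at least one attribute character, it no longer matches a bare <b>. The variant /<b\b[^>]*>(.*?)<\/b>/g accepts both forms, with \b preventing accidental matches on tags such as <br>.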
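The difference between the two attribute patterns can be sketched as follows. The input `html` and the variable names are hypothetical. The `/<b\b[^>]*>/` form is one possible generalisation rather than the only option, and it still fails if an attribute value itself contains a > character.

```javascript
// Hypothetical input mixing a bare <b> tag with tags that carry attributes.
var html = '<b>Bob</b> is <b class="bold">20</b> and likes <b id="x" class="bold">programming</b>.';

// Requires a space plus at least one attribute character, so the bare <b> is skipped.
var strictMatches = html.match(/<b [^>]+>(.*?)<\/b>/g) || [];
console.log(strictMatches.length); // 2

// \b stops "<b" from matching "<br>" or "<body>"; [^>]* makes attributes optional.
var flexible = /<b\b[^>]*>(.*?)<\/b>/g;
var texts = (html.match(flexible) || []).map(function (val) {
    // Remove the opening tag (with any attributes) and the closing tag.
    return val.replace(/<\/?b\b[^>]*>/g, '');
});
console.log(texts); // ["Bob", "20", "programming"]
```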

Comparative Analysis of exec() and match()

While the exec() method can achieve global matching through loops, match() is more concise for simple extraction scenarios. For instance, exec() requires iterative calls:

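var regex = /<b>(.*?)<\/b>/g;
var matches = [];
var match; // declared explicitly to avoid creating an implicit global
// Each exec() call resumes from regex.lastIndex and returns null when no match is left.
while ((match = regex.exec(str)) !== null) {
    matches.push(match[1]); // match[1] is the text captured by (.*?)
}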

This method directly captures grouped content without additional cleaning, though the code is slightly more verbose.
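In environments that support ES2020, String.prototype.matchAll() offers a middle ground: it exposes capture groups like exec() does, without the manual loop. This is an alternative not used in the original Q&A code, shown here only as a sketch; the sample string `text` and the name `captured` are assumptions.

```javascript
// Hypothetical sample input.
var text = "My name is <b>Bob</b>, I'm <b>20</b> years old, I like <b>programming</b>.";

// matchAll() requires the g flag and returns an iterator of match objects.
// Array.from() collects them, and the mapping function keeps capture group 1.
var captured = Array.from(text.matchAll(/<b>(.*?)<\/b>/g), function (m) {
    return m[1];
});
console.log(captured); // ["Bob", "20", "programming"]
```

When there are no matches, the iterator is simply empty, so the result is an empty array rather than null.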

Practical Applications and Considerations

In real-world development, regular expression extraction is suitable for lightweight parsing of simple, predictable markup. For nested tags or complex documents, a DOM parser such as the browser's DOMParser is the more reliable choice. Entity-escaped characters also need attention: if the markup itself arrives escaped (for example &lt;b&gt; instead of <b>), it must be decoded before matching or the pattern will not match, and entities inside the extracted text stay encoded until they are decoded separately. With these caveats in mind, the patterns in this article cover most simple text extraction needs.
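Both caveats can be reproduced in a few lines. The inputs and the helper `decodeBasicEntities` are hypothetical; the helper handles only three entities and is not a general HTML entity decoder.

```javascript
var pattern = /<b>(.*?)<\/b>/;

// Nested tags: the lazy group stops at the first </b>, cutting the inner element apart.
var nested = '<b>outer <b>inner</b> tail</b>';
var nestedText = pattern.exec(nested)[1];
console.log(nestedText); // "outer <b>inner"

// Entity-escaped markup contains no literal "<b>", so the pattern cannot match.
var escaped = '&lt;b&gt;Bob&lt;/b&gt;';
console.log(pattern.test(escaped)); // false

// Hypothetical helper: decodes &lt;, &gt; and &amp; only.
// &amp; is decoded last so that "&amp;lt;" becomes "&lt;" and is not decoded twice.
function decodeBasicEntities(s) {
    return s.replace(/&lt;/g, '<').replace(/&gt;/g, '>').replace(/&amp;/g, '&');
}

var decodedText = pattern.exec(decodeBasicEntities(escaped))[1];
console.log(decodedText); // "Bob"
```

The nested case has no simple regex fix, since the pattern cannot tell which closing tag belongs to which opening tag; this is where a DOM parser is the safer tool.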

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.