How Can I Remove An Entire Html Tag (and Its Contents) By Its Class Using A Regex?

I am not very good with regexes, but I am learning. I would like to remove an HTML tag by its class name. This is what I have so far:

Solution 1:

As other people said, HTML is notoriously tricky to deal with using regexes, and a DOM approach might be better. E.g.:

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file( 'yourdocument.html' );

for my $node ( $tree->findnodes( '//*[@class="footer"]' ) ) {
    $node->replace_with_content;   # removes the element but keeps its children; use $node->delete to drop the children too
}

print $tree->as_HTML;

Solution 2:

You will also want to allow for other attributes before and after class in the div tag:

<div[^>]*class="footer"[^>]*>(.*?)</div>

Also, go case-insensitive. You may need to escape things like the quotes, or the slash in the closing tag. What context are you doing this in?
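As an illustration of the pattern above with case-insensitive matching, here is a minimal sketch in Python (the question does not name a language, so Python is only an assumption made for the example; `strip_footer_divs` and the sample markup are made-up names). It is only safe when no other div is nested inside the footer div.

```python
import re

# [^>]* on both sides of class="footer" allows other attributes in the tag.
# re.IGNORECASE lets the pattern match <DIV ...> and CLASS="footer" as well.
FOOTER_DIV = re.compile(
    r'<div[^>]*class="footer"[^>]*>(.*?)</div>',
    re.IGNORECASE,
)

def strip_footer_divs(html: str) -> str:
    """Remove each footer div together with its contents (flat markup only)."""
    return FOOTER_DIV.sub("", html)

sample = '<p>Body</p><DIV id="f" CLASS="footer">Copyright</DIV><p>End</p>'
print(strip_footer_divs(sample))  # <p>Body</p><p>End</p>
```

Inside a Python raw string neither the quotes nor the slash need escaping; other languages and regex delimiters may differ.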

Also note that HTML parsing with regular expressions can be very nasty, depending on the input. A good point is brought up in an answer below - suppose you have a structure like:

<div><div class="footer"><div>Hi!</div></div></div>

Trying to build a regex for that is a recipe for disaster. Your best bet is to load the document into a DOM, and perform manipulations on that.
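To see why, here is a small Python sketch (Python is an assumption, used only for illustration) that runs a lazy and a greedy version of the pattern against that nested structure. Neither version finds the closing tag that actually belongs to the footer div, so both leave unbalanced markup behind.

```python
import re

nested = '<div><div class="footer"><div>Hi!</div></div></div>'

# Lazy: stops at the first </div>, which closes the inner div, not the footer.
lazy = re.compile(r'<div[^>]*class="footer"[^>]*>(.*?)</div>', re.I | re.S)
# Greedy: runs to the last </div>, which closes the outer wrapper.
greedy = re.compile(r'<div[^>]*class="footer"[^>]*>(.*)</div>', re.I | re.S)

lazy_result = lazy.sub("", nested)
greedy_result = greedy.sub("", nested)

print(lazy_result)    # <div></div></div>  (one closing tag too many)
print(greedy_result)  # <div>              (the wrapper is never closed)
```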

Pseudocode that should map closely to XML::DOM:

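document = // load and parse the document
divs = document.getElementsByTagName("div");
for (div in divs) {
    if (div.getAttribute("class") == "footer") {
        parent = div.getParentNode();
        for (child in div.getChildNodes()) {
            // filter attribute types?
            // insertBefore takes the new node first, then the reference node
            parent.insertBefore(child, div);
        }
        // the div is empty now, so removing it keeps its former children
        parent.removeChild(div);
    }
}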


Perl has libraries for this, such as HTML::DOM and XML::DOM. .NET has built-in libraries to handle DOM parsing.
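For comparison, the steps in the pseudocode above can be sketched with Python's standard `xml.dom.minidom` (Python is an assumption here, and `unwrap_footer_divs` is a made-up name). This sketch assumes the input is well-formed XML/XHTML and that the footer div is not the root element; real-world HTML would usually need a more forgiving HTML parser.

```python
from xml.dom import minidom

def unwrap_footer_divs(markup: str) -> str:
    """Remove each <div class="footer"> but keep its children in place."""
    doc = minidom.parseString(markup)
    # Copy the node lists first, because the tree is modified inside the loops.
    for div in list(doc.getElementsByTagName("div")):
        if div.getAttribute("class") == "footer":
            parent = div.parentNode
            for child in list(div.childNodes):
                # insertBefore(new_node, reference_node) moves the child
                # out of the div and places it just in front of the div.
                parent.insertBefore(child, div)
            parent.removeChild(div)
    return doc.documentElement.toxml()

print(unwrap_footer_divs('<div><div class="footer"><div>Hi!</div></div></div>'))
# <div><div>Hi!</div></div>
```

To drop the footer's contents as well, skip the inner loop and call only `parent.removeChild(div)`.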

Solution 3:

In Perl you need the /s modifier, otherwise the dot won't match a newline.
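The same behaviour shows up in other engines. As a hedged illustration, in Python the counterpart of Perl's /s is the `re.DOTALL` flag; the sample markup below is made up for the example.

```python
import re

html = '<div class="footer">\n  Copyright\n</div>'
pattern = r'<div class="footer".*?>(.*?)</div>'

# Without the flag, "." stops at the first newline, so nothing matches.
without_flag = re.search(pattern, html)
# With re.DOTALL, "." also matches newlines and the whole div is found.
with_flag = re.search(pattern, html, re.DOTALL)

print(without_flag)              # None
print(repr(with_flag.group(1)))  # '\n  Copyright\n'
```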

That said, using a proper HTML or XML parser to remove unwanted parts of an HTML file is much more appropriate.

Solution 4:

It partly depends on the exact regex engine you are using (which language, etc.). One possibility is that you need to escape the quotes and/or the forward slash. You might also want to make it case-insensitive.

<div class=\"footer\".*?>(.*?)<\/div>

Otherwise, please say what language/platform you are using: .NET, Java, Perl ...

Solution 5:

Try this:

<([^\s>]+).*?class="footer".*?>([\s\S]*?)</([^\s>]+)>

Your biggest problem is going to be nested tags. For example:

<div class="footer"><b></b></div>

The regexp given would match everything through the </b>, leaving the </div> dangling on the end. You will have to either assume that the tag you're looking for has no nested elements, or you will need to use some sort of parser from HTML to DOM and an XPath query to remove an entire sub-tree.
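As a rough sketch of the parser-plus-XPath route, here is a version using Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset (Python and the helper name `remove_by_class` are assumptions made for illustration). It assumes well-formed XML/XHTML input, matches the class attribute only when it equals the given name exactly, and never removes the root element.

```python
import xml.etree.ElementTree as ET

def remove_by_class(markup: str, class_name: str) -> str:
    """Remove every element whose class attribute equals class_name, subtree included."""
    root = ET.fromstring(markup)
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for node in root.findall(f".//*[@class='{class_name}']"):
        parent = parent_of[node]
        index = list(parent).index(node)
        # Text that follows the element (its "tail") is not part of the
        # subtree, so hand it to the previous sibling or to the parent.
        if node.tail:
            if index > 0:
                previous = parent[index - 1]
                previous.tail = (previous.tail or "") + node.tail
            else:
                parent.text = (parent.text or "") + node.tail
        parent.remove(node)
    return ET.tostring(root, encoding="unicode")

page = '<body><p>Text</p><div class="footer"><b>Bye</b></div> tail</body>'
print(remove_by_class(page, "footer"))  # <body><p>Text</p> tail</body>
```

Because the parser knows where each element really ends, nested tags inside the footer are removed along with it and no closing tag is left dangling.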
