JavaScript RegExp Lookbehind Alternative?
Solution 1:
First of all, note that most people prefer to parse HTML with a DOM parser, since a regex can break on markup it was not written for. That being said, for this straightforward task (no nesting), here is what you can do with a regex.
Use Capture Groups
JavaScript has no \K, and lookbehind only arrived with ES2018, so older engines lack it. Either way, we can capture what we want in a capture group, then retrieve the match from that group, ignoring the rest.
This regex captures the title to Group 1:
<a [^>]*?(title="[^"]*")
On the demo, look at the Group 1 captures in the right pane: that's what we are interested in.
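To make the trade-off concrete, here is a minimal sketch that runs the capture-group regex next to a lookbehind version. The sample string and both function names are made up for illustration, and the lookbehind form assumes an engine that implements ES2018 lookbehind.

```javascript
// Sample input is an assumption invented for this sketch.
var sample = '<a href="x" title="first"><img src="y" title="not a link">';

// Capture-group form: the wanted text lands in group 1.
function titlesViaCapture(html) {
  var re = /<a [^>]*?(title="[^"]*")/g;
  var out = [];
  var m;
  while ((m = re.exec(html)) !== null) {
    out.push(m[1]);
  }
  return out;
}

// Lookbehind form: the whole match is already the wanted text.
// Built with new RegExp so the pattern is only compiled when called.
function titlesViaLookbehind(html) {
  var re = new RegExp('(?<=<a [^>]*?)title="[^"]*"', 'g');
  return html.match(re) || [];
}
```

On this sample both functions should return the same single entry, the title of the `<a>` tag, and skip the `<img>` title. The capture-group form is the safer pick when older engines have to be supported.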
Sample JavaScript Code
var unique_results = [];
var yourString = 'your_test_string';
var myregex = /<a [^>]*?(title="[^"]*")/g;
var thematch = myregex.exec(yourString);
while (thematch != null) {
    // is it unique?
    if (unique_results.indexOf(thematch[1]) < 0) {
        // add it to array of unique results
        unique_results.push(thematch[1]);
        document.write(thematch[1], "<br />");
    }
    // match the next one
    thematch = myregex.exec(yourString);
}
Explanation
<a matches the beginning of the tag
[^>]*? lazily matches any chars that are not a >, up to...
( capture group
title=" literal chars
[^"]* any chars that are not a quote
" closing quote
) end Group 1
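Going by the breakdown above, moving the parentheses inside the quotes should capture only the title's value instead of the whole title="..." attribute. This is a small sketch of that variation; the function name and the test string are hypothetical.

```javascript
// Collects unique title values (without the title="..." wrapper) from <a> tags.
function uniqueAnchorTitleValues(html) {
  // Group 1 now sits between the quotes, so it holds just the value.
  var re = /<a [^>]*?title="([^"]*)"/g;
  var seen = [];
  var m;
  while ((m = re.exec(html)) !== null) {
    // keep first occurrences only
    if (seen.indexOf(m[1]) < 0) {
      seen.push(m[1]);
    }
  }
  return seen;
}
```

For example, calling it on '<a title="A" href="#"><a href="#" title="B"><a title="A">' should give back A and B once each.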
Solution 2:
I am not sure if you can do this with a single regular expression in JavaScript; however, you could do something like this:
var str = '\
<a href="www.google.com" title="some title">\
<a href="www.google.com" title="some other title">\
<a href="www.google.com">\
<img href="www.google.com" title="some title">\
';
var matches = [];
//-- somewhat hacky use of .replace() in order to utilize the callback on each <a> tag
str.replace(/<a\s[^>]*>/gi, function (match) {
    //-- if the <a> tag includes a title, push its value onto matches;
    //-- [^"]* stops at the closing quote, so later attributes are not swallowed
    var title = match.match(/title="([^"]*)"/i);
    if (title) { matches.push(title[1]); }
    return match; //-- leave the text unchanged
});
document.body.innerText = JSON.stringify(matches);
You should probably utilize the DOM for this, rather than regular expressions:
var str = '\
<a href="www.google.com" title="some title">Some Text</a>\
<a href="www.google.com" title="some other title">Some Text</a>\
<a href="www.google.com">Some Text</a>\
<img href="www.google.com" title="some title"/>\
';
var div = document.createElement('div');
div.innerHTML = str;
var titles = Array.prototype.map.call(div.querySelectorAll('a[title]'), function (item) { return item.title; });
document.body.innerText = titles;
Solution 3:
I'm not sure where your HTML sources come from, but I do know some browsers do not preserve the casing (or the attribute order) of the source when it is fetched as 'innerHTML'.
Also, both authors and browsers can use single or double quotes.
These are the two most common cross-browser pitfalls that I know of.
Thus, you could try: /<a [^>]*?title=(['"])([\s\S]*?)\1/gi
It performs a non-greedy case-insensitive search using a back-reference to solve the case of single vs double quotes.
The first part is already explained by zx81's answer. \1 matches the first capturing group, thus it matches the used opening quote. Note that inside a character class \1 is not a back-reference (it is read as an octal escape), so the title body is written as a lazy [\s\S]*? which stops at the first matching closing quote. Now the second capturing group should contain the bare title-string.
A simple example:
var rxp=/<a [^>]*?title=(['"])([\s\S]*?)\1/gi
, res=[]
, tmp
;
while( tmp=rxp.exec(str) ){ // str is your string
res.push( tmp[2] ); //example of adding the strings to an array.
}
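As a self-contained sketch of the quote-matching idea, the loop can be wrapped in a function and fed a string that mixes casing and quote styles. The function name and the sample input are assumptions for illustration, and the title body is matched here with a lazy any-character pattern.

```javascript
// Returns the bare title strings of <a> tags, accepting ' or " as the quote.
function anchorTitlesAnyQuote(html) {
  // Group 1 is the opening quote, \1 requires the same quote to close,
  // group 2 is the title text in between.
  var rxp = /<a [^>]*?title=(['"])([\s\S]*?)\1/gi;
  var res = [];
  var tmp;
  while ((tmp = rxp.exec(html)) !== null) {
    res.push(tmp[2]);
  }
  return res;
}
```

With an input such as `<A HREF='x' TITLE='it is "quoted"'><a title="plain">`, the first title keeps its inner double quotes because only a single quote can close it.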
However as pointed out by others, it really is bad (in general) to regex tag-soup (aka HTML). Robert Messerle's alternative (using the DOM) is preferable!
Warning (I almost forgot)..
IE6 (and others?) has this nice 'memory-saving feature' of conveniently deleting all unneeded quotes (around attribute values that contain no spaces). So, there, this regex (and zx81's) would fail, since they rely on the use of quotes! Back to the drawing-board.. (a seemingly never-ending process when regexing HTML).
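One way to cope with missing quotes, sketched under the assumption that an unquoted value simply runs until the next whitespace or >, is to give the regex three alternatives: double-quoted, single-quoted, and bare. The function name is hypothetical, and this is still a regex over tag soup, not a parser.

```javascript
// Returns <a> title values whether they are double-quoted, single-quoted or unquoted.
function anchorTitlesLoose(html) {
  // Exactly one of groups 1, 2, 3 takes part in each match:
  //   1 = "double quoted", 2 = 'single quoted', 3 = unquoted (no spaces)
  var re = /<a\s[^>]*?title\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/gi;
  var out = [];
  var m;
  while ((m = re.exec(html)) !== null) {
    if (m[1] !== undefined) {
      out.push(m[1]);
    } else if (m[2] !== undefined) {
      out.push(m[2]);
    } else {
      out.push(m[3]);
    }
  }
  return out;
}
```

The DOM approach remains the more reliable route; this only shows how far the regex has to grow to cover one more browser quirk.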