
rvest Package read_html() Function Stops Reading at "<" Symbol

I was wondering if this behavior is intentional in the rvest package. When rvest sees the < character it stops reading the HTML. library(rvest) read_html('

Solution 1:

Yes, this is expected. rvest behaves this way because that is how HTML itself works.

See the w3schools HTML Entities page. < and > are reserved characters in HTML, so their literal values have to be written as character entities instead. Here is the entity table from the linked page, listing some commonly used characters and their respective HTML entities.

XML::readHTMLTable("http://www.w3schools.com/html/html_entities.asp", which = 2)
#    Result          Description Entity Name Entity Number
# 1           non-breaking space      &nbsp;        &#160;
# 2       <            less than        &lt;         &#60;
# 3       >         greater than        &gt;         &#62;
# 4       &            ampersand       &amp;         &#38;
# 5       ¢                 cent      &cent;        &#162;
# 6       £                pound     &pound;        &#163;
# 7       ¥                  yen       &yen;        &#165;
# 8       €                 euro      &euro;       &#8364;
# 9       ©            copyright      &copy;        &#169;
# 10      ® registered trademark       &reg;        &#174;

So you will have to replace those values, perhaps with gsub() or manually if there aren't too many. You can see that it will parse properly when those characters are replaced with the correct entity.

library(XML)
doc <- htmlParse("<html><title>under 30 years = &lt; 30 years </title></html>")
xmlValue(doc["//title"][[1]])
# [1] "under 30 years = < 30 years "

You could use gsub() on the raw text before parsing, something like the following:

txt <- "<html><title>under 30 years = < 30 years </title></html>"
xmlValue(htmlParse(gsub(" < ", " &lt; ", txt, fixed = TRUE))["//title"][[1]])
# [1] "under 30 years = < 30 years "
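
The fixed pattern above only catches a < that has a space on both sides. If the stray character can also appear as "<30" or at the end of a string, a regex with a negative lookahead is one possible generalization. This is only a heuristic sketch in base R: the helper name escape_bare_lt is made up for illustration, and it assumes that any < not followed by a letter, "/", "!" or "?" is literal text rather than the start of a tag. It will miss cases such as "a<b", where the < is followed by a letter.

```r
# Hypothetical helper (not part of any package): escape "<" characters
# that do not look like the start of a tag, closing tag, comment/doctype,
# or processing instruction.
escape_bare_lt <- function(html) {
  # perl = TRUE is needed for the (?! ... ) negative lookahead
  gsub("<(?![A-Za-z/!?])", "&lt;", html, perl = TRUE)
}

txt <- "<html><title>under 30 years = < 30 years </title></html>"
escaped <- escape_bare_lt(txt)
escaped
# [1] "<html><title>under 30 years = &lt; 30 years </title></html>"
```

The escaped string can then be passed to htmlParse() as in the example above.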

I used the XML package here, but the same applies to other packages that process HTML, including rvest.
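
For the rvest case from the question, the same replacement could be applied before calling read_html(). This is a minimal sketch, assuming rvest 1.0.0 or later (where html_element() is available; older versions would use html_node() instead) and reusing the example string from above rather than the question's original input.

```r
library(rvest)

txt <- "<html><title>under 30 years = < 30 years </title></html>"

# Replace the bare " < " with its entity before handing the text to the parser
fixed_txt <- gsub(" < ", " &lt; ", txt, fixed = TRUE)

# Parse the escaped string and pull out the title text
title_text <- html_text(html_element(read_html(fixed_txt), "title"))
title_text
```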
