Apache Tika: How to Extract HTML Body Without Header and Footer Content
I want to extract the entire body content of an HTML page, excluding the header and footer. However, I am getting the exception org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not
Solution 1:
Found a solution: boilerpipe, which is based on research into boilerplate detection. It is integrated with Apache Tika and can be run with the Java code below.
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.apache.tika.metadata.Metadata;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class NewtikaXpath {
    public static void main(String args[]) throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        ContentHandler textHandler = new BodyContentHandler();
        Metadata xmetadata = new Metadata();
        try (InputStream stream = TikaInputStream.get(new URL("your favourite url"))) {
            parser.parse(stream, new BoilerpipeContentHandler(textHandler), xmetadata);
            System.out.println("text:\n" + textHandler.toString());
        }
    }
}
A simple demo of boilerpipe detection is available at.. and more information is available at..
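For intuition about why this kind of detection drops headers and footers, here is a toy sketch of the general idea: score each text block by shallow features such as word count and link density, and keep only blocks that look like running prose. This is not boilerpipe's actual classifier and it does not use the Tika API. The class name BoilerplateSketch, the Block type, and both thresholds are hypothetical values chosen purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class BoilerplateSketch {

    /** One block of page text plus the number of its words that sit inside links. */
    public static final class Block {
        final String text;
        final int words;
        final int linkedWords;

        public Block(String text, int linkedWords) {
            this.text = text;
            String trimmed = text.trim();
            // Count whitespace-separated tokens; an empty block has zero words.
            this.words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
            this.linkedWords = linkedWords;
        }

        /** Fraction of the block's words that are link text (0.0 for an empty block). */
        public double linkDensity() {
            return words == 0 ? 0.0 : (double) linkedWords / words;
        }
    }

    // Hypothetical thresholds, picked for this illustration only.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    /** A block counts as content if it is long enough and not dominated by links. */
    public static boolean isContent(Block b) {
        return b.words >= MIN_WORDS && b.linkDensity() <= MAX_LINK_DENSITY;
    }

    /** Returns the text of every block that passes the content test, in order. */
    public static List<String> keepContent(List<Block> blocks) {
        List<String> kept = new ArrayList<>();
        for (Block b : blocks) {
            if (isContent(b)) {
                kept.add(b.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Block> page = List.of(
            // Navigation bar: short and entirely links.
            new Block("Home Products About Contact", 4),
            // Article paragraph: longer, with a single linked word.
            new Block("Apache Tika detects and extracts text and metadata "
                    + "from many different file types", 1),
            // Link list: long enough, but mostly links.
            new Block("Related posts one two three four five six seven eight nine ten", 10),
            // Footer: short.
            new Block("Copyright Example Corp Privacy Terms", 2));

        for (String text : keepContent(page)) {
            System.out.println(text);
        }
    }
}
```

Short, link-heavy blocks such as navigation bars and footers fail one test or the other, while ordinary paragraphs pass both. A real detector works on the parsed HTML and uses a trained model rather than two fixed cut-offs.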