Apache Tika: How To Extract HTML Body Without Header And Footer Content

February 02, 2024

I am looking to extract the entire body content of an HTML page, excluding the header and footer. However, I am getting the exception org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not

Solution 1:

Found a solution based on research into boilerpipe detection. It is integrated with Apache Tika and can be run with the Java code below.

import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class NewtikaXpath {
    public static void main(String[] args) throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        // Collects the plain text that survives boilerplate filtering.
        ContentHandler textHandler = new BodyContentHandler();
        Metadata xmetadata = new Metadata();
        // Replace the placeholder with the URL of the page to extract.
        try (InputStream stream = TikaInputStream.get(new URL("your favourite url"))) {
            // BoilerpipeContentHandler filters out boilerplate such as headers and
            // footers before passing the remaining content on to textHandler.
            parser.parse(stream, new BoilerpipeContentHandler(textHandler), xmetadata);
            System.out.println("text:\n" + textHandler.toString());
        }
    }
}

A simple demo of boilerpipe detection is available at.. and more information is available at..
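
To give a rough feel for what boilerplate detection does, here is a minimal sketch in plain Java. It is not boilerpipe's actual algorithm and it does not use Tika. It only illustrates the general idea of classifying text blocks by shallow features such as word count and link density. The class name BoilerplateSketch, the Block type, and both thresholds are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy block filter; a simplified illustration, NOT boilerpipe's real classifier. */
public class BoilerplateSketch {

    /** Hypothetical model of a text block: its text plus the number of words inside links. */
    public static final class Block {
        final String text;
        final int linkedWords;

        public Block(String text, int linkedWords) {
            this.text = text;
            this.linkedWords = linkedWords;
        }

        int wordCount() {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }

        /** Fraction of the block's words that are link text. */
        double linkDensity() {
            int words = wordCount();
            return words == 0 ? 0.0 : (double) linkedWords / words;
        }
    }

    // Assumed thresholds, chosen only for this example.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    /** A block counts as content if it is long enough and not dominated by links. */
    public static boolean isContent(Block block) {
        return block.wordCount() >= MIN_WORDS
                && block.linkDensity() <= MAX_LINK_DENSITY;
    }

    /** Returns the text of every block classified as content, in order. */
    public static List<String> keepContent(List<Block> blocks) {
        List<String> kept = new ArrayList<>();
        for (Block block : blocks) {
            if (isContent(block)) {
                kept.add(block.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Block> page = Arrays.asList(
                new Block("Home About Contact Login", 4),
                new Block("Apache Tika parses many document formats and exposes the "
                        + "extracted text through SAX content handlers.", 0),
                new Block("Copyright 2024 Example Site Privacy Terms", 2));
        // Only the long, link-free middle block passes the filter.
        keepContent(page).forEach(System.out::println);
    }
}
```

Short, link-heavy blocks such as navigation bars and footers fail the check, while longer running text passes. A real extractor works on parsed HTML and uses more refined rules, which is why the Tika handler above is the practical choice.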
