
I Want To Get All Article Content From All Links Inside A Website

I want to extract all the article content from a website using any web crawling/scraping method. The problem is that I can get the content of a single page, but not of the pages it links to.

Solution 1:

This is the solution:

package com.github.davidepastore.stackoverflow34014436;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLConnection;

import javax.swing.text.BadLocationException;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Stackoverflow 34014436 question.
 *
 */
public class App {
    public static void main(String[] args) throws URISyntaxException,
            IOException, BadLocationException {
        // HTMLDocument whose parser callback prints every text node it encounters.
        HTMLDocument doc = new HTMLDocument() {
            public HTMLEditorKit.ParserCallback getReader(int pos) {
                return new HTMLEditorKit.ParserCallback() {
                    public void handleText(char[] data, int pos) {
                        System.out.println(data);
                    }
                };
            }
        };

        URL url = new URI("http://tamilblog.ishafoundation.org/").toURL();
        URLConnection conn = url.openConnection();
        Reader rd = new InputStreamReader(conn.getInputStream());
        OutputStreamWriter writer = new OutputStreamWriter(
                new FileOutputStream("ram.txt"), "UTF-8");

        // Read the start page through the Swing HTML parser; handleText()
        // above prints the text it finds.
        EditorKit kit = new HTMLEditorKit();
        kit.read(rd, doc, 0);
        try {
            // Fetch the same page with Jsoup to collect its links and element text.
            Document docs = Jsoup.connect(
                    "http://tamilblog.ishafoundation.org/").get();

            Elements links = docs.select("a[href]");

            Elements elements = docs.select("*");
            System.out.println("Total Links :" + links.size());

            for (Element element : elements) {
                System.out.println(element.ownText());
            }
            for (Element link : links) {
                String hrefUrl = link.attr("href");
                if (!"#".equals(hrefUrl) && !hrefUrl.isEmpty()) {
                    System.out.println(" * a: link :" + hrefUrl);
                    System.out.println(" * a: text :" + link.text());
                    writer.write(link.text() + " => " + hrefUrl + "\n");
                }
            }

        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            writer.close();
        }
    }
}

Here the writer writes the text and URL of every link into the ram.txt file.
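The code above only lists the links; the question also asks for the article content behind each one. Below is a minimal sketch of that second step, again using Jsoup: it revisits every same-site link found on the start page and writes the page title and body text to a file. The output file name (articles.txt) and the use of body().text() as the content selector are assumptions, not part of the original answer; a real extractor would target the site's specific article container.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/**
 * Sketch: follow every link on the start page and save each linked page's text.
 */
public class FollowLinksSketch {
    public static void main(String[] args) throws IOException {
        String startUrl = "http://tamilblog.ishafoundation.org/";
        Document index = Jsoup.connect(startUrl).get();

        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream("articles.txt"), "UTF-8")) {
            for (Element link : index.select("a[href]")) {
                // absUrl resolves relative links; keep only same-site URLs.
                String href = link.absUrl("href");
                if (href.isEmpty() || !href.startsWith(startUrl)) {
                    continue;
                }
                try {
                    Document article = Jsoup.connect(href).get();
                    writer.write("== " + article.title() + " (" + href + ")\n");
                    // body().text() grabs all visible text; swap in a more
                    // specific selector if you know the article container.
                    writer.write(article.body().text() + "\n\n");
                } catch (IOException e) {
                    System.err.println("Could not fetch " + href + ": " + e.getMessage());
                }
            }
        }
    }
}

The sketch stays one level deep and filters to the same host to keep things simple; for recursive crawling with politeness and deduplication, the existing crawlers mentioned in Solution 2 are the safer route.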

Solution 2:

Rather than writing your own crawler, use an existing one such as Apache Nutch or StormCrawler: they already handle link discovery, crawl depth, politeness, and URL deduplication.
