I Want To Get All Article Content From All Links Inside A Website
I want to extract all article content from a website using any web crawling/scraping method. The problem is that I can get content from a single page, but not from the pages it links to.
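For reference, the single-page case is straightforward with Jsoup; a minimal sketch (the URL comes from the answer below, and printing body().text() is an assumption about what "content" means), from which the linked pages still need to be visited:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SinglePage {
    public static void main(String[] args) throws IOException {
        // Fetch a single page and print its visible text.
        Document doc = Jsoup.connect("http://tamilblog.ishafoundation.org/").get();
        System.out.println(doc.title());
        System.out.println(doc.body().text());
    }
}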
Solution 1:
The following program prints the page text and writes every link's text and target URL to a file:
package com.github.davidepastore.stackoverflow34014436;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLConnection;

import javax.swing.text.BadLocationException;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Stackoverflow 34014436 question.
 */
public class App {

    public static void main(String[] args) throws URISyntaxException,
            IOException, BadLocationException {
        // HTMLDocument whose parser callback prints every run of text it sees.
        HTMLDocument doc = new HTMLDocument() {
            @Override
            public HTMLEditorKit.ParserCallback getReader(int pos) {
                return new HTMLEditorKit.ParserCallback() {
                    @Override
                    public void handleText(char[] data, int pos) {
                        System.out.println(data);
                    }
                };
            }
        };
        URL url = new URI("http://tamilblog.ishafoundation.org/").toURL();
        URLConnection conn = url.openConnection();
        Reader rd = new InputStreamReader(conn.getInputStream());
        OutputStreamWriter writer = new OutputStreamWriter(
                new FileOutputStream("ram.txt"), "UTF-8");
        EditorKit kit = new HTMLEditorKit();
        // Parse the page with the Swing HTML parser; handleText above prints its text.
        kit.read(rd, doc, 0);
        try {
            // Fetch the same page with Jsoup to walk its DOM.
            Document docs = Jsoup.connect(
                    "http://tamilblog.ishafoundation.org/").get();
            Elements links = docs.select("a[href]");
            Elements elements = docs.select("*");
            System.out.println("Total Links :" + links.size());
            // Print the own text of every element on the page.
            for (Element element : elements) {
                System.out.println(element.ownText());
            }
            // Record the text and target of every non-empty, non-anchor link.
            for (Element link : links) {
                String hrefUrl = link.attr("href");
                if (!"#".equals(hrefUrl) && !hrefUrl.isEmpty()) {
                    System.out.println(" * a: link :" + hrefUrl);
                    System.out.println(" * a: text :" + link.text());
                    writer.write(link.text() + " => " + hrefUrl + "\n");
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            writer.close();
        }
    }
}
Here we use the writer to save the text and target URL of every link to the ram.txt file.
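Note that this saves only each link's text and target URL, not the article content behind it. A hedged sketch of how the same loop could be extended to fetch each linked page and append its text to a file (the articles.txt name and the 10-second timeout are assumptions, and body().text() is a stand-in for a site-specific article selector):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ArticleSaver {
    public static void main(String[] args) throws IOException {
        Document index = Jsoup.connect("http://tamilblog.ishafoundation.org/").get();
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream("articles.txt"), "UTF-8")) {
            for (Element link : index.select("a[href]")) {
                String raw = link.attr("href");
                if (raw.isEmpty() || "#".equals(raw)) {
                    continue;
                }
                String href = link.absUrl("href"); // resolve relative URLs
                try {
                    // Fetch the linked page and store its visible text.
                    Document article = Jsoup.connect(href).timeout(10000).get();
                    writer.write(link.text() + " => " + href + "\n");
                    writer.write(article.body().text() + "\n\n");
                } catch (IOException e) {
                    System.err.println("Skipping " + href + ": " + e.getMessage());
                }
            }
        }
    }
}

To cover a whole site rather than one level of links, you would repeat this recursively with a set of already-visited URLs, which is exactly the kind of bookkeeping the crawlers in Solution 2 provide.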
Solution 2:
You should use an existing crawler such as Apache Nutch or StormCrawler. Both already handle the hard parts of crawling a whole site: discovering links across pages, politeness toward the server (robots.txt, crawl delays), retries, and deduplication of already-visited URLs.