Extracting "hidden" Html With Jsoup

I am trying to get at HTML data that does not appear in the source document but can be exposed, for example, by 'inspect element' in Google Chrome. Example page: http://assignmen

Solution 1:

The data seems to be loaded with AJAX. Jsoup does not execute JavaScript, so it only sees the HTML the server originally sent.

What you need is a "headless browser" API, one that executes JavaScript without actually rendering anything on screen.

HtmlUnit seems to be the best-known tool, although I've never used it myself. As suggested before, Selenium WebDriver is also an option.

I believe you will have to load the URL and wait for all the AJAX requests to finish. You should then have almost the same parse tree in Java that you see in Chrome's inspector, to do with as you wish.

Solution 2:

If this is the only information you need, here is the URL of the JSON endpoint that serves it:

http://prod-proxy-lb-2117675230.us-east-1.elb.amazonaws.com/solr/aotw/search?json.wrf=jQuery1102004354461841285229_1448413727331&q=9000000&facet.date.other=before&rows=20&start=0&wt=json&facet.date.start=NOW%2FYEAR-50YEARS&fl=id%2CreelNo%2CframeNo%2CconveyanceText%2CpatAssigneeName%2CpatAssignorName%2CinventionTitleFirst%2CapplNumFirst%2CpublNumFirst%2CpatNumFirst%2CintlRegNumFirst%2CcorrName%2CcorrAddress1%2CcorrAddress2%2CcorrAddress3%2CpatAssignorEarliestExDate%2CfilingDateFirst%2CpublDateFirst%2CissueDateFirst%2CintlPublDateFirst%2CpatNumSize&hl.fl=reelNo%2CframeNo%2CpatAssigneeName%2CpatAssignorName%2CconveyanceText%2CinventionTitleFirst%2CapplNumFirst%2CpublNumFirst%2CpatNumFirst%2CintlRegNumFirst%2CcorrName%2CcorrAddress1%2CcorrAddress2%2CcorrAddress3&hl.requireFieldMatch=true&sort=patAssignorEarliestExDate+desc%2C+id+desc

This was found by inspecting the Network tab of Chrome's developer tools. You can fetch the contents of this URL with an HTTP client such as HttpURLConnection, then parse the returned JSON to retrieve whatever information you need.
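As a rough illustration of that approach, here is a minimal sketch using only the Java standard library. The class name `JsonEndpointFetch` and the helpers `fetch` and `stripJsonp` are hypothetical names chosen for this example. One detail worth noting: the URL above carries a `json.wrf=jQuery...` parameter, which asks Solr to wrap the response in a JavaScript callback (JSONP), so the body will probably look like `jQuery...({...})` rather than bare JSON. The sketch strips that wrapper; removing the `json.wrf` parameter from the URL would likely have the same effect, though that is an assumption about this particular server.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class JsonEndpointFetch {

    /** Downloads the body of a GET request and returns it as a UTF-8 string. */
    static String fetch(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10_000); // milliseconds; arbitrary choice
        conn.setReadTimeout(10_000);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }

    /** Removes a JSONP wrapper such as callback({...}) and returns the bare JSON text. */
    static String stripJsonp(String body) {
        String trimmed = body.trim();
        boolean alreadyJson = trimmed.startsWith("{") || trimmed.startsWith("[");
        int open = trimmed.indexOf('(');
        int close = trimmed.lastIndexOf(')');
        if (alreadyJson || open < 0 || close < open) {
            return trimmed; // plain JSON, or no recognizable wrapper
        }
        return trimmed.substring(open + 1, close).trim();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java JsonEndpointFetch <json-url>");
            return;
        }
        // Pass the endpoint URL found in the Network tab as the only argument.
        System.out.println(stripJsonp(fetch(args[0])));
    }
}
```

The standard library has no JSON parser, so the resulting string would still need to be handed to a library such as Jackson, Gson, or org.json to pull out individual fields.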
