Java - Why can't Apache HttpClient 4.2 retrieve this page?
I'm trying to retrieve a page using Apache HttpClient: http://quick-dish.tablespoon.com/
Unfortunately, when I try this, it returns the following (as returned by Jsoup; it's returning the "HTTP..." string itself):
<html> <head></head> <body> HTTP/1.1 200 OK [Server: nginx/1.0.11, Content-Type: text/html;charset=utf-8, Last-Modified: Mon, 02 Jul 2012 15:30:40 GMT, Vary: Accept-Encoding, Cookie,Accept-Encoding, X-Powered-By: PHP/5.3.6, X-Pingback: http://quick-dish.tablespoon.com/xmlrpc.php, X-Powered-By: ASP.NET, Content-Encoding: gzip, X-Blz: lb1.blaze.io, Date: Mon, 02 Jul 2012 16:06:21 GMT, Content-Length: 11723, Connection: keep-alive] </body> </html>
Here is the code (note that I'm emulating Googlebot, as I've found web servers tend to be better behaved that way):
URL sourceURL = new URL("http://quick-dish.tablespoon.com/");
HttpClient httpClient = new ContentEncodingHttpClient();
httpClient.getParams().setBooleanParameter("http.protocol.handle-redirects", true);
final HttpGet httpGet = new HttpGet(sourceURL.toURI());
httpGet.setHeader("User-Agent", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)");
httpGet.setHeader("Accept", "text/html");
httpGet.setHeader("Accept-Charset", "utf-8");
final HttpResponse response = httpClient.execute(httpGet);
return Jsoup.parse(response.toString());
Needless to say, the page returns fine in a web browser. Any ideas?
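Instead of calling toString() on the response, you need to get the response entity. response.toString() only prints the status line and headers, which is exactly the text Jsoup wrapped in the output above.
// Get hold of the response entity
HttpEntity entity = response.getEntity();
Then you can get the contents of that.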
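As a minimal sketch of what "get the contents of that" could look like, the snippet below reads the entity with EntityUtils.toString and hands the resulting HTML to Jsoup. The class name PageFetcher and the helpers readBody and fetch are hypothetical names made up for illustration, and passing "UTF-8" as the fallback charset is an assumption. It expects HttpClient 4.2 and Jsoup on the classpath.

```java
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.ContentEncodingHttpClient;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, not part of the original question or answer.
class PageFetcher {

    // Returns the response body as text, or "" if the response has no entity.
    static String readBody(HttpResponse response) throws IOException {
        HttpEntity entity = response.getEntity();
        if (entity == null) {
            return "";
        }
        // Uses the charset from the Content-Type header when present;
        // "UTF-8" is only the fallback (an assumption for this sketch).
        return EntityUtils.toString(entity, "UTF-8");
    }

    // Fetches the URL and parses the body (not the response object) with Jsoup.
    static Document fetch(String url) throws IOException {
        HttpClient httpClient = new ContentEncodingHttpClient();
        try {
            HttpResponse response = httpClient.execute(new HttpGet(url));
            // The URL is passed as the base URI so relative links can be resolved.
            return Jsoup.parse(readBody(response), url);
        } finally {
            // Release the connections held by this client.
            httpClient.getConnectionManager().shutdown();
        }
    }

    public static void main(String[] args) throws IOException {
        Document doc = fetch("http://quick-dish.tablespoon.com/");
        System.out.println(doc.title());
    }
}
```

Because ContentEncodingHttpClient wraps gzip-encoded entities, the text returned by EntityUtils.toString should already be decompressed, so no manual GZIPInputStream handling is sketched here.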