java.net question: Size of downloaded document
Got a question about java.net
I am trying to create a simple application that connects to a URL, downloads an HTML page and prints the size of the page.
Seemed pretty simple when I started out.
Using URLConnection.getInputStream(), I read in the document.
Then, I find out the length of the document using URLConnection.getContentLength(), which returns the size of the HTML doc in bytes.
So far, so good.
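For reference, a minimal sketch of that step. It assumes nothing beyond the standard java.net and java.io classes; the class name PageSize and the helper countBytes are made up for illustration. It prints both the Content-Length header and the number of bytes actually read, since getContentLength() returns -1 when the server does not report a length.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

class PageSize {
    // Reads the stream to the end and returns how many bytes came through.
    static long countBytes(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PageSize <url>");
            return;
        }
        URLConnection conn = new URL(args[0]).openConnection();
        try (InputStream in = conn.getInputStream()) {
            long counted = countBytes(in);
            // The header value may be -1 if the server did not send a length.
            System.out.println("Content-Length header: " + conn.getContentLength());
            System.out.println("bytes actually read:   " + counted);
        }
    }
}
```

Either number covers only the HTML document itself, not anything it references.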
Now the only problem here is that the size returned is just the size of the file containing the HTML text. However, the page can have lots of images in it referenced by <img src ="....">, which increase the actual size of the downloaded page.
Question 1: Is there any way to get the TOTAL SIZE of the page (which includes the size of all images as well)?
(Otherwise, the only option left for me is to parse the downloaded HTML text, search for <img src> and download each image separately using the image URL)
I am also timing the download, by starting a timer before establishing the connection and stopping it after reading in the input stream. Curiously, the first time I run the application with some URL, it gives a download time which looks real. But after that, for every subsequent run on the same URL, the download time drops to about 1/8th of that. I assumed some kind of caching was going on, so I set URLConnection.setUseCaches(false).
But to no avail. Unless I kill the application, and start it again, some kind of caching is going on, which gives a much smaller download time for every run on the same URL after that first one.
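One way to narrow this down is to time the connection setup and the actual read separately. The sketch below is only an illustration (the class DownloadTimer and its method are invented names), and it assumes, without proof from this thread, that the speed-up may come from things setUseCaches(false) does not control, such as the JVM reusing a kept-alive HTTP connection or remembering the DNS lookup.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

class DownloadTimer {
    // Returns { nanoseconds spent connecting, nanoseconds spent reading, bytes read }.
    static long[] timeDownload(URL url) throws IOException {
        long t0 = System.nanoTime();
        URLConnection conn = url.openConnection();
        conn.setUseCaches(false);
        conn.connect();                 // opens the underlying connection
        long t1 = System.nanoTime();

        long bytes = 0;
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                bytes += n;
            }
        }
        long t2 = System.nanoTime();
        return new long[] { t1 - t0, t2 - t1, bytes };
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DownloadTimer <url>");
            return;
        }
        // Repeat inside one JVM to compare the first run with later ones.
        for (int run = 1; run <= 3; run++) {
            long[] r = timeDownload(new URL(args[0]));
            System.out.println("run " + run + ": connect " + r[0] / 1000000 + " ms, read "
                    + r[1] / 1000000 + " ms, " + r[2] + " bytes");
        }
    }
}
```

If only the connect phase shrinks on later runs, the saving is in connection setup rather than in a cached copy of the page.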
Would somebody help me out and make my life easier?
the presence of image tags does not increase the size of a downloaded page.. even a web browser like IE must download the html, read through it, pull the links of the images, load the images, then display them in place... the presence of an img tag does not thus magically increase the size of the original document.
if you want to give a "total bytes left to download" counter then yes, you must pull all the img links, then fire off requests for them and add their size into the total.
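a rough sketch of the "add their size into the total" part, assuming you already have the list of resource URLs (the page plus its images). the class and method names are made up, and it simply downloads each resource and counts the bytes rather than trusting a Content-Length header, which a server may leave out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.List;

class TotalSize {
    // Downloads one resource and returns the number of bytes received.
    static long sizeOf(URL url) throws IOException {
        long bytes = 0;
        try (InputStream in = url.openConnection().getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                bytes += n;
            }
        }
        return bytes;
    }

    // Sums the sizes of every resource in the list.
    static long totalSize(List<URL> resources) throws IOException {
        long total = 0;
        for (URL u : resources) {
            total += sizeOf(u);
        }
        return total;
    }
}
```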
did you ever notice that IE only says "20 items left to download" - it doesn't say their sizes? that's because it knows from the original document that there were 20 images, but it doesn't know the sizes
there are more factors than just the downloading of the page.. the time it takes to start java drops too, because your computer is doing caching anyway.. everything these days has caches.. you have a web cache, your network card has a buffer, the hard disk has a cache, and windows caches the hard disk again in memory..
too much cold hard cache.. and not enough cold hard cash..
Thanks CJ. Guess, I'll have to sweat it out.
By the way, I guess I wasn't very clear about my problem. I know the presence of an <img> tag does not increase the size of the page. (C'mon man, I might type slow but I ain't dumb.) My question was: does Java provide a method which can parse the downloaded HTML code for me, pick out the image URLs and download them as well? So that the final downloaded document that I have includes everything.
Guess, Java doesn't.
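For what it's worth, the JDK has no one-call "download the page with its images" method that I know of, but its Swing HTML parser (javax.swing.text.html.parser.ParserDelegator) can at least pick out the <img> tags. A minimal sketch under that assumption; the class ImageLinks and the method imageUrls are invented for illustration, and the parser only understands fairly old HTML, so treat it as a starting point.

```java
import java.io.IOException;
import java.io.Reader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ImageLinks {
    // Parses the HTML and returns the absolute URL of every <img src="...">,
    // resolving relative paths against the page's own URL.
    static List<URL> imageUrls(Reader html, final URL base) throws IOException {
        final List<URL> found = new ArrayList<URL>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.IMG) {
                    return;
                }
                Object src = attrs.getAttribute(HTML.Attribute.SRC);
                if (src == null) {
                    return;
                }
                try {
                    found.add(new URL(base, src.toString()));
                } catch (MalformedURLException e) {
                    // Ignore src values that cannot be turned into a URL.
                }
            }
        };
        // 'true' tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(html, callback, true);
        return found;
    }
}
```

Each returned URL still has to be fetched separately to learn its size.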
(On a philosophical note, cache can't buy you happiness. Neither can cash.)
Regarding the cache problem, I checked out the Sun forums. There were quite a few posts there describing the same problem. No one had a solution.
This thread for instance:
One of the posters offered the following resolution:
URLConnection connect = myurl.openConnection();
I tried it, but it still doesn't work.
Come on Java gurus... help me out with this. How do I get rid of this caching.
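In case it helps anyone landing here, a hedged sketch of the knobs I know of. setUseCaches(false) and the request headers ask for a fresh copy of the document; the http.keepAlive system property is a documented JDK networking property that stops the JVM from reusing an open HTTP connection. Whether connection reuse is really what shortens the later runs is an assumption, not something confirmed in this thread, and the class and method names are made up.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

class NoCacheFetch {
    // Call once, before the first HTTP request, so later requests
    // open a new connection instead of reusing a kept-alive one.
    static void disableConnectionReuse() {
        System.setProperty("http.keepAlive", "false");
    }

    // Opens (but does not yet connect) a connection that asks for an uncached copy.
    static URLConnection openUncached(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setUseCaches(false);
        conn.setRequestProperty("Cache-Control", "no-cache");
        conn.setRequestProperty("Pragma", "no-cache");
        return conn;
    }
}
```

Even with all of this, DNS lookups and JVM warm-up can still make the first request in a process slower than the rest.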