I am interested in crawling websites automatically. Ideally, I would be able to get an image capture and the HTML source of each page every few minutes.

However, some sites seem to resist asynchronous HTTP transfers. One example is the Yahoo front page. If you request it asynchronously, for example from a Java program using a method like 'readRawSource' or 'loadStrings', the response you get back is never what is showing on the web page in a browser at the time of the request.
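
To make the kind of request described above concrete, here is a minimal sketch in plain Java. The class name RawSourceFetcher, the helper names readLines and fetchRawSource, the ten-second timeouts and the example.com default URL are all hypothetical, chosen for illustration; this is not the actual 'readRawSource' or 'loadStrings' implementation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RawSourceFetcher {

    // Hypothetical helper: reads a stream line by line, similar in spirit
    // to what a 'loadStrings'-style method returns.
    public static List<String> readLines(InputStream in) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Hypothetical helper: fetches the markup exactly as the server sends it.
    // No scripts are executed, so nothing a browser would add later is included.
    public static List<String> fetchRawSource(String url) {
        try {
            URLConnection conn = URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(10_000); // assumed timeout, in milliseconds
            conn.setReadTimeout(10_000);    // assumed timeout, in milliseconds
            try (InputStream in = conn.getInputStream()) {
                return readLines(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder URL; pass the real target as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        for (String line : fetchRawSource(url)) {
            System.out.println(line);
        }
    }
}
```

A fetch like this returns only the bytes the server sends for that one URL. Anything the page's own scripts load or build after it opens in a browser is absent from the result, which may be one reason the saved source differs from what is on screen.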

Is it simply impossible to crawl a website like this? Or can a Java applet perform some kind of browser emulation that makes a synchronous request to the problem website, then saves both the source and an image capture?