Getting the content of HTML tags with Java
Hello... Two questions, really.
First off, I have a Java Document object representing an HTML page, say:
<title>Dumb HTML File</title>
<a href="http://tacorner.com/tsunami/dumb.html">Dumb link to self</a>
I want my program to get the text between the <title> and </title> tags, so that I know what the page's title is.
What I CAN do so far is get an Element object (javax.swing.text.Element) representing the <title> tag. I just cannot get the text between the <title> and </title> tags. I've spent a modest three hours digging through the API and trying stuff to no avail.
How do I access the text between HTML tags with Java? Is it possible with the Document interface and javax.swing.text.Element? If not, what do I need to do?
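For the title specifically, one approach that may work without touching Element at all: when Swing's HTML parser reads a page into an HTMLDocument, it stores the <title> text as a document property under the key Document.TitleProperty. The sketch below assumes the page is parsed with HTMLEditorKit; the class name TitleSketch and the method titleOf are made up for illustration, and the sample markup is the page from the question.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class, not part of the original program
class TitleSketch {
    /** Parses the given HTML and returns the text of its <title> tag. */
    static String titleOf(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Avoids a ChangedCharSetException when the page declares a charset in a <meta> tag
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            // kit.read() parses synchronously, so the document is filled in when it returns
            kit.read(new StringReader(html), doc, 0);
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("Could not parse HTML", e);
        }
        // The parser records the <title> text as a document property
        return (String) doc.getProperty(Document.TitleProperty);
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Dumb HTML File</title></head><body>"
                + "<a href=\"http://tacorner.com/tsunami/dumb.html\">Dumb link to self</a>"
                + "</body></html>";
        System.out.println(titleOf(html));
    }
}
```

If the Document already comes from a JTextPane, calling getProperty(Document.TitleProperty) on it directly should work the same way once the page has finished loading.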
Second question. I've got a class extending JTextPane and displaying an HTML file. What I'd like to be able to do is call getDocument() or getStyledDocument() and get an object to work with. However, every call to those methods returns a default, empty document with none of the page's information.
There is one way to get a filled-in Document object, though: I have to have the HTML page link to itself. Then, once that link is clicked and the page navigates back to itself, the getDocument() and getStyledDocument() calls will give me a complete Document object.
Any ideas why that is, and what I can do to have getDocument() and getStyledDocument() work on the FIRST call?
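A likely explanation, assuming the page is loaded with setPage(): JEditorPane (which JTextPane extends) normally loads HTML on a background thread, so getDocument() called right after setPage() returns a document that has not been filled in yet. Two possible workarounds are sketched below: wait for the "page" property change that is fired when loading completes, or parse the HTML yourself and hand the finished document to the pane. The class PageLoadSketch and its method names are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.JEditorPane;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class, not part of the original program
class PageLoadSketch {
    /** Option 1: run a callback once setPage() has finished loading. */
    static void onPageLoaded(JEditorPane pane, Runnable callback) {
        // JEditorPane fires a "page" property change when loading completes
        pane.addPropertyChangeListener("page", event -> callback.run());
    }

    /** Option 2: parse the HTML yourself; this blocks until the document is complete. */
    static HTMLDocument loadNow(Reader in) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            kit.read(in, doc, 0);
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("Could not parse HTML", e);
        }
        return doc;
    }

    public static void main(String[] args) {
        HTMLDocument doc = loadNow(new StringReader("<html><body><p>Hello</p></body></html>"));
        System.out.println("Characters in document: " + doc.getLength());
        // In the window class, usage might look like this (pane is the JTextPane):
        //   onPageLoaded(pane, () -> handle(pane.getDocument()));  // register before setPage(url)
        // or:
        //   pane.setEditorKit(new HTMLEditorKit());
        //   pane.setDocument(loadNow(reader));
    }
}
```

With option 1, the listener has to be registered before setPage() is called, and any getDocument() work moves into the callback. With option 2, links in the page are not resolved against a base URL unless you call doc.setBase(url) yourself.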
Source Code: http://tsunami.tacorner.com/src/TsunamiWindow.java
Thanks in advance.
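Here's an example that gets all the links.
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import javax.swing.text.AttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
URL url = new URL("http://something.com");
URLConnection connection = url.openConnection();
Reader reader = new InputStreamReader(connection.getInputStream());
HTMLEditorKit kit = new HTMLEditorKit();
HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
// Uncomment this line in case you get character set issues
// doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
kit.read(reader, doc, 0);
// Walk every <a> tag (hyperlink); the iterator becomes invalid after the last one
for (HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A); it.isValid(); it.next()) {
    AttributeSet attrs = it.getAttributes();
    // Get the HREF attribute value of this <a> tag
    String link = (String) attrs.getAttribute(HTML.Attribute.HREF);
    System.out.println(link);
}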
That's nice, but that gets the stuff inside the href="", which is actually part of the <a> tag, a la
<a href="http://somelink.com">Linked text</a>
What I need to do is pull out "Linked text".
That I'm actually not sure about. You'd think it'd be as simple as:
I thought HTML.Tag things were constants, though... how would it differentiate between multiple links?
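On both points, a possible sketch: HTML.Tag.A is indeed a constant that only names the kind of tag, while the HTMLDocument.Iterator returned by getIterator() steps through each occurrence of that tag in the document. For every occurrence it exposes the attributes plus a start and end offset, and the text between those offsets is the linked text. The class LinkTextSketch and the method links are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class, not part of the original program
class LinkTextSketch {
    /** Returns one "href -> linked text" entry per <a> tag in the given HTML. */
    static List<String> links(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        List<String> result = new ArrayList<>();
        try {
            kit.read(new StringReader(html), doc, 0);
            // Each step of the iterator is one <a> occurrence in the document
            for (HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A); it.isValid(); it.next()) {
                AttributeSet attrs = it.getAttributes();
                String href = (String) attrs.getAttribute(HTML.Attribute.HREF);
                // The offsets mark where this link's text sits in the document
                int start = it.getStartOffset();
                int end = it.getEndOffset();
                String text = doc.getText(start, end - start).trim();
                result.add(href + " -> " + text);
            }
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("Could not parse HTML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><body><p><a href=\"http://somelink.com\">Linked text</a> and "
                + "<a href=\"http://tacorner.com/tsunami/dumb.html\">Dumb link to self</a></p></body></html>";
        for (String entry : links(html)) {
            System.out.println(entry);
        }
    }
}
```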