Getting the content of HTML tags with Java



Thread: Getting the content of HTML tags with Java

  1. #1
    Join Date
    May 2005
    Posts
    22

    Getting the content of HTML tags with Java

    Hello... Two questions, really.

    First off, I have a Java Document object representing an HTML page, say,

    Code:
    <html>
    <head>
    <title>Dumb HTML File</title>
    </head>
    
    <body bgcolor="#FFFFFF">
    <a href="http://tacorner.com/tsunami/dumb.html">Dumb link to self</a>
    </body>
    
    
    </html>
    I want my program to get the text between the <title> and </title> tags, so that it knows what the page's title is.

    What I CAN do so far is get an Element object (javax.swing.text.Element) representing the <title> tag. I just cannot get the text between the <title> and </title> tags. I've spent a modest three hours digging through the API and trying stuff to no avail.

    How do I access the text between HTML tags with Java? Is it possible with the Document interface and javax.swing.text.Element? If not, what do I need to do?
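    One possible route to the title, sketched under assumptions: when HTMLEditorKit parses a page into an HTMLDocument, the <title> text is stored as a document property under the key Document.TitleProperty, so it can be read with getProperty instead of through an Element. The class name TitleExtractor and the method titleOf are made up for illustration, and the HTML is parsed from a string here rather than taken from a live page.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class TitleExtractor {

    // Hypothetical helper: parses the given HTML and returns the <title> text,
    // or null if the page has no title.
    public static String titleOf(String html) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();

        // Avoids a ChangedCharSetException if the page declares a charset in a <meta> tag
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        // Synchronous parse; the document is fully populated when read() returns
        kit.read(new StringReader(html), doc, 0);

        // The parser records the title as a document property
        Object title = doc.getProperty(Document.TitleProperty);
        return title == null ? null : title.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head><title>Dumb HTML File</title></head>"
                + "<body bgcolor=\"#FFFFFF\"><p>Body text</p></body></html>";
        System.out.println(titleOf(html));
    }
}
```

    The same getProperty call should also work on a document obtained from getDocument(), provided that document has finished loading.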

    Second question. I've got a class extending JTextPane and displaying an HTML file. What I'd like to be able to do is call getDocument() or getStyledDocument() and get an object to work with. However, every call to those methods returns a default, empty document with none of the page's information.

    There is one way to get a filled-in Document object, though: I have to have the HTML page link to itself. Then, once that link is clicked and the page navigates back to itself, the getDocument() and getStyledDocument() calls will give me a complete Document object.

    Any ideas why that is, and what I can do to have getDocument() and getStyledDocument() work on the FIRST call?
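    A possible explanation, which is an assumption since it depends on how TsunamiWindow loads the page: JEditorPane.setPage can load HTML on a background thread, so a getDocument() call made right afterwards may see a document that is still empty. The pane fires a "page" property change once loading has finished, so one option is to wait for that event with addPropertyChangeListener. Another option, sketched below, is to parse the page synchronously and then install the finished document. The names SyncHtmlLoader, readHtml and show are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.JEditorPane;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class SyncHtmlLoader {

    // Parses the HTML from the reader and returns only after the document is complete.
    public static HTMLDocument readHtml(Reader in) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();

        // Avoids a ChangedCharSetException if the page declares a charset in a <meta> tag
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        kit.read(in, doc, 0);
        return doc;
    }

    // Installs an already-parsed document into a pane (a JTextPane also qualifies).
    // Call this on the event dispatch thread in a real application.
    public static void show(JEditorPane pane, HTMLDocument doc) {
        pane.setEditorKit(new HTMLEditorKit());
        pane.setDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        HTMLDocument doc = readHtml(new StringReader(
                "<html><head><title>Dumb HTML File</title></head>"
                + "<body><p>Hello</p></body></html>"));
        // The document already has content before it is ever displayed
        System.out.println(doc.getLength() > 0);
    }
}
```

    With this approach, getDocument() returns the same filled-in object that was passed to show. Relative links and images may also need doc.setBase(url) to resolve correctly.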


    Source Code: http://tsunami.tacorner.com/src/TsunamiWindow.java

    Thanks in advance.

  2. #2
    Join Date
    Mar 2004
    Posts
    635
    Here's an example to get all the links.

    Code:
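    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.net.URL;
    import java.net.URLConnection;
    import javax.swing.text.AttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLDocument;
    import javax.swing.text.html.HTMLEditorKit;

    public class LinkLister {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://something.com");
            URLConnection connection = url.openConnection();

            HTMLEditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();

            // Uncomment this line in case you get character set issues
            // doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

            // Parse the page into the document; the reader is closed afterwards
            try (Reader reader = new InputStreamReader(connection.getInputStream())) {
                kit.read(reader, doc, 0);
            }

            // Iterate over all <a> tags (hyperlinks)
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
            while (it.isValid()) {
                AttributeSet attrs = it.getAttributes();

                // Get the HREF attribute value of the <a> tag (null if absent)
                String link = (String) attrs.getAttribute(HTML.Attribute.HREF);
                System.out.println(link);

                it.next();
            }
        }
    }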

  3. #3
    Join Date
    May 2005
    Posts
    22
    That's nice, but that gets the stuff inside the href="", which is actually part of the <a> tag, a la

    <a href="http://somelink.com">Linked text</a>

    What I need to do is pull out "Linked text".

  4. #4
    Join Date
    Mar 2004
    Posts
    635
    That I'm not actually sure about. You'd think it'd be as simple as the following, but as far as I know HTMLDocument has no such method:

    doc.getTagValue(HTML.Tag.A)

  5. #5
    Join Date
    May 2005
    Posts
    22
    I thought HTML.Tag things were constants, though... how would it differentiate between multiple links?
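    A minimal sketch of one way to get the text of each link, assuming the same HTMLDocument.Iterator as in the earlier example: the iterator visits each <a> run separately, and getStartOffset() and getEndOffset() give the range of document text that run covers, so doc.getText can pull it out. The constant HTML.Tag.A only selects the kind of tag; the iterator position is what tells multiple links apart. The names LinkTextExtractor and linkTexts are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class LinkTextExtractor {

    // Hypothetical helper: returns the visible text of every <a> run in the HTML.
    public static List<String> linkTexts(String html) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();

        // Avoids a ChangedCharSetException if the page declares a charset in a <meta> tag
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(new StringReader(html), doc, 0);

        List<String> texts = new ArrayList<>();
        for (HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A); it.isValid(); it.next()) {
            // Offsets are positions in the document's text, not in the HTML source
            int start = it.getStartOffset();
            int end = it.getEndOffset();

            // trim() drops any surrounding whitespace included in the run
            texts.add(doc.getText(start, end - start).trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>"
                + "<a href=\"http://somelink.com\">Linked text</a> and "
                + "<a href=\"http://otherlink.com\">Other text</a>"
                + "</p></body></html>";
        for (String text : linkTexts(html)) {
            System.out.println(text);
        }
    }
}
```

    Note that getIterator is only usable for non-block tags such as <a>; for the page title, the Document.TitleProperty document property is the more likely route.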
