StreamTokeniser trouble




Thread: StreamTokeniser trouble

  1. #1
    Join Date
    Nov 2004
    Posts
    9

    Question StreamTokeniser trouble

    Hi all,

    I'm trying to build a text classification program that reads in a text file, tokenises it and stores the tokens in a HashMap. Along the way I would like to instantiate another class which reads a text file full of common words, and then remove those common words from the initial HashMap. My problems at the moment are that I can't instantiate the ArrayList and can't remove the common words from the HashMap. I have set up two classes, one called CommonWords.java and the other called Train.java.

    Train.java
    Code:
    import java.io.*; 
    import java.util.*;
    import javax.swing.JOptionPane;
    import java.util.ArrayList;
    
    public class Train {
        
        public FileReader file;
        public StreamTokenizer st;
        public HashMap counts = new HashMap();
       //public HashMap cwords = new HashMap();
       //public Hashtable cword;
        
    /** Creates a new instance of train */
    public Train() {
        
           ArrayList cw = new ArrayList();
            
           String fileName = "SPAM.txt";          
           String data;       
           counts = new HashMap(); 
           //cwords = new HashMap(); 
           int tokenType = 0;
           int numberOfTokens = 0;
           
           try 
           {           
            FileReader file = new FileReader(fileName);
            StreamTokenizer st = new StreamTokenizer(new BufferedReader(file));
            
            st.ordinaryChar('!');
            st.ordinaryChar('$');
            st.ordinaryChar('"');
            st.whitespaceChars('/','/');
            st.whitespaceChars('\\','\\');
            st.whitespaceChars('.','.');
            st.whitespaceChars(',',',');
            st.whitespaceChars(';',';');
            st.whitespaceChars(':',':');
            st.whitespaceChars('=','=');
            st.whitespaceChars('\'','\'');
            st.whitespaceChars('`','`');
            st.whitespaceChars('[',']');      
           
         while(st.nextToken() != StreamTokenizer.TT_EOF) {
             
           String s;
           
           switch(st.ttype) {
                       
             case StreamTokenizer.TT_WORD:
               s = st.sval; // Already a String
               
              // System.out.println("Token Extracted = " + st.sval);
               st.lowerCaseMode(true);
               numberOfTokens++;
               break;
               
             default: 
               s = String.valueOf((char)st.ttype);
               
           }
           
           if(counts.containsKey(s))
             ((Counter)counts.get(s)).increment();
           else
             counts.put(s, new Counter());
         }
           }     
       catch(IOException e) 
       {
         System.out.println("st.nextToken() unsuccessful");
         System.out.println("Problem reading " + fileName );
         System.out.println("Exception: " + e);
         e.printStackTrace();
       }
        //create instance of commonwords
           CommonWords cwords = new CommonWords();  
           //print arraylist
           System.out.println(cwords);   
           //remove arraylist words from hashmap counts
           counts.remove(cwords);       
              
         }
    
    public static void main(String[] args) 
     {
         Train t = new Train();   
     }    
    }
    CommonWords.java
    Code:
    import java.io.*; 
    import java.util.*;
    import javax.swing.JOptionPane;
    import java.util.ArrayList;
    
    public class CommonWords 
    {   
        public FileReader file;
        public StreamTokenizer stc;      
        
    public CommonWords() {   
        
            ArrayList cwords = new ArrayList();
            
            String fileName = "cwords.txt";         
            int tokenType = 0;
            int numberOfTokens = 0;         
            
         try
         {
             Reader cWords = new BufferedReader(new FileReader(fileName));            
             StreamTokenizer stc = new StreamTokenizer(cWords);      
    
             while(stc.nextToken() != StreamTokenizer.TT_EOF)
             {
                   String s;         
                   switch(stc.ttype) 
                   {                   
                        case StreamTokenizer.TT_WORD:
                        
                        s = stc.sval;                     
                        cwords.add(s);                          
                        numberOfTokens++;                     
                        break;
               
                        default: 
                        s = String.valueOf((char)stc.ttype);           
                   }                 
             }
         }
            catch(IOException e) 
               {
                 System.out.println("st.nextToken() unsuccessful");
                 System.out.println("Problem reading " + fileName );
                 System.out.println("Exception: " + e);
                 e.printStackTrace();
               }         
            System.out.println(cwords);
        } 
    }
    I'm not sure whether the remove statement is going to work now, as counts is a HashMap and cwords is an ArrayList of String objects.
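    For reference, counts.remove(cwords) won't do what you want: HashMap.remove takes a single key, so passing the whole ArrayList just looks for a key that *is* that list, and never finds one. One way to strip every common word in one call is removeAll on the map's key set, roughly like this (a sketch, assuming counts maps words to Counter objects, here stood in for by Integers, and cwords is a list of Strings):

    ```java
    import java.util.*;

    public class RemoveCommonWords {
        public static void main(String[] args) {
            // Word counts built by the tokeniser (Integers stand in for Counter objects)
            HashMap counts = new HashMap();
            counts.put("the", Integer.valueOf(3));
            counts.put("viagra", Integer.valueOf(1));
            counts.put("and", Integer.valueOf(2));
            counts.put("offer", Integer.valueOf(1));

            // Common words loaded by CommonWords
            ArrayList cwords = new ArrayList();
            cwords.add("the");
            cwords.add("and");

            // removeAll on the key set deletes every matching entry from the map itself
            counts.keySet().removeAll(cwords);

            System.out.println(counts.keySet()); // only "viagra" and "offer" remain
        }
    }
    ```

    The key-set view is backed by the map, so removing keys from it removes the corresponding entries from the map.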

    Any ideas?

    Thanks

  2. #2
    Join Date
    Nov 2004
    Location
    Norway
    Posts
    1,560
    This does not compile as first posted: the character literal '\' must be written as '\\', and ''' as '\'', plus some other stuff. I also can't see why you go to such lengths parsing away a bunch of characters; as far as I can gather they will never get a match in your logic anyway.
    Also, counts is assigned twice in Train (at the field declaration and again in the constructor), and file and st are re-declared as locals inside the try block, shadowing the fields, so the fields are never set.
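    To illustrate the shadowing pitfall: declaring a variable again inside the constructor creates a brand-new local, and the field keeps its default value. A minimal sketch (hypothetical ShadowDemo class, not from the thread):

    ```java
    public class ShadowDemo {
        public String file; // field, like 'file' and 'st' in Train

        public ShadowDemo() {
            // This declares a NEW local variable; the field above is never assigned
            String file = "SPAM.txt";
        }

        public static void main(String[] args) {
            ShadowDemo d = new ShadowDemo();
            System.out.println(d.file); // prints null, not "SPAM.txt"
        }
    }
    ```

    Dropping the type from the line inside the constructor (file = "SPAM.txt";) would make it an assignment to the field instead.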

    I have boiled the logic down to this; I haven't tested it yet, though...

    Code:
    import java.io.*;
    import java.util.*;
    import javax.swing.JOptionPane;
    import java.util.ArrayList;
    
    public class Train {
      public HashSet unCommonHS = new HashSet();
      public Train() {
        String fileName = "SPAM.txt";
        String aLine;
    CommonWords cm = new CommonWords();
    HashSet cwSet = cm.getCommonWordsHS();
    if (cwSet == null) return; // bail out if the common-word file could not be read
        try {
          BufferedReader bR = new BufferedReader(new FileReader(fileName));
          while((aLine=bR.readLine())!= null) {
            checkForCommonWord (aLine, cwSet, unCommonHS);
          }
          bR.close();
          Object [] ucw=unCommonHS.toArray();
          // report uncommon words, 10 words per line
          System.out.println("List of uncommon words in "+fileName);
          for (int i=0; i<ucw.length; i++) {
            System.out.print(ucw[i]+" ");
        if ((i % 10) == 9) System.out.println(); // newline after every 10th word
          }
        } catch(IOException e) {
          System.out.println("Train Failed");
          e.printStackTrace();
          return;
         }
      }
      /**
       * Here you could add the token logic, if required.
       * @param line
       * @param commonSet
       * @param uncommonHS
       */
      private void checkForCommonWord(String line, HashSet commonSet, HashSet uncommonHS) {
        StringTokenizer st=new StringTokenizer(line," ");
        while (st.hasMoreElements()) {
          Object ob=st.nextElement();
          if (!commonSet.contains(ob)) uncommonHS.add(ob);
        }
      }
      public static void main(String[] args) {
         Train t = new Train();
      }
    }
    
    
    class CommonWords {
      public CommonWords() {}
      public HashSet getCommonWordsHS () {
        HashSet cwSet = new HashSet();
        String fileName = "cwords.txt";
        try {
          BufferedReader bR = new BufferedReader(new FileReader(fileName));
          String aLine=null;
          while((aLine=bR.readLine())!= null) storeWord(cwSet, aLine);
          bR.close();
          return cwSet;
        } catch(IOException e)  {
         System.out.println("CommonWords Failed");
         e.printStackTrace();
         return null;
        }
      }
      private void storeWord(HashSet set, String line) {
        StringTokenizer st=new StringTokenizer(line," ");
        while (st.hasMoreElements()) {
          set.add(st.nextElement());
        }
      }
    
    }
    eschew obfuscation

  3. #3
    Join Date
    Nov 2004
    Posts
    9

    Thumbs up

    Hi,

    Sorry for not replying earlier; I haven't been able to find the time to look at this project for a couple of weeks. Thanks for the above, it has been a great help. Your logic is ten times better than my first attempt. I always thought that I didn't need to use a stream tokeniser.

    Great Help Thanks
