-
StreamTokeniser trouble
Hi all,
I'm trying to build a text classification program that reads in a text file, tokenises it and stores the tokens in a HashMap. Along the way I would like to instantiate another class which reads a text file full of common words. I would then like to remove these common words from the initial HashMap. The problems at the moment are that I can't instantiate the ArrayList and cannot remove the common words from the HashMap. I have set up two classes, one called CommonWords.java and the other called Train.java.
Train.java
Code:
import java.io.*;
import java.util.*;
import javax.swing.JOptionPane;
import java.util.ArrayList;
public class Train {
public FileReader file;
public StreamTokenizer st;
public HashMap counts = new HashMap();
//public HashMap cwords = new HashMap();
//public Hashtable cword;
/** Creates a new instance of train */
public Train() {
ArrayList cw = new ArrayList();
String fileName = "SPAM.txt";
String data;
counts = new HashMap();
//cwords = new HashMap();
int tokenType = 0;
int numberOfTokens = 0;
try
{
FileReader file = new FileReader(fileName);
StreamTokenizer st = new StreamTokenizer(new BufferedReader(file));
st.ordinaryChar('!');
st.ordinaryChar('$');
st.ordinaryChar('"');
st.whitespaceChars('/','/');
st.whitespaceChars('\\','\\');
st.whitespaceChars('.','.');
st.whitespaceChars(',',',');
st.whitespaceChars(';',';');
st.whitespaceChars(':',':');
st.whitespaceChars('=','=');
st.whitespaceChars('\'','\'');
st.whitespaceChars('`','`');
st.whitespaceChars('[',']');
while(st.nextToken() != StreamTokenizer.TT_EOF) {
String s;
switch(st.ttype) {
case StreamTokenizer.TT_WORD:
s = st.sval; // Already a String
// System.out.println("Token Extracted = " + st.sval);
st.lowerCaseMode(true);
numberOfTokens++;
break;
default:
s = String.valueOf((char)st.ttype);
}
if(counts.containsKey(s))
((Counter)counts.get(s)).increment();
else
counts.put(s, new Counter());
}
}
catch(IOException e)
{
System.out.println("st.nextToken() unsuccessful");
System.out.println("Problem reading " + fileName );
System.out.println("Exception: " + e);
e.printStackTrace();
}
//create instance of commonwords
CommonWords cwords = new CommonWords();
//print arraylist
System.out.println(cwords);
//remove arraylist words from hashmap counts
counts.remove(cwords);
}
public static void main(String[] args)
{
Train t = new Train();
}
}
CommonWords.java
Code:
import java.io.*;
import java.util.*;
import javax.swing.JOptionPane;
import java.util.ArrayList;
public class CommonWords
{
public FileReader file;
public StreamTokenizer stc;
public CommonWords() {
ArrayList cwords = new ArrayList();
String fileName = "cwords.txt";
int tokenType = 0;
int numberOfTokens = 0;
try
{
Reader cWords = new BufferedReader(new FileReader(fileName));
StreamTokenizer stc = new StreamTokenizer(cWords);
while(stc.nextToken() != StreamTokenizer.TT_EOF)
{
String s;
switch(stc.ttype)
{
case StreamTokenizer.TT_WORD:
s = stc.sval;
cwords.add(s);
numberOfTokens++;
break;
default:
s = String.valueOf((char)stc.ttype);
}
}
}
catch(IOException e)
{
System.out.println("st.nextToken() unsuccessful");
System.out.println("Problem reading " + fileName );
System.out.println("Exception: " + e);
e.printStackTrace();
}
System.out.println(cwords);
}
}
I'm not sure whether the remove statement is going to work now, as counts is a HashMap and cwords is an ArrayList of String objects.
Any ideas?
Thanks
-
This does not compile: the character '\' must be written as '\\', the character ''' must be written as '\'', and there is some other stuff. I can't see why you go to such lengths parsing away a bunch of characters; they will not get a match in your logic (as far as I can gather).
Also, I see you have defined 'counts' twice at different levels in Train.
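To illustrate the escaping point, here is a minimal sketch of a StreamTokenizer that treats the backslash and the single quote as whitespace. The class name EscapeDemo, the method words and the sample sentence are made up for the example; it also sets lowerCaseMode once before the loop, which is an assumption about what the original code intended.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class EscapeDemo {
    // Hypothetical helper: returns the lower-cased word tokens found in text.
    static List<String> words(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.lowerCaseMode(true);          // set once, before the first token is read
        st.whitespaceChars('\\', '\\');  // a backslash char literal is written '\\'
        st.whitespaceChars('\'', '\'');  // a single-quote char literal is written '\''
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);    // sval holds the current word token
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // The Java literal "Buy\\Now it's FREE" is the text: Buy\Now it's FREE
        System.out.println(words("Buy\\Now it's FREE"));
    }
}
```

Because the single quote is a quote character by default in StreamTokenizer, declaring it as whitespace also stops "it's" from starting a quoted string.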
I have boiled down the logic to this; I haven't tested it yet, though.
Code:
import java.io.*;
import java.util.*;
import javax.swing.JOptionPane;
import java.util.ArrayList;
public class Train {
public HashSet unCommonHS = new HashSet();
public Train() {
String fileName = "SPAM.txt";
String aLine;
CommonWords cm=new CommonWords();
HashSet cwSet=cm.getCommonWordsHS();
if (cwSet==null) return; // getCommonWordsHS() returns null when cwords.txt cannot be read
try {
BufferedReader bR = new BufferedReader(new FileReader(fileName));
while((aLine=bR.readLine())!= null) {
checkForCommonWord (aLine, cwSet, unCommonHS);
}
bR.close();
Object [] ucw=unCommonHS.toArray();
// report uncommon words, 10 words per line
System.out.println("List of uncommon words in "+fileName);
for (int i=0; i<ucw.length; i++) {
System.out.print(ucw[i]+" ");
if ((i+1)%10==0) System.out.println(); // line break after every 10th word
}
} catch(IOException e) {
System.out.println("Train Failed");
e.printStackTrace();
return;
}
}
/**
* Here you could add the token logic, if required.
* @param line
* @param commonSet
* @param uncommonHS
*/
private void checkForCommonWord(String line, HashSet commonSet, HashSet uncommonHS) {
StringTokenizer st=new StringTokenizer(line," ");
while (st.hasMoreElements()) {
Object ob=st.nextElement();
if (!commonSet.contains(ob)) uncommonHS.add(ob);
}
}
public static void main(String[] args) {
Train t = new Train();
}
}
class CommonWords {
public CommonWords() {}
public HashSet getCommonWordsHS () {
HashSet cwSet = new HashSet();
String fileName = "cwords.txt";
try {
BufferedReader bR = new BufferedReader(new FileReader(fileName));
String aLine=null;
while((aLine=bR.readLine())!= null) storeWord(cwSet, aLine);
bR.close();
return cwSet;
} catch(IOException e) {
System.out.println("CommonWords Failed");
e.printStackTrace();
return null;
}
}
private void storeWord(HashSet set, String line) {
StringTokenizer st=new StringTokenizer(line," ");
while (st.hasMoreElements()) {
set.add(st.nextElement());
}
}
}
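If the word counts from the original Train class are still wanted, the common words can be dropped from the map in one call. Note that counts.remove(cwords) only looks for a single key equal to the object passed in, so it would remove nothing. The sketch below is untested against the real SPAM.txt and cwords.txt; the class name RemoveCommonWords, the method countUncommon and the sample data are made up, and it assumes Java 8 or later (generics and Map.merge) in place of the Counter class.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RemoveCommonWords {
    // Hypothetical helper: counts every token, then removes the common words.
    static Map<String, Integer> countUncommon(String[] tokens, Set<String> commonWords) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);  // insert 1 or add 1 to the existing count
        }
        // keySet() is a live view of the map, so removing keys from it
        // removes the matching entries from counts as well.
        counts.keySet().removeAll(commonWords);
        return counts;
    }

    public static void main(String[] args) {
        String[] tokens = {"free", "the", "money", "free", "a"};
        Set<String> common = new HashSet<>(Arrays.asList("the", "a"));
        Map<String, Integer> counts = countUncommon(tokens, common);
        System.out.println("free  -> " + counts.get("free"));
        System.out.println("money -> " + counts.get("money"));
        System.out.println("the still present? " + counts.containsKey("the"));
    }
}
```

Keeping the common words in a HashSet rather than an ArrayList makes each lookup cheap, which matters once the word list grows.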
eschew obfuscation
-
Hi,
Sorry for not replying earlier. I haven't been able to find the time to look at this project for a couple of weeks. Thanks for the above; it has been a great help. Your logic is ten times better than my first attempt. I always thought that I didn't need to use a StreamTokenizer.
Great help, thanks.