Help on Java word search procedures
I am new to this as I have just registered.
But let me get to the reason why I am here. For my final year project at university, I have to develop a desktop document search tool with the use of Java as my programming language. As this is a search tool for documents, I need to be able to do a word search for files on a computer.
Can anyone help me in finding a search procedure that might be useful for my project? If you'd like more information on my project, please contact me. And any help is welcome.
So you want to build a data structure which contains all of the words in the text, with the word as key and location or environment as the value? Or are you going to search each time, scanning the text to find the pattern of interest?
If you are going to build a structure, look at tries or inverted file (the "Google structure").
If you need to look at a function to "pattern match" in the text - the KMP algorithm is fast and efficient and will allow you to find multiple matches.
You will be able to find lots of Internet help, as well as book references, about any one of these topics.
The basis of the program I have to develop is for a user to locate a document (word, notepad, pdf, etc..) file from within their large archive, by performing a search to locate the file (whether it is in the S or D drive (physical HDD)) and show the results of what they have.
However, the search has to scan through the document, if the file name they entered does not exist, so that the relative document can be shown... So therefore the search results have to be displayed in order of relavance (maybe done by 'counting' how many times that word appears in the document(?!)) and listed so the user can see what they have. And then from the results list, I have to make it possible for the user to open the file he would like to edit/view by a click of a button (ie the document will open up from the results list).
Thats the main jist of my project program.
So, break down the components of your program.
You need to access and read files on the disk. So you need a process to access the disk, identify folders, identify files, open them, read them, and move on to the next file, move on to the next folder, etc. ...
Once you have a file open, you are searching it for the pattern you're loking for - that is where the KMP algorithm will work for you
If you get a hit, you'll need to store that reference information in a data structure which will store enough information so that you'll be able to retrieve the file referred to.
Since you're looking for relevance, you then have an algorithm which measures relevance ... and you "filter" your results into a priority queue for your retrieval.
Sounds like a "practical" project which can demonstrate lots of state of the art application - and will be the result of your research, implementation, and testing.
Originally Posted by nspils
So what kind of algorithm is that?! I don't think I know of that one.
It is a "string matching" or "pattern matching" algorithm. Word search, DNA strand base-pair pattern matching, etc. (Knuth-Morris-Pratt is the full name, it is an extension of the Boyer-Moore algorithm which was an improvement over brute-force methods).
You can find lots of explanatory material on the internet - Google "KMP Java" and you'll find stuff. If you can get your hands on Cormen, Leiserson, et al.'s "Introduction to Algorithms" you'll find a discussion of this topic in Chapter 32.
will the KMP algorithm work for different lengths of stringed words? ie, find the exact word. and if so how?
i had a look at the visualisation of the algorithm and it seems to be a single whole string.. or can a whole document be represented as one long string with the spaces not present?
Last edited by kalps001; 10-21-2005 at 08:35 AM.
Properly implemented, it will find whatever sequence you are interested in - be it a run of letters and white space, a word, or a phrase.
Would the Lucene utility work also?!
Lucene is really for web-based information retrieval. I am actually doing the same project in a British university. I have a question: According to the book named Modern Information Retrieval, there are two options in searching. One is scaning the text sequentially and the other building data structures over the text (indices). I prefer the latter one and plan to apply inverted files to indexing. However, KMP algorithm is for pattern matching. Can it be used together with inverted files? If not, should I give up using inverted files?
Last edited by WXY595; 11-16-2005 at 02:19 PM.
The KMP algorithm is a "finding" algorithm. The indexing is building a data structure which will support very fast searches - another "finding" approach. They are two means to the same end.
It is my judgment that if you are going to index once and search much more frequently, the index process will be the more optimal approach. KMP is great for "on the fly" work.
Sorry, what do you mean by on the fly work? Can I use both KMP and inverted files?
Frankly speaking I'm not familiar with Java. So it's really a great challenge for me to do this project. I hope to apply inverted files and clustering approach but don't know how to realise them in Java. Do you have some sample code.
[my thought process here is not specific to Java - it applies tradeoffs of benefits and concerns which are addressed in many languages]
By "on the fly" I meant a process which involves "search for this word or phrase in this text and, if you find it, insert something else [or perform some other action]" which you do infrequently or seek a limited number of matches over the course of the program. The find process is an adjunct to the main work of the program, not the main function. You don't waste time, energy, or memory to create and store a separate structure to search - you search your text file.
The creation of the inverted file takes work and creates a structure which can be accessed over a long period of time. You do many searches, of many words or phrases or patterns, over the entirety of the and so the prime function is the "finding". The cumulative improved performance of each of your many searches justifies the time and power and memory which has been used to create and hold the inverted file.
Thus, my suggestion to you that while you can accomplish what you want with each, they each have an environment in which its benefits outweigh the costs of doing it the other way ...
There are many data structures you can use which can be implemented in any language, including Java: skip lists, tries, b-trees, hashMap, so many others. If you need to do this in Java, perhaps you should look for a good "Algorithms in Java" book or series - Robert Sedgwick's is good and much like his C++ series, as is Goodrich & Tamassa's "Data Structures and Algorithms in Java". [I think Goodrich's book is a better "Java" book because it has a focus on teaching good OOP implementation, including reuse and information hiding - the newest edition has been updated to incorporate the "new stuff" in Java 5.] Much of Goodrich's research (and that of his graduate students and assistants) focuses on large scale data structures and their algorithms, so his book capitalizes on that knowledge. You can look at all sorts of "genetic mapping" literature for algorithms they use -- after all, much of the mapping process is having a "probe" (a sequence of base pairs) and a "text" of genetic sequence, and you create a data structure of that genetic sequence and then try to match the probe (exactly or "best fit") to your genetic sequence.
Last edited by nspils; 11-17-2005 at 08:04 PM.
Would you suggest that I should build the Search Engine in Multi-thread? If so, how to do it?
how to build inverted files and implement semantic relatedness using java?
Last Post: 09-08-2005, 06:38 AM
By Rob Abbe in forum Talk to the Editors
Last Post: 01-13-2003, 02:57 PM
By Lori Piquet in forum Talk to the Editors
Last Post: 10-10-2002, 06:01 AM
By Shantanu in forum Java
Last Post: 12-06-2001, 01:33 PM
Last Post: 07-05-2001, 03:28 PM
Top DevX Stories
Easy Web Services with SQL Server 2005 HTTP Endpoints
JavaOne 2005: Java Platform Roadmap Focuses on Ease of Development, Sun Focuses on the "Free" in F.O.S.S.
Wed Yourself to UML with the Power of Associations
Microsoft to Add AJAX Capabilities to ASP.NET
IBM's Cloudscape Versus MySQL