Autogenerating text based on analyzed text
I'm posting here to see if anybody might be able to give me a push in the right direction. I have created an application that reads a text file and analyzes the statistics of character combinations within it. I store these statistics in a hashMap, with the character combo as the key and the number of times each key occurs as the value. So if a file contains the text "dentist", the hashMap stores de, en, nt, ti, is, st as keys and sets the value of each to 1.
I'm not quite sure how I'm gonna tackle this task, and would really appreciate any ideas you might have.
And simpler is better, of course
What part are you having a problem with?
Oops... seems I somehow left out some critical information. I have gotten the text-analysis part of my app to work. First the user must choose how many characters to store in each combination. So if a text file contains "ABCDEF", and the user chose to store 3 letters in each combo, the program stores these keys and values in the hashMap: ABC = 1, BCD = 1, CDE = 1, DEF = 1
and if a combination of letters is found more than once, this is reflected in the value in the hashMap. This all works.
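For readers following along, the analysis step described above can be sketched roughly as follows. This is only an illustration, not KevinBT's actual code: the class name NGramCounter and the method count are made up for the example, and it assumes the whole text is already in memory as a String.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the name and signature are assumptions for illustration.
class NGramCounter {
    // Slide a window of n characters over the text and count each combination.
    static Map<String, Integer> count(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            String combo = text.substring(i, i + n);
            // Insert 1 for a new combo, otherwise add 1 to the stored value.
            counts.merge(combo, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints the four combinations ABC, BCD, CDE, DEF, each with value 1
        // (HashMap iteration order is not guaranteed).
        System.out.println(count("ABCDEF", 3));
    }
}
```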
What I need some guidance with now is that I want the app to automatically generate a new text of user-desired length, based on the statistical occurrence of letter combinations. So if the analyzed text is in German, French or whatever, the generated text should resemble that same language. Make any sense?
Could really use some help with how I could go about doing this; thanks
Last edited by KevinBT; 04-22-2007 at 11:28 AM.
There is a vast difference between the two tasks. The clue for this level of complexity is here:
"the generated text should resemble that same language"
The first part is basically just ripping apart words according to the keyword-length parameter. BUT deciding which combinations of which parts will make up sensible, say, French-like and pronounceable words goes far beyond the scope of (almost) any programming language.
I can think of many ways to pick keywords and combine them into longer char sequences according to some random and statistical criteria, but the additional logic that is required to achieve that resemblance will require long reference lists of plausible/useful phonetics, specific to the language in question.
I doubt that this was very helpful, but it's as far as I can go.
Thanks....It's really not that important that the generated text resembles the analyzed language, I'm just making this for myself; school related. But you say you can think of many ways to make the program generate some kind of text based on the statistics I have saved in the hashMap. Any hints?
The hashMap is great for retrieval but how are you going to take advantage of your statistics? Are you keeping any "environment" information - do certain combinations occur only within a short distance - or a long distance - from other combinations? But for your reconstruction portion of the project it sounds to me that you would be thinking about using a priority queue, or even a splay tree, as a more effective/useful/meaningful storage for that "reconstruction" part.
Have you considered making a pool of combinations, in which the observed frequency of occurrence determines how many times each combination appears in the pool, and then recombining at random? For example, with a pool of 10^6 character combinations, if "ae" occurs 10% of the time, you'd put 10^5 "ae"s into the pool. You could also make your combinations into one long string and then break it at all the possible locations, except that a combination itself could not be broken [unless your analysis took out whitespace before determining character combination occurrences].
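A pool of 10^6 entries does not have to be materialized to get the same proportions. The sketch below swaps in a different technique than the one described: it keeps running totals of the counts in a TreeMap and looks up a random number against them, which gives each combination a chance proportional to its count. The class name CumulativePicker and its methods are made up for this example, and it assumes the counts are the non-negative values from the hashMap.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical class; an alternative to building a literal pool.
class CumulativePicker {
    // Key = running total after adding this combo's count (exclusive upper bound).
    private final TreeMap<Integer, String> ceilings = new TreeMap<>();
    private final int total;
    private final Random random;

    CumulativePicker(Map<String, Integer> counts, Random random) {
        int running = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() <= 0) continue; // skip combos that never occurred
            running += e.getValue();
            ceilings.put(running, e.getKey());
        }
        if (running == 0) {
            throw new IllegalArgumentException("no combination has a positive count");
        }
        this.total = running;
        this.random = random;
    }

    // Picks a combo with probability count / total.
    String pick() {
        int r = random.nextInt(total);          // 0 .. total-1
        return ceilings.higherEntry(r).getValue(); // first running total above r
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("ae", 10); // about 10% of picks
        counts.put("th", 90); // about 90% of picks
        CumulativePicker picker = new CumulativePicker(counts, new Random());
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20; i++) sb.append(picker.pick());
        System.out.println(sb);
    }
}
```

The trade-off is a slightly slower pick (a tree lookup instead of an array index) in exchange for memory that grows with the number of distinct combinations rather than with the total count.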
Are you doing any evaluation of what are "valid" combinations of characters - what are real phonemes for the language rather than character combinations? If you are truly doing language analysis, you should be looking for that kind of guidance and implementation.
Sounds like 10^6 monkeys with typewriters churning out Shakespeare ... or Hemingway.
Last edited by nspils; 04-22-2007 at 01:38 PM.
OK, here are some hints
1: The generating of words is done in an outer loop that produces one word on each pass. I.e. the number of times this loop is completed is the number of words in the resulting text.
2: Each word is built in an inner loop. The number of times this inner loop runs (i.e. how long the word gets) is a random number picked within reasonable limits, - you decide.
3: On each pass the inner loop calls a method that returns a keyword, which is appended to the current new word.
4: This method picks keywords randomly in a "weighted" way, and this is where the fun begins.
In order to do this you can use an array with keywords and pick the keywords using a randomly generated index. Check the Random class to see how you can ensure that the random index is not out of bounds, - higher than the array length-1.
But this random picking should be weighted also, - that is, a keyword with twice as many occurrences in the original text as another keyword should statistically have twice the chance of being picked, mkay ?
Here is one way to do that:
Get the sum of the occurrence values for all the keywords.
Allocate a String array with a size equal to that sum.
Then loop through the enumeration of the hashmap while at the same time you store the keyword values in the String array. The storing is done as follows (e.g.):
The storing loop reaches the keyword "CDE" with 25 occurrences. This keyword is then stored in (each of) the next 25 slots of the string array. The next string is "DEF" with 12 occurrences and the string "DEF" is stored in the next 12 slots of the string array...
If the sum mentioned above is, say, 2345, you then use the Random class to generate a random number between 0 and 2344 and use that as the picking index.
5: The word generating loop also performs sanity checks to avoid keyword combinations that lead to many consecutive vowels or consonants.
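Hints 1 to 5 above could be put together roughly like this. It is only a sketch under assumptions, not sjalle's code: the class and method names are made up, the limits (1 to 3 keywords per word, at most 3 consecutive vowels or consonants, 20 retries) are arbitrary, only a, e, i, o, u are treated as vowels, and the counts are assumed to be the non-negative values from the hashMap.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical class; all names and limits are assumptions for illustration.
class WeightedTextGenerator {
    final String[] pool; // each keyword repeated once per occurrence (hint 4)
    private final Random random;

    WeightedTextGenerator(Map<String, Integer> counts, Random random) {
        int sum = 0;
        for (int c : counts.values()) sum += c;
        if (sum == 0) throw new IllegalArgumentException("no occurrences to pick from");
        pool = new String[sum];
        int next = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) pool[next++] = e.getKey();
        }
        this.random = random;
    }

    // nextInt(pool.length) returns 0 .. length-1, so the index is never out of bounds.
    String pickKeyword() {
        return pool[random.nextInt(pool.length)];
    }

    // Longest run of consecutive vowels or consecutive consonants (hint 5).
    static int longestRun(String s) {
        int best = 0, run = 0, prevKind = -1; // 0 = vowel, 1 = consonant, -1 = other
        for (char ch : s.toLowerCase().toCharArray()) {
            int kind = !Character.isLetter(ch) ? -1 : ("aeiou".indexOf(ch) >= 0 ? 0 : 1);
            if (kind == -1) run = 0;
            else if (kind == prevKind) run++;
            else run = 1;
            best = Math.max(best, run);
            prevKind = kind;
        }
        return best;
    }

    // Inner loop (hints 2 and 3): append keywordCount keywords, retrying bad picks.
    String makeWord(int keywordCount, int maxRun) {
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < keywordCount; i++) {
            for (int attempt = 0; attempt < 20; attempt++) {
                String candidate = pickKeyword();
                if (longestRun(word + candidate) <= maxRun) {
                    word.append(candidate);
                    break;
                }
            }
        }
        return word.toString();
    }

    // Outer loop (hint 1): one word per pass, separated by spaces.
    String makeText(int wordCount, int minKeywords, int maxKeywords) {
        StringBuilder text = new StringBuilder();
        for (int w = 0; w < wordCount; w++) {
            int len = minKeywords + random.nextInt(maxKeywords - minKeywords + 1);
            if (w > 0) text.append(' ');
            text.append(makeWord(len, 3));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("CDE", 25); // fills 25 slots of the pool
        counts.put("DEF", 12); // fills the next 12 slots
        System.out.println(new WeightedTextGenerator(counts, new Random()).makeText(5, 1, 3));
    }
}
```

Note that the array costs one slot per occurrence, so for a large analyzed text it can get big; that is the price of the very simple picking step.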
Don't ask me to code some of this for you, cause I won't. From your post I assume that this is a walk in the park for you.
Good luck !
Last edited by sjalle; 04-22-2007 at 02:06 PM.
Reason: Typo typo
Thank you both....you've been very helpful
And don't worry; I won't ask you to code anything for me