Autogenerating text based on analyzed text



  1. #1
    Join Date
    Apr 2007
    Posts
    4

    Autogenerating text based on analyzed text

    Hi, all....

    I'm posting here to see if anybody might be able to give me a push in the right direction. I have created an application that reads a txt-file and analyzes the statistics of character-combinations within it. I store these statistics in a hashMap, with the character-combo as the key and the number of times each key occurs as the value. So if a file contains the text "dentist", the hashMap stores de, en, nt, ti, is, st as keys and sets the value of each to 1.

    I'm not quite sure how I'm gonna tackle this task, and would really appreciate any ideas you might have.
    And simpler is better, of course

    Thanks

  2. #2
    Join Date
    Dec 2004
    Location
    San Bernardino County, California
    Posts
    1,468
    What part are you having a problem with?

  3. #3
    Join Date
    Apr 2007
    Posts
    4
    Oops....seems I somehow left out some critical information. I have gotten the text analysis-part of my app to work. So first the user must choose how many characters to store in each combination...So if a text file contains: "ABCDEF", and the user chose to store 3 letters in each combo; the program stores these keys and values in the hashMap:

    KEY: VALUE:
    ABC 1
    BCD 1
    CDE 1
    DEF 1

    and if a combination of letters is found more than once, this is reflected in the value in the hashMap. This all works.
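
    The counting step described above could look roughly like the following. This is only a minimal sketch, not the poster's actual code: the class name NGramCounter and the method name countNGrams are made up for illustration, and it assumes the whole file has already been read into one String.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the names here are illustrative and not from the thread.
class NGramCounter {

    /** Counts every run of n consecutive characters in text. */
    static Map<String, Integer> countNGrams(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        Map<String, Integer> counts = new HashMap<>();
        // Slide a window of width n over the text, one character at a time.
        for (int i = 0; i + n <= text.length(); i++) {
            String key = text.substring(i, i + n);
            // merge() stores 1 for a new key, otherwise adds 1 to the old value.
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Holds ABC, BCD, CDE and DEF, each with value 1 (HashMap order may vary).
        System.out.println(countNGrams("ABCDEF", 3));
    }
}
```

    A text shorter than n simply produces an empty map, since no window fits.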

    What I need some guidance with now is that I want the app to automatically generate a new text of user-desired length, based on the statistical occurrence of letter combinations. So if the analyzed text is in German, French or whatever, the generated text should resemble that same language. Make any sense?

    Could really use some help with how I could go about doing this; thanks
    Last edited by KevinBT; 04-22-2007 at 10:28 AM.

  4. #4
    Join Date
    Nov 2004
    Location
    Norway
    Posts
    1,560
    There is a vast difference between the two tasks. The clue for this level of complexity is here:
    the generated text should resemble that same language
    The first part is basically just ripping apart words according to the keyword-length parameter. BUT, deciding which combinations of which parts will make up sensible, say, French-like and pronounceable words goes far beyond the scope of (almost) any programming language.
    I can think of many ways to pick keywords and combine them into longer char sequences according to some random & statistical criteria, but the additional logic that is required to achieve "resemblance" will require long reference lists of plausible/useful phonetics, specific to the language in question.

    I doubt that this was very helpful, but it's as far as I can go.
    eschew obfuscation

  5. #5
    Join Date
    Apr 2007
    Posts
    4
    Thanks....It's really not that important that the generated text resembles the analyzed language, I'm just making this for myself; school related. But you say you can think of many ways to make the program generate some kind of text based on the statistics I have saved in the hashMap. Any hints?

  6. #6
    Join Date
    Dec 2004
    Location
    San Bernardino County, California
    Posts
    1,468
    The hashMap is great for retrieval but how are you going to take advantage of your statistics? Are you keeping any "environment" information - do certain combinations occur only within a short distance - or a long distance - from other combinations? But for your reconstruction portion of the project it sounds to me that you would be thinking about using a priority queue, or even a splay tree, as a more effective/useful/meaningful storage for that "reconstruction" part.

    Have you considered making a pool of combinations, in which the observed frequency of each combination determines how many times it appears in the pool, and then recombining at random? For example, with a pool of 10^6 character combinations, if "ae" occurs 10% of the time, you'd put 10^5 copies of "ae" into the pool. You could also make your combinations into one long string and then break it at any of the possible locations, except that a combination itself could not be broken [unless your analysis took out whitespace before determining character occurrence combinations].
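
    The pool does not have to be built physically to get the same odds. As a hedged alternative (a different technique from the literal pool described above), the sketch below keeps a running total of the counts and binary-searches it, so each key is still drawn in proportion to its count. The class name CumulativePicker is hypothetical, and the map is assumed to be the combination-to-count map from the analysis step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical class; a stand-in for a materialized pool of combinations.
class CumulativePicker {
    private final List<String> keys = new ArrayList<>();
    private final int[] cumulative; // cumulative[i] = sum of counts of keys 0..i
    private final int total;
    private final Random random;

    CumulativePicker(Map<String, Integer> counts, Random random) {
        this.cumulative = new int[counts.size()];
        int running = 0;
        int i = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            keys.add(e.getKey());
            running += e.getValue();
            cumulative[i++] = running;
        }
        if (running <= 0) {
            throw new IllegalArgumentException("counts must sum to a positive number");
        }
        this.total = running;
        this.random = random;
    }

    /** Returns a key with probability count / total. */
    String pick() {
        int r = random.nextInt(total); // uniform in 0 .. total-1
        // Binary search for the first index whose cumulative sum exceeds r.
        int lo = 0;
        int hi = cumulative.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cumulative[mid] > r) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return keys.get(lo);
    }
}
```

    The trade-off is memory against simplicity: a real pool of 10^6 entries is trivial to index into, while the running totals need only one int per distinct combination.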

    Are you doing any evaluation of what are "valid" combinations of characters - what are real phonemes for the language rather than character combinations? If you are truly doing language analysis, you should be looking for that kind of guidance and implementation.

    Sounds like 10^6 monkeys with typewriters churning out Shakespeare ... or Hemingway.
    Last edited by nspils; 04-22-2007 at 12:38 PM.

  7. #7
    Join Date
    Nov 2004
    Location
    Norway
    Posts
    1,560

    OK, here are some hints

    1: The generating of words is done in a loop that has generated one word on each completion. I.e. the number of times this loop is completed is the number of words in the resulting text.

    2: The number of times this loop loops (i.e. how long this word gets) is a random number picked within reasonable limits, - you decide.

    3: On each rotation this loop calls a method that will return a keyword that is appended to the current new word.

    4: This method picks keywords randomly in a "weighted" way, and this is where the fun begins.

    In order to do this you can use an array with keywords and pick the keywords using a randomly generated index. Check the Random class to see how you can ensure that the random index is not out of bounds, i.e. not higher than the array length-1.

    But this random picking should also be weighted. That is, a keyword with twice as many occurrences in the original text as another keyword should statistically get twice as many hits as the latter, mkay?

    Here is one way to do that:

    Get the sum of the occurrence values for all the keywords.
    Allocate a String array with a size equal to that sum.
    Then loop through the enumeration of the hashmap while at the same time you store the keyword values in the String array. The storing is done as follows (e.g.):
    The storing loop reaches the keyword "CDE" with 25 occurrences. This keyword is then stored in (each of) the next 25 slots of the string array. The next string is "DEF" with 12 occurrences and the string "DEF" is stored in the next 12 slots of the string array...

    If the sum mentioned above is, say, 2345, you then use the Random class to generate a random number between 0 and 2344 and use that for picking-index.

    5: The word generating loop also performs sanity checks to avoid keyword combinations that lead to many consecutive vowels or consonants.
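
    Hints 1 to 5 above can be sketched as follows. Treat this as one possible reading under stated assumptions, not the definitive version: the class and method names are made up, the limits (1 to 3 keywords per word, at most 3 vowels or consonants in a row, 100 retries per keyword) are arbitrary placeholders, and only a, e, i, o, u count as vowels.

```java
import java.util.Map;
import java.util.Random;

// Hypothetical class; all names and limits are illustrative assumptions.
class WordGenerator {

    /** Hint 4: each keyword fills as many array slots as it has occurrences. */
    static String[] buildPool(Map<String, Integer> counts) {
        int sum = 0;
        for (int c : counts.values()) {
            sum += c;
        }
        String[] pool = new String[sum];
        int next = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            for (int k = 0; k < e.getValue(); k++) {
                pool[next++] = e.getKey();
            }
        }
        return pool;
    }

    static boolean isVowel(char c) {
        return "aeiouAEIOU".indexOf(c) >= 0;
    }

    /** Hint 5: length of the longest run of consecutive vowels or consonants. */
    static int longestRun(String s) {
        int best = 0;
        int run = 0;
        boolean prevVowel = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isLetter(c)) { // non-letters end the current run
                run = 0;
                continue;
            }
            boolean vowel = isVowel(c);
            run = (run > 0 && vowel == prevVowel) ? run + 1 : 1;
            prevVowel = vowel;
            best = Math.max(best, run);
        }
        return best;
    }

    /** Hints 2 and 3: append randomly picked keywords, skipping picks that fail the check. */
    static String generateWord(String[] pool, int keywordCount, int maxRun, Random random) {
        StringBuilder word = new StringBuilder();
        int appended = 0;
        int attempts = 0;
        while (appended < keywordCount && attempts < 100 * keywordCount) {
            attempts++;
            // nextInt(pool.length) is always within 0 .. pool.length-1.
            String candidate = pool[random.nextInt(pool.length)];
            if (longestRun(word + candidate) <= maxRun) {
                word.append(candidate);
                appended++;
            }
        }
        return word.toString();
    }

    /** Hint 1: one word per iteration, separated by single spaces. */
    static String generateText(Map<String, Integer> counts, int wordCount, Random random) {
        String[] pool = buildPool(counts);
        if (pool.length == 0) {
            throw new IllegalArgumentException("no keywords to pick from");
        }
        StringBuilder text = new StringBuilder();
        for (int w = 0; w < wordCount; w++) {
            if (w > 0) {
                text.append(' ');
            }
            int keywordCount = 1 + random.nextInt(3); // 1 to 3 keywords per word
            text.append(generateWord(pool, keywordCount, 3, random));
        }
        return text.toString();
    }
}
```

    With the counts from the example (CDE 25 times, DEF 12 times) the pool has 37 slots, so nextInt(37) yields an index between 0 and 36. The retry cap only exists to stop the loop if every keyword in the pool would fail the sanity check.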


    Don't ask me to code some of this for you, cause I won't. From your post I assume that this is a walk in the park for you.

    Good luck !
    Last edited by sjalle; 04-22-2007 at 01:06 PM. Reason: Typo typo
    eschew obfuscation

  8. #8
    Join Date
    Apr 2007
    Posts
    4
    Thank you both....you've been very helpful
    And don't worry; I won't ask you to code anything for me
