Thread: Import flat file text

  1. #1
    Join Date
    Sep 2007
    Posts
    4

    Import flat file text

    I am trying to build an open-source text importer that will enable a user to import columnar non-delimited data from a flat file, make a guess at column boundaries, allow the user to adjust/change column boundaries visually, allow the user to specify field names and data types for columns, and allow the user to highlight and remove rows and columns they do not want imported. Then the data will be converted into a form suitable for use in a spreadsheet or database and saved in a new file.

    In essence, I'm trying to build a simpler form of the MS Access/Excel "get external data" tool as a Java class/library. Conceptually, this doesn't seem that difficult. The biggest problem I have at the moment is the block separator lines contained in the data files. These files are a result of web based database dumps. Many times there are several header lines that identify the database and such, and then separator lines (usually time or geography markings) heading up each block of columnar data. There doesn't seem to be an easy way to distinguish columnar data from non-columnar data aside from processing a dozen lines or so to guess at column boundaries.
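One way to make the "process a dozen lines or so" guess concrete is to look for character positions that are blank in every sampled line and treat each run of such positions as a gap between columns. The sketch below is only an illustration under that assumption; the class and method names (`ColumnGuesser`, `guessColumnStarts`) are hypothetical, and it assumes spaces (not tabs) pad the columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ColumnGuesser {

    /**
     * Returns the index at which each guessed column starts: every position
     * that is non-blank in at least one line and whose left neighbour is
     * blank in all lines (or that is position 0).
     */
    static List<Integer> guessColumnStarts(List<String> lines) {
        int width = 0;
        for (String line : lines) {
            width = Math.max(width, line.length());
        }

        // blank[i] stays true only if no sampled line has a non-space at index i
        boolean[] blank = new boolean[width];
        Arrays.fill(blank, true);
        for (String line : lines) {
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) != ' ') {
                    blank[i] = false;
                }
            }
        }

        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < width; i++) {
            boolean previousBlank = (i == 0) || blank[i - 1];
            if (!blank[i] && previousBlank) {
                starts.add(i);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "ab cdef gh ijklmno pqrs tuvw xyz",
                "ab cdef gh ijklmno pqrs tuvw xyz");
        System.out.println(guessColumnStarts(sample)); // prints [0, 3, 8, 11, 19, 24, 29]
    }
}
```

The guess is only as good as the sample: header or separator lines mixed into the sampled lines will fill in the gaps, so they would need to be excluded first or the result corrected by the user.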

    I'd rather not reinvent the wheel if you know of any open-source tools that can perform any of the functions I've mentioned. I'm also open to any suggestions regarding a direction to take in terms of design. I'm also looking for some suggestions regarding what package(s) to use to build the GUI.

    Thanks in advance.

  2. #2
    Join Date
    Sep 2005
    Location
    istanbul / Turkey
    Posts
    133
    Why don't you use the CSV file format? http://en.wikipedia.org/wiki/Comma-separated_values
    You can find libraries for CSV ...

  3. #3
    Join Date
    Sep 2007
    Posts
    4

    Legacy files

    I would if I were creating the files. Unfortunately, most of these files are legacy database query reports. Even many of the new ones come in the same form as the database website interface dictates the format of the saved query results, not the user. I have no control over the saved file format.

  4. #4
    Join Date
    Sep 2007
    Posts
    3
    Is there any difference between header lines and data lines?
    Can you please give us an example of the file?
    I'd suggest using regular expressions to automate splitting the lines into columns.

  5. #5
    Join Date
    Sep 2007
    Posts
    4
    File may have format like this...

    Transaction ID: 0000000-00-0000
    User: xxxxxxxxxx
    Date: 22-MAY-2007 14:56:23.45

    Wed 13-MAR-2007

    ab cdef gh ijklmno pqrs tuvw xyz
    ab cdef gh ijklmno pqrs tuvw xyz
    ab cdef gh ijklmno pqrs tuvw xyz
    ab cdef gh ijklmno pqrs tuvw xyz

    Thu 14-MAR-2007

    ab cdef gh ijklmno pqrs tuvw xyz
    ab cdef gh ijklmno pqrs tuvw xyz
    ab cdef gh ijklmno pqrs tuvw xyz
    ab cdef gh ijklmno pqrs tuvw xyz

    Fri 15-MAR-2007

    ab cdef gh ijklmno pqrs tuvw xyz
    ab cdef gh ijklmno pqrs tuvw xyz
    ab cdef gh ijklmno pqrs tuvw xyz
    ab cdef gh ijklmno pqrs tuvw xyz

    Opening text is variable in length depending on the type of database report the user asked for. The separator lines between blocks of columnar data, when they exist, generally don't exceed 3 lines. This type of text file is fairly easy to import if you have access to the "Get External Data" tool in MS Access/Excel. In my case, the users I'm trying to help don't have MS Office, which is why I'm building this tool. Many of the files represent the exact same query with the exception of date ranges. Once the user identifies columns in such a file, they want to save those settings for reuse. Using these settings as a filter, I could automatically hide/delete rows that don't fit the column boundaries, making the initial processing of subsequent files easier.
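The filtering idea in the last sentence could look roughly like the sketch below. It assumes the saved settings are simply the start index of each column, and it uses a deliberately simple test: a line "fits" if it is long enough to reach every column and has a space immediately before each column start. The names `BoundaryFilter`, `fits`, and `keepDataRows` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class BoundaryFilter {

    /** True if the line reaches every column and leaves each separator position blank. */
    static boolean fits(String line, int[] columnStarts) {
        for (int start : columnStarts) {
            if (line.length() <= start) {
                return false; // too short to contain this column
            }
            if (start > 0 && line.charAt(start - 1) != ' ') {
                return false; // text runs across a saved column boundary
            }
        }
        return true;
    }

    /** Keeps only the lines that fit the saved column starts. */
    static List<String> keepDataRows(List<String> lines, int[] columnStarts) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (fits(line, columnStarts)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Column starts for the sample rows "ab cdef gh ijklmno pqrs tuvw xyz"
        int[] saved = {0, 3, 8, 11, 19, 24, 29};
        List<String> lines = List.of(
                "Date: 22-MAY-2007 14:56:23.45",
                "",
                "Wed 13-MAR-2007",
                "ab cdef gh ijklmno pqrs tuvw xyz");
        for (String row : keepDataRows(lines, saved)) {
            System.out.println(row);
        }
    }
}
```

A header line can still pass this test by coincidence, so the result is better treated as a first pass that the user can correct than as a final answer.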

  7. #7
    Join Date
    Sep 2007
    Posts
    3
    So, are there any known restrictions on the columnar data (alphabetic only, alphanumeric, etc.)? And is there any way to know the number of columns?
    Why not use a regular expression with the split function? I suspect this would let you ignore all the lines that have a different number of elements.
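A minimal sketch of that suggestion, assuming the expected number of columns is known up front and that no field contains whitespace: split each line on runs of whitespace with `String.split` and keep only the lines that yield the expected count. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitFilter {

    /** Splits each line on whitespace and keeps rows with exactly expectedColumns fields. */
    static List<String[]> rowsWithColumnCount(List<String> lines, int expectedColumns) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines between blocks
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length == expectedColumns) {
                rows.add(fields);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "Transaction ID: 0000000-00-0000",
                "Wed 13-MAR-2007",
                "",
                "ab cdef gh ijklmno pqrs tuvw xyz");
        for (String[] row : rowsWithColumnCount(lines, 7)) {
            System.out.println(String.join("|", row));
        }
    }
}
```

This breaks down as soon as a field can contain a space or be empty, because the token count no longer matches the column count; fixed column positions are more robust in that case.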

  8. #8
    Join Date
    Dec 2004
    Location
    San Bernardino County, California
    Posts
    1,468
    Use Scanner. It is made for this kind of challenge. Break the input file into tokens using the CR/LF line break as the delimiter. Then, creating a new Scanner instance for each token, break it into sub-tokens using whitespace as the delimiter. Use next() or nextInt() or whatever fits, inserting a tab after each sub-token, until there is no "next" remaining.
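Roughly, that approach could be sketched as below with `java.util.Scanner`. This version reads lines with `hasNextLine()`/`nextLine()` rather than setting a custom delimiter, reads every sub-token as text with `next()`, and puts tabs between tokens rather than after each one; the class and method names are invented for the example.

```java
import java.util.Scanner;

final class ScannerTabs {

    /** Rewrites whitespace-separated text as tab-separated text, one output line per input line. */
    static String toTabSeparated(String text) {
        StringBuilder out = new StringBuilder();
        try (Scanner lines = new Scanner(text)) {
            while (lines.hasNextLine()) {
                String line = lines.nextLine();
                // A second Scanner splits the line on whitespace (the default delimiter)
                try (Scanner tokens = new Scanner(line)) {
                    boolean first = true;
                    while (tokens.hasNext()) {
                        if (!first) {
                            out.append('\t');
                        }
                        out.append(tokens.next());
                        first = false;
                    }
                }
                out.append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(toTabSeparated("ab cdef gh\nijklmno pqrs"));
    }
}
```

To read from a file instead of a string, a `Scanner` can be constructed from a `java.io.File`; header and separator lines would still need to be filtered out separately.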
    Last edited by nspils; 09-19-2007 at 08:04 PM.

  9. #9
    Join Date
    Sep 2005
    Location
    istanbul / Turkey
    Posts
    133
    hms,
    if simply ignoring lines with an unmatched number of words is acceptable,
    my comments below are irrelevant.

    There should be a detailed definition of the output format.
    It has to handle some special situations somehow:
    for example, what if the data in a column includes whitespace (which is also the column separator) or a newline (which is also the row separator)?

    The Wikipedia CSV link explains these special situations, and you may adopt its conventions.
    If you can't find a library:
    - you may alter an open-source CSV library.
    - a person who has experience with ANTLR, JavaCC, etc. may help you. (best)
    - you will write your parser by hand. (worst)
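As an illustration of the output-format concern, here is a small sketch of the common CSV quoting convention: a field containing a comma, a double quote, or a line break is wrapped in double quotes, and embedded double quotes are doubled. It is a hand-rolled example with hypothetical names, not a replacement for a tested CSV library.

```java
final class CsvEscape {

    /** Quotes a field only when it contains a comma, a double quote, or a line break. */
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Joins already-split fields into one CSV record (without a trailing line break). */
    static String toCsvLine(String[] fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(escape(fields[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCsvLine(new String[] {"ab", "c d", "e,f"})); // prints ab,c d,"e,f"
    }
}
```

With this convention a space inside a field needs no special handling on output, because the comma, not whitespace, separates the columns.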

    good luck.
