Q: Using StreamReader to monitor a text file
I have a very interesting finding, at least for me!
I use StreamReader to monitor a text file. Here is some code from a function that keeps reading the tail of the file whenever new strings are added to it.
This function monitors a text file, myLog.txt in this example. At first, the file contains the following lines:
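// Requires: using System.IO;
// my stream reader
StreamReader myReader = new StreamReader("myLog.txt");
// Part of the function that reads chars from the file
// until the end.
// Seek to the saved position. m_pos is a class-level variable, initially 0.
myReader.BaseStream.Seek(m_pos, SeekOrigin.Begin);
// Loop, reading one char at a time until the end of the file.
// Read(char[], int, int) needs a char array, not a single char.
char[] c = new char[1];
while (myReader.Read(c, 0, 1) > 0)
{
    m_pos++; // keep track of the current position in the file.
}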
The function returns the correct strings and they are displayed correctly in a RichTextBox (RTB). After the call, the position (m_pos) is 20. Then I appended "Line 2." as a new line to the file.
It works fine if the file is a text file without any BOM (byte order mark). However, if I save the file as a Unicode text file (I need to support text files with a BOM), the first call (starting from 0) returns the correct string, but the position is still 20. Then the problem comes: the second call (starting after 20) does not work correctly. It reads from ":", not from the start of "Line 2".
What I found is that in a text file with a BOM, such as Unicode, each ASCII char takes 2 bytes: the ASCII code and a \0 byte (ASCII code first in little-endian Unicode, \0 first in big-endian). The first time the function is called (position at 0), Read() returns only the ASCII chars (with no \0 at all!), hence the position is 20 (the 2 BOM bytes are skipped). However, in the second call (seek to 20, then start to read), Read() returns the \0s and continues reading byte by byte!
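To make the byte layout described above concrete, here is a minimal sketch that builds the bytes a "Unicode" file would contain for a short string. The class and method names (BomBytesDemo, EncodeWithBom, Hex) are made up for illustration; only Encoding.GetPreamble(), Encoding.GetBytes() and BitConverter.ToString() come from .NET.

```csharp
using System;
using System.Text;

static class BomBytesDemo
{
    // Returns the bytes a file would hold for the given text: the encoding's
    // BOM (preamble) followed by the encoded characters.
    public static byte[] EncodeWithBom(string text, Encoding encoding)
    {
        byte[] bom = encoding.GetPreamble();
        byte[] body = encoding.GetBytes(text);
        byte[] all = new byte[bom.Length + body.Length];
        Buffer.BlockCopy(bom, 0, all, 0, bom.Length);
        Buffer.BlockCopy(body, 0, all, bom.Length, body.Length);
        return all;
    }

    // Formats bytes as hyphen-separated hex, e.g. "FF-FE-48-00".
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }
}
```

For the text "Hi", Encoding.Unicode gives FF-FE-48-00-69-00 and Encoding.BigEndianUnicode gives FE-FF-00-48-00-69. A likely explanation for the \0 chars in the second call is that a read started in the middle of the file sees no BOM, so the bytes are decoded with the default UTF-8 encoding and every 0x00 byte comes back as a '\0' char.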
I am not sure how to handle text files with a BOM (Unicode, UTF-8 and Unicode big endian). Should I detect the BOM first and adjust the position increment based on the BOM for the first call (starting from 0)? And in the remaining calls (seek to > 0), do I have to skip all the \0s and return only the non-\0 chars?
Are there any better solutions?
By the way, if I convert a plain text string with embedded \0 chars to an RTF string and then place it in an RTB, the chars after the \0 do not show up at all. It looks like the string has been trimmed.
I don't have any information about the behavior you're seeing, but I notice that there are numerous C# "tail" implementations available on the Web: http://www.google.com/search?q=c%23+tail . Maybe one of them will provide a workaround?
I tried a couple of examples. They work for files without a BOM, but not for files with Unicode or Unicode big endian BOMs (you can save a text file in Notepad with these types).
In other words, if there is a BOM header in the text file, tail reading (seeking to a position first and then calling Read() on a .NET StreamReader) may return embedded \0 chars. A string with embedded \0 chars is cut off at the first \0 when it is displayed in a TextBox or RichTextBox.
My solution is to detect the BOM first. If there is a BOM, I just skip it with a seek. This guarantees that Read() reads byte by byte. Then I check each char: if it is \0, I skip it, and I add only the non-\0 chars to the result string. It works OK for all types of text files saved in Notepad.
So, if you work with text files in .NET, be aware of the BOM when you read from the middle of the file.
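The BOM check described in this post could be sketched roughly as follows. This is only an illustration, not the poster's actual code: BomSniffer and GetBomLength are hypothetical names, and only the three BOMs that Notepad writes (UTF-8, Unicode, Unicode big endian) are recognized.

```csharp
using System;
using System.IO;

static class BomSniffer
{
    // Returns the length in bytes of the BOM at the start of the buffer,
    // or 0 if no known BOM is found. UTF-32 is not handled in this sketch.
    public static int GetBomLength(byte[] head)
    {
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return 3; // UTF-8
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return 2; // Unicode (UTF-16 little endian)
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return 2; // Unicode big endian
        return 0;
    }

    // Reads up to the first three bytes of the file and checks them for a BOM.
    // FileShare.ReadWrite lets another process keep appending to the log.
    public static int GetBomLength(string path)
    {
        using (FileStream fs = new FileStream(path, FileMode.Open,
                                              FileAccess.Read, FileShare.ReadWrite))
        {
            byte[] head = new byte[3];
            int n = fs.Read(head, 0, head.Length);
            byte[] seen = new byte[n];
            Array.Copy(head, seen, n);
            return GetBomLength(seen);
        }
    }
}
```

Note that dropping the \0 chars afterwards only gives correct text when every character is in the ASCII range; for anything else, the bytes need to be decoded with the matching encoding.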
It is a simple encoding problem.
Each character takes 1 byte in ANSI and 2 bytes in Unicode.
StreamReader.Read() is a smart method which can determine the encoding type, read the corresponding number of bytes, and return them as a character.
So, no matter what encoding type is used, the 1st call should return "m_pos=20".
The problem is that "m_pos" is the character count, not the byte count.
Passing "m_pos" as the byte offset parameter to Seek() gives the wrong position when reading Unicode.
To solve it, you can obtain the encoding type from myReader.CurrentEncoding.
If Unicode is detected, apply "m_pos += 2" rather than "m_pos++".
Last edited by oupoi; 08-15-2006 at 09:32 AM.
I think CurrentEncoding always returns Unicode, no matter what I have in the text file, with or without a BOM (byte order mark). I used Notepad to save a text file in all the different encoding types. CurrentEncoding is Unicode for all of them.
Sorry for missing some important points.
StreamReader does not detect the encoding type before the first read.
If you simply print CurrentEncoding.EncodingName right away, you will get "Unicode (UTF-8)", which is the default type.
However, if you print CurrentEncoding.EncodingName after the Read() call, an ANSI text file will give you "Unicode (UTF-8)" and a Unicode text file will give "Unicode".
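The before/after difference can be checked with a small probe like the one below. EncodingProbe and its method are hypothetical names; the sketch compares code pages (65001 for UTF-8, 1200 for Unicode, 1201 for Unicode big endian) rather than the display names, since the display names can differ between .NET versions.

```csharp
using System.IO;
using System.Text;

static class EncodingProbe
{
    // Returns { codePageBeforeRead, codePageAfterRead } for the given file.
    public static int[] CodePagesBeforeAndAfterRead(string path)
    {
        // new StreamReader(path) starts out as UTF-8 with BOM detection enabled.
        using (StreamReader reader = new StreamReader(path))
        {
            int before = reader.CurrentEncoding.CodePage;
            reader.Read(); // the first read fills the buffer and inspects the BOM
            int after = reader.CurrentEncoding.CodePage;
            return new int[] { before, after };
        }
    }
}
```

For a file that starts with the Unicode BOM (FF FE), this should report 65001 before the read and 1200 after it. A file without a BOM stays at 65001 in both cases, which is why ANSI and BOM-less UTF-8 look the same to the reader.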
Moreover, the CurrentEncoding.GetByteCount() function can tell you directly how many bytes are used for one (or more) characters.
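As a rough sketch of that idea, the position can be advanced by the encoded size of the chars just read instead of by a fixed 1 or 2. OffsetMath and Advance are made-up names; Encoding.GetByteCount(char[], int, int) is the real .NET call.

```csharp
using System.Text;

static class OffsetMath
{
    // Returns the byte offset after 'charsRead' chars from 'buffer' have been
    // consumed, given the byte offset before the read.
    public static long Advance(long byteOffset, Encoding encoding,
                               char[] buffer, int charsRead)
    {
        return byteOffset + encoding.GetByteCount(buffer, 0, charsRead);
    }
}
```

With Encoding.Unicode, the six chars of "Line 1" advance the offset by 12 bytes; with UTF-8 they advance it by 6. The starting offset still has to include the BOM length, and counting works best on whole blocks of chars, because a surrogate pair split across two calls would be miscounted.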
There is another function, CurrentEncoding.GetMaxByteCount(), which seems more suitable. But it returns 4 bytes per character for UTF-8, which is not really the case for ANSI.
Unfortunately, .NET does not seem able to distinguish ANSI from real UTF-8. So please use Unicode for files with a BOM.
If it is a closed environment with only Unicode files, there is no need to worry about any of this; just change "m_pos++" to "m_pos+=2".
Last edited by oupoi; 08-16-2006 at 11:24 PM.
Indeed, if you don't mind changing your logic flow, keeping track of StreamReader.BaseStream.Length on every function call would be better than using an incremental variable inside the reading loop.
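One way to read that suggestion is to remember a byte offset between calls and decode only the bytes between that offset and the current stream length. The sketch below takes that approach with a plain FileStream. TailReader and ReadNewText are hypothetical names, and the caller is assumed to already know the file's encoding.

```csharp
using System;
using System.IO;
using System.Text;

static class TailReader
{
    // Reads the bytes appended since the last call and decodes them.
    // 'offset' is a byte offset into the file; pass 0 on the first call.
    public static string ReadNewText(string path, Encoding encoding, ref long offset)
    {
        using (FileStream fs = new FileStream(path, FileMode.Open,
                                              FileAccess.Read, FileShare.ReadWrite))
        {
            long length = fs.Length;
            if (length <= offset) return string.Empty; // nothing new

            bool atStart = (offset == 0);
            fs.Seek(offset, SeekOrigin.Begin);
            byte[] bytes = new byte[length - offset];
            int read = 0;
            while (read < bytes.Length)
            {
                int n = fs.Read(bytes, read, bytes.Length - read);
                if (n == 0) break; // file shrank while reading
                read += n;
            }
            offset += read; // the offset counts bytes, BOM included

            // Only the very first chunk can start with a BOM.
            int skip = atStart ? PreambleLength(bytes, read, encoding) : 0;
            return encoding.GetString(bytes, skip, read - skip);
        }
    }

    // Returns the BOM length if the data starts with the encoding's preamble.
    private static int PreambleLength(byte[] bytes, int count, Encoding encoding)
    {
        byte[] bom = encoding.GetPreamble();
        if (bom.Length == 0 || count < bom.Length) return 0;
        for (int i = 0; i < bom.Length; i++)
            if (bytes[i] != bom[i]) return 0;
        return bom.Length;
    }
}
```

Because the offset is always a byte count, the same code works with or without a BOM. One limitation of this sketch: if the writer is caught in the middle of a multi-byte character, the last char of that chunk will be decoded incorrectly.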