How to detect UTF-8-based encoded strings



Thread: How to detect UTF-8-based encoded strings


  1. #1

    How to detect UTF-8-based encoded strings

    Hi

    A customer of ours asked us to build a VB6 scraper with multi-language support, so we needed to detect UTF-8 encoded strings in order to decode them later for proper display in the application UI. This need comes from a VB6 limitation: its controls do not natively support UTF-8, unlike .NET, where you can tell a control to expect UTF-8 encoding. VB6 natively supports only ISO 8859-1 and/or Windows-1252, so textboxes, dropdowns, listview controls and others cannot be set up to expect UTF-8. As a result we would see weird symbols such as é, è and others, making a mess of the display.

    So, the following function contains a full list of UTF-8 encoded punctuation marks and symbols from languages such as Spanish, Italian, German, Portuguese, French and others, based on an excellent UTF-8 table we got from this link - Ref. http://home.telfort.nl/~t876506/utf8tbl.html

    Basically, the function checks whether any of the listed UTF-8 encoded sequences, separated by | (pipe), is found in the passed string, trying a substring search first. If an entry is not found that way, it falls back to a search based on ASCII values to get a match. For example, a string like "Società" ("society" in Italian) returns FALSE from isUTF8("Società"), while isUTF8("SocietÃ ") returns TRUE, since "Ã " (Ã followed by a non-breaking space) is the UTF-8 encoded representation of "à".
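Because the table lookup described above depends on the list of sequences being complete, a structural check is another option. The sketch below is not part of the original function: `LooksLikeUTF8` is a hypothetical helper that reads the string back as ANSI bytes and looks for a UTF-8 lead byte followed by the right number of continuation bytes. It assumes a Windows-1252 system code page and does not reject every invalid form (overlong encodings, for instance), so treat it as a heuristic.

```vb
' Hypothetical helper, not from the original post.
' Returns True if s contains at least one structurally valid
' multi-byte UTF-8 sequence when read back as ANSI bytes.
Public Function LooksLikeUTF8(ByVal s As String) As Boolean
    Dim b() As Byte
    Dim i As Long, n As Long, k As Long
    Dim ok As Boolean

    If Len(s) = 0 Then Exit Function
    b = StrConv(s, vbFromUnicode)   ' ANSI bytes of the string

    i = 0
    Do While i <= UBound(b)
        ' number of continuation bytes announced by the lead byte
        Select Case b(i)
            Case &HC2 To &HDF: n = 1
            Case &HE0 To &HEF: n = 2
            Case &HF0 To &HF4: n = 3
            Case Else: n = 0
        End Select

        If n > 0 And i + n <= UBound(b) Then
            ok = True
            For k = 1 To n
                ' every continuation byte must look like 10xxxxxx
                If (b(i + k) And &HC0) <> &H80 Then
                    ok = False
                    Exit For
                End If
            Next
            If ok Then
                LooksLikeUTF8 = True
                Exit Function
            End If
        End If
        i = i + 1
    Loop
End Function
```

A lone accented Windows-1252 letter followed by plain ASCII fails the continuation-byte test, so ordinary Western text is normally not flagged.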

    Once you have the TRUE or FALSE result, you can decode the string with the DecodeUTF8() function to display it properly. We found that function elsewhere some time ago and have included it in this post as well.
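The detect-then-decode flow could be wrapped roughly as follows. This is only a sketch: `ToDisplayText` is a hypothetical name, and it assumes the isUTF8 and DecodeUTF8 functions from this post are in the same project.

```vb
' Hypothetical wrapper, not from the original post.
' Assumes isUTF8 and DecodeUTF8 (both from this post) are available.
Public Function ToDisplayText(ByVal sRaw As String) As String
    If isUTF8(sRaw) Then
        ' looks UTF-8 encoded: decode before showing it
        ToDisplayText = DecodeUTF8(sRaw)
    Else
        ' plain text: show as is
        ToDisplayText = sRaw
    End If
End Function
```

A call such as `Text1.Text = ToDisplayText(sScraped)` (with `Text1` and `sScraped` standing in for your own control and variable) would then keep the check in one place.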


    Code:
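    Function isUTF8(ByVal ptstr As String) As Boolean
        Dim tUTFencoded As String
        Dim tUTFencodedaux      'Variant holding the array returned by Split
        Dim tUTFencodedASCII As String
        Dim ptstrASCII As String
        'each variable needs its own type: "Dim iaux, iaux2 As Integer" leaves iaux a Variant
        Dim iaux As Long, iaux2 As Long
        Dim ffound As Boolean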
        
        ffound = False
        ptstrASCII = ""
        
        For iaux = 1 To Len(ptstr)
            ptstrASCII = ptstrASCII & Asc(Mid(ptstr, iaux, 1)) & "|"
        Next
            
        tUTFencoded = "„|…|‡|‰|‘|–|Œ|á||â|ä|ã|å|ç|é|è|ê|ë|*|ì|î|ï|ñ|ó|ò|ô|ö|õ|ú|ù|û|ü|€|°|¢|£|§|€|¶|Ÿ|®|©|„|´|¨|‰|†|˜|ˆž|±|‰|‰|¥|µ|ˆ‚|ˆ‘|ˆ|€|ˆ|ª|º|Ω|æ|ø|¿|¡|¬|ˆš|’|‰ˆ|ˆ†|«|»|€||€|ƒ|•|’|“|€“|€”|€œ|€|€˜|€™|÷|—Š|ÿ|Ÿ|„|‚|€|€|fi|‚|€|·|€š|€ž|€|‚|š|Á|‹|ˆ|Í|Ž|Ï|Œ|“|”||’|š|›|™|ı|†|œ|¯|˜|™|š|¸|˝|›|‡" & _
                    "|š|¦|²|³|¹|¼|½|¾|Ð|—|Ý|ž|ð|ý|þ" & _
                    "‰|ˆž|‰|‰|ˆ‚|ˆ‘|ˆ|€|ˆ|Ω|ˆš|‰ˆ|ˆ†|—Š|„|fi|‚||ı|˜|™|š|˝|›|‡"
    
        tUTFencodedaux = Split(tUTFencoded, "|")
        If UBound(tUTFencodedaux) > 0 Then
            iaux = 0
            Do While Not ffound And Not iaux > UBound(tUTFencodedaux)
                'skip empty entries: InStr reports a match for a zero-length search string
                If Len(tUTFencodedaux(iaux)) > 0 Then
                    If InStr(1, ptstr, tUTFencodedaux(iaux), vbTextCompare) > 0 Then
                        ffound = True
                    End If
                    
                    If Not ffound Then
                        'ASCII numeric search
                        tUTFencodedASCII = ""
                        For iaux2 = 1 To Len(tUTFencodedaux(iaux))
                            'gets ASCII numeric sequence
                            tUTFencodedASCII = tUTFencodedASCII & Asc(Mid(tUTFencodedaux(iaux), iaux2, 1)) & "|"
                        Next
                        
                        'compares numeric sequences
                        If InStr(1, ptstrASCII, tUTFencodedASCII) > 0 Then
                            ffound = True
                        End If
                    End If
                End If
                
                iaux = iaux + 1
            Loop
        End If
        
        isUTF8 = ffound
    End Function
    
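    'ByVal keeps the caller's string untouched; the function edits s in place
    Function DecodeUTF8(ByVal s As String) As String
      Dim i As Long
      Dim c As Integer
      Dim n As Long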
      
      s = s & " "
    
      i = 1
      Do While i <= Len(s)
        c = Asc(Mid(s, i, 1))
        If c And &H80 Then
          n = 1
          Do While i + n < Len(s)
            If (Asc(Mid(s, i + n, 1)) And &HC0) <> &H80 Then
              Exit Do
            End If
            n = n + 1
          Loop
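          'only 2-byte sequences are decoded; the arithmetic is valid for
          'lead bytes &HC2 and &HC3, i.e. U+0080 to U+00FF
          If n = 2 And ((c And &HE0) = &HC0) Then
            c = Asc(Mid(s, i + 1, 1)) + &H40 * (c And &H1)
          Else
            c = 191   'anything else becomes an inverted question mark
          End If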
          s = Left(s, i - 1) + Chr(c) + Mid(s, i + n)
        End If
        i = i + 1
      Loop
      DecodeUTF8 = Left(s, Len(s) - 1)   'drop the sentinel space appended at the top
    End Function
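
Since DecodeUTF8 above only handles two-byte sequences in the Latin-1 range, another option is to let Windows do the conversion. The sketch below is an alternative technique, not the method used in this post: `DecodeUTF8Api` is a hypothetical helper that turns the string back into ANSI bytes and passes them to the Win32 `MultiByteToWideChar` function with the UTF-8 code page. The `Const` and `Declare` lines belong in the declarations section of a standard module. It assumes a Windows-1252 system code page, and characters outside that code page may still show as "?" in standard VB6 controls.

```vb
Private Const CP_UTF8 As Long = 65001

Private Declare Function MultiByteToWideChar Lib "kernel32" ( _
    ByVal CodePage As Long, ByVal dwFlags As Long, _
    ByVal lpMultiByteStr As Long, ByVal cbMultiByte As Long, _
    ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long

' Hypothetical helper, not from the original post.
Public Function DecodeUTF8Api(ByVal s As String) As String
    Dim b() As Byte
    Dim nBytes As Long, nChars As Long
    Dim sOut As String

    If Len(s) = 0 Then Exit Function

    ' recover the raw bytes that were misread as ANSI characters
    b = StrConv(s, vbFromUnicode)
    nBytes = UBound(b) + 1

    ' first call: ask how many UTF-16 characters are needed
    nChars = MultiByteToWideChar(CP_UTF8, 0, VarPtr(b(0)), nBytes, 0, 0)
    If nChars = 0 Then Exit Function

    ' second call: convert into a buffer of that size
    sOut = String$(nChars, vbNullChar)
    MultiByteToWideChar CP_UTF8, 0, VarPtr(b(0)), nBytes, StrPtr(sOut), nChars
    DecodeUTF8Api = sOut
End Function
```
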
    Hope it helps

    Regards

    Diego Sendra
    e-mail: contact@diegosendra.com
    http://www.diegosendra.com

    *Please note that you have to download the function from http://www.diegosendra.com/samples/c...VB6_isUTF8.txt because some of the UTF-8 encoded symbols in the tUTFencoded variable were lost/deleted when the code was copied and pasted into this thread.
    Last edited by diebythesword76; 06-24-2013 at 08:27 PM.
