How to detect UTF-8-based encoded strings

    A customer asked us to build a multi-language VB6 scraper, so we needed to detect UTF-8 encoded strings in order to decode them later for proper display in the application UI. This need comes from a VB6 limitation: its controls do not natively support UTF-8, unlike .NET, where you can tell a control to expect UTF-8 encoding. VB6 controls natively support only the ISO 8859-1 and/or Windows-1252 encodings, so textboxes, dropdowns, listview controls and others can't be set to expect UTF-8. Without decoding, we would see weird symbols such as Ã© and Ã¨ in place of é and è, among others, which makes a mess of the display.

    The function below contains the UTF-8 encoded punctuation marks and symbols used in languages such as Spanish, Italian, German, Portuguese, French and others, based on an excellent UTF-8 list we got from this link - Ref.

    Basically, the function checks whether any of the listed UTF-8 encoded sequences, separated by | (pipe), is found in the passed string, doing a substring search first. If that finds nothing, it falls back to a search based on ASCII values to get a match. Say, a string like "Società" (Society in English) would return FALSE when calling isUTF8("Società"), while isUTF8("SocietÃ ") would return TRUE, since "Ã " (Ã followed by a non-breaking space) is the UTF-8 encoded representation of à.
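
    To make the fallback search more concrete, here is a minimal sketch of the numeric sequence it compares. ToAsciiSequence and DemoAsciiSequence are hypothetical names used only for illustration (they are not part of the original function), and the values shown assume a Western (Windows-1252) system code page.

    ```vb
    ' Hypothetical helper: builds the same "code|code|" sequence that isUTF8 builds inline.
    Function ToAsciiSequence(ByVal s As String) As String
        Dim i As Long
        Dim result As String
        For i = 1 To Len(s)
            result = result & Asc(Mid(s, i, 1)) & "|"
        Next i
        ToAsciiSequence = result
    End Function

    Sub DemoAsciiSequence()
        Dim encoded As String
        ' "é" is the two bytes C3 A9 in UTF-8; read as single-byte text they become two characters
        encoded = Chr(&HC3) & Chr(&HA9)
        ' On a Windows-1252 system this prints 67|97|102|195|169|
        Debug.Print ToAsciiSequence("Caf" & encoded)
    End Sub
    ```

    Because the comparison works on character codes, it does not depend on how the symbols look in the source file, only on the codes that Asc() reports for them.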

    Once you know whether it's TRUE or FALSE, you can decode the string with the DecodeUTF8() function to display it properly. We found that function somewhere else some time ago and have included it in this post as well.

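    'Returns True when ptstr contains any of the UTF-8 encoded sequences listed in tUTFencoded.
    Function isUTF8(ByVal ptstr As String) As Boolean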
        Dim tUTFencoded As String
        Dim tUTFencodedaux
        Dim tUTFencodedASCII As String
        Dim ptstrASCII As String
        Dim iaux As Long, iaux2 As Long
        Dim ffound As Boolean
        ffound = False
        ptstrASCII = ""
        'builds the numeric (character code) sequence of the passed string
        For iaux = 1 To Len(ptstr)
            ptstrASCII = ptstrASCII & Asc(Mid(ptstr, iaux, 1)) & "|"
        Next iaux
        tUTFencoded = "„|…|‡|‰|‘|–|Œ|á||â|ä|ã|å|ç|é|è|ê|ë|*|ì|î|ï|ñ|ó|ò|ô|ö|õ|ú|ù|û|ü|€|°|¢|£|§|€|¶|Ÿ|®|©|„|´|¨|‰|†|˜|ˆž|±|‰|‰|¥|µ|ˆ‚|ˆ‘|ˆ|€|ˆ|ª|º|Ω|æ|ø|¿|¡|¬|ˆš|’|‰ˆ|ˆ†|«|»|€||€|ƒ|•|’|“|€“|€”|€œ|€|€˜|€™|÷|—Š|ÿ|Ÿ|„|‚|€|€|fi|‚|€|·|€š|€ž|€|‚|š|Á|‹|ˆ|Í|Ž|Ï|Œ|“|”||’|š|›|™|ı|†|œ|¯|˜|™|š|¸|˝|›|‡" & _
                    "|š|¦|²|³|¹|¼|½|¾|Ð|—|Ý|ž|ð|ý|þ"
        tUTFencodedaux = Split(tUTFencoded, "|")
        If UBound(tUTFencodedaux) > 0 Then
            iaux = 0
            Do While Not ffound And Not iaux > UBound(tUTFencodedaux)
                'skip empty entries: InStr() with an empty search string always reports a match
                If Len(tUTFencodedaux(iaux)) > 0 Then
                    'substring search
                    If InStr(1, ptstr, tUTFencodedaux(iaux), vbTextCompare) > 0 Then
                        ffound = True
                    End If
                    If Not ffound Then
                        'ASCII numeric search
                        tUTFencodedASCII = ""
                        For iaux2 = 1 To Len(tUTFencodedaux(iaux))
                            'gets ASCII numeric sequence
                            tUTFencodedASCII = tUTFencodedASCII & Asc(Mid(tUTFencodedaux(iaux), iaux2, 1)) & "|"
                        Next iaux2
                        'compares numeric sequences; the leading "|" keeps e.g. "95|" from matching inside "195|"
                        If InStr(1, "|" & ptstrASCII, "|" & tUTFencodedASCII) > 0 Then
                            ffound = True
                        End If
                    End If
                End If
                iaux = iaux + 1
            Loop
        End If
        isUTF8 = ffound
    End Function
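    'Decodes UTF-8 byte sequences that were read as single-byte (ANSI) characters.
    'Only 2-byte sequences for U+0080..U+00FF are decoded; any other high-bit byte becomes "¿" (191).
    Function DecodeUTF8(ByVal s As String) As String
      Dim i As Long
      Dim c As Integer
      Dim n As Long
      'sentinel so that a multi-byte sequence at the very end is still measured in full
      s = s & " "
      i = 1
      Do While i <= Len(s)
        c = Asc(Mid(s, i, 1))
        If c And &H80 Then
          'count the lead byte plus the continuation bytes (10xxxxxx) that follow it
          n = 1
          Do While i + n < Len(s)
            If (Asc(Mid(s, i + n, 1)) And &HC0) <> &H80 Then
              Exit Do
            End If
            n = n + 1
          Loop
          'lead bytes &HC2 and &HC3 are the only ones that map into the 0..255 range
          If n = 2 And ((c And &HFE) = &HC2) Then
            c = Asc(Mid(s, i + 1, 1)) + &H40 * (c And &H1)
          Else
            c = 191
          End If
          s = Left(s, i - 1) + Chr(c) + Mid(s, i + n)
        End If
        i = i + 1
      Loop
      'drop the sentinel space
      DecodeUTF8 = Left(s, Len(s) - 1)
    End Function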
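
    As a rough sketch of how the two functions might be combined, the wrapper below decodes a string only when isUTF8() flags it. ToDisplayText is a hypothetical name that is not part of the original code, and Text1 and scrapedText in the usage line are placeholder names.

    ```vb
    ' Hypothetical wrapper: decode only when the text looks UTF-8 encoded,
    ' because DecodeUTF8 replaces high-bit characters it cannot decode with "¿".
    Function ToDisplayText(ByVal raw As String) As String
        If isUTF8(raw) Then
            ToDisplayText = DecodeUTF8(raw)
        Else
            ToDisplayText = raw
        End If
    End Function
    ```

    A call such as Text1.Text = ToDisplayText(scrapedText) would then leave plain ISO 8859-1 / Windows-1252 text untouched.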
    Hope it helps


    Diego Sendra

    *Please note: some of the UTF-8 encoded symbols in the tUTFencoded variable were lost/deleted when the code was copied and pasted into this thread, so you have to download the function rather than copy it from here.
    Last edited by diebythesword76; 06-24-2013 at 09:27 PM.
