MartyConnelly
martyconnelly at shaw.ca
Thu Jan 26 01:41:45 CST 2006
Yup, I ran into those character encoding and character set problems
with XML files a few years back.
I remember discussing them with Pedro Gill at the time:
http://www.geocities.com/pmpg98_pt/CharacterEncoding.html
But I got around the problem he had with double BOMs (byte order marks) and changing
character encodings.
Most XML parsers check the BOM of an input file first.
I suppose you could use something like the code below
to convert a text file, using ADO streams, from say ISO-8859-5 to Unicode
or even Chinese Big5,
or say UTF-8 to UTF-16.
You need a hex editor to check your results fully and see the BOMs.
' ReadUTF8SaveFileInUTF16 "C:\XML\Gil Encodings\XM8_UTF_vb.xml","C:\XML\Gil Encodings\test16.xml"
Sub ReadUTF8SaveFileInUTF16(strFileIn As String, strFileOut As String)
    '1/2 ReadToFile / SaveToFile snippet
    'http://www.codeproject.com/soap/XMJFileStreaming.asp?msg=841289&mode=all&userid=903408#xx767979xx
    'used ado 2.7
    Dim stm As ADODB.Stream
    Dim strData As String

    'the character set names for the machine are in the registry
    'For a list of the character set strings known to a system, see
    'the subkeys of HKEY_CLASSES_ROOT\MIME\Database\Charset
    'in the Windows Registry.
    Set stm = New ADODB.Stream
    stm.Open
    stm.Type = adTypeText
    stm.Charset = "UTF-8" 'input file character set
    stm.LoadFromFile strFileIn

    ' if you just try to dump out the stream
    ' without reading and writing you get a double BOM
    stm.Position = 0 'reset to beginning of stream
    strData = stm.ReadText()

    ' line below changes the encoding declaration for xml files, e.g. to
    ' <?xml version="1.0" encoding="UTF-16" ?>
    ' vbTextCompare so that an upper-case "UTF-8" is matched as well
    strData = Replace(strData, "utf-8", "UTF-16", 1, 1, vbTextCompare)
    Debug.Print strData

    stm.Position = 0
    stm.SetEOS 'truncate the stream so no old bytes trail the new text
    ' set output file character set
    ' alternatives: "Unicode", "iso-8859-1", "ascii", "Big5", "hebrew"
    stm.Charset = "UTF-16"
    stm.WriteText strData
    stm.SaveToFile strFileOut, adSaveCreateOverWrite
    stm.Close
    Set stm = Nothing
End Sub
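If you don't have a hex editor handy, a rough sketch like the one below can
report the BOM at the start of a file. DetectBOM is a hypothetical helper name,
not something from ADO or from the snippet above. It only knows the UTF-8 and
UTF-16 marks, and it also flags the doubled BOM case by looking at the first
six bytes.

```vb
' Hypothetical helper: reports which BOM, if any, a file starts with.
' Plain VBA file I/O only, so no ADO reference is needed.
Function DetectBOM(strFile As String) As String
    Dim intFile As Integer
    Dim lngLen As Long
    Dim bytHead() As Byte
    Dim strHex As String
    Dim i As Long

    intFile = FreeFile
    Open strFile For Binary Access Read As #intFile
    lngLen = LOF(intFile)
    If lngLen > 6 Then lngLen = 6   ' two UTF-8 BOMs take 6 bytes
    If lngLen > 0 Then
        ReDim bytHead(0 To lngLen - 1)
        Get #intFile, 1, bytHead
    End If
    Close #intFile

    ' Build a hex string such as "FFFE3C003F00" from the leading bytes
    For i = 0 To lngLen - 1
        strHex = strHex & Right$("0" & Hex$(bytHead(i)), 2)
    Next i

    ' Check the doubled marks first, then the single ones
    If Left$(strHex, 12) = "EFBBBFEFBBBF" Then
        DetectBOM = "UTF-8 (double BOM)"
    ElseIf Left$(strHex, 8) = "FFFEFFFE" Then
        DetectBOM = "UTF-16 little-endian (double BOM)"
    ElseIf Left$(strHex, 8) = "FEFFFEFF" Then
        DetectBOM = "UTF-16 big-endian (double BOM)"
    ElseIf Left$(strHex, 6) = "EFBBBF" Then
        DetectBOM = "UTF-8"
    ElseIf Left$(strHex, 4) = "FFFE" Then
        DetectBOM = "UTF-16 little-endian"
    ElseIf Left$(strHex, 4) = "FEFF" Then
        DetectBOM = "UTF-16 big-endian"
    Else
        DetectBOM = "no BOM found"
    End If
End Function
```

From the Immediate window you could then try
Debug.Print DetectBOM("C:\XML\Gil Encodings\test16.xml")
on the output file. If the conversion went as intended I would expect a single
UTF-16 little-endian mark, but check it on your own machine.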
Stuart McLachlan wrote:
>On 25 Jan 2006 at 16:15, MartyConnelly wrote:
>
>>Well the microsoft guys seem to write it as binary Unicode complete with a
>>UTF-16 little-endian BOM marker. It might look different opening it with the
>>old Win95 notepad which can't handle saving in unicode.
>
>It's still a text file, it's just that it's UTF-16 not ASCII encoded.
>
>To quote Joel Spolsky in his article "The Absolute Minimum Every Software
>Developer Absolutely, Positively Must Know About Unicode and Character Sets
>(No Excuses!)" http://www.joelonsoftware.com/articles/Unicode.html :
>
><quote>
>It does not make sense to have a string without knowing what encoding it
>uses. You can no longer stick your head in the sand and pretend that
>"plain" text is ASCII.
></quote>
>
>As long as you use a Unicode capable text editor, such as the freeware
>Crimson Editor which I use, you can create/edit a UDL in it.
>
>The INI file routines that I posted a few days ago also work to read/write
>to the file - as long as you create the UDL first as a Unicode encoded text
>file with the correct second line comment.)
--
Marty Connelly
Victoria, B.C.
Canada