MartyConnelly
martyconnelly at shaw.ca
Thu Jan 26 01:41:45 CST 2006
Yup, I ran into those character encoding and character set problems with
XML files a few years back. I remember discussing them with Pedro Gill at
the time:
http://www.geocities.com/pmpg98_pt/CharacterEncoding.html
But I got around the problem he had with double BOMs and changing
character encodings. Most XML parsers check the BOM of an input file first.

I suppose you could use something like the code below to convert a text
file using ADO streams from, say, ISO-8859-5 to Unicode or even Chinese
Big5, or from UTF-8 to UTF-16. You need a hex editor to check your results
fully and see the BOMs.

' ReadUTF8SaveFileInUTF16 "C:\XML\Gil Encodings\XM8_UTF_vb.xml", "C:\XML\Gil Encodings\test16.xml"
Sub ReadUTF8SaveFileInUTF16(strFileIn As String, strFileOut As String)
    '1/2 ReadToFile / SaveToFile snippet
    'http://www.codeproject.com/soap/XMJFileStreaming.asp?msg=841289&mode=all&userid=903408#xx767979xx
    'used ado 2.7
    Dim stm As ADODB.Stream
    Dim strData As String

    'For a list of the character set strings known to a system, see
    'the subkeys of HKEY_CLASSES_ROOT\MIME\Database\Charset
    'in the Windows Registry.
    Set stm = New ADODB.Stream
    stm.Open
    stm.Type = adTypeText
    stm.Charset = "UTF-8"    'input file character set
    stm.LoadFromFile strFileIn

    ' if you just try and dump out the stream
    ' without reading and writing you get a double BOM
    stm.Position = 0    'reset to beginning of stream
    strData = stm.ReadText()

    ' line below changes the encoding declaration for xml files to
    ' <?xml version="1.0" encoding="UTF-16" ?>
    ' (vbTextCompare so that "UTF-8" in either case is matched)
    strData = Replace(strData, "utf-8", "UTF-16", 1, 1, vbTextCompare)
    Debug.Print strData

    stm.Position = 0
    stm.SetEOS    'drop the old contents so no stale bytes trail the new text
    ' set output file character set
    stm.Charset = "UTF-16"    ' "Unicode" '"iso-8859-1" "ascii" '"Big5" '"hebrew"
    stm.WriteText strData
    stm.SaveToFile strFileOut, adSaveCreateOverWrite
    stm.Close
    Set stm = Nothing
End Sub

Stuart McLachlan wrote:

>On 25 Jan 2006 at 16:15, MartyConnelly wrote:
>
>>Well the microsoft guys seem to write it as binary Unicode complete with a
>>UTF-16 little-endian BOM marker. It might look different opening it with the
>>old Win95 notepad which can't handle saving in unicode.
>
>It's still a text file, it's just that it's UTF-16 not ASCII encoded.
>
>To quote Joel Spolsky in his article "The Absolute Minimum Every Software
>Developer Absolutely, Positively Must Know About Unicode and Character Sets
>(No Excuses!)" http://www.joelonsoftware.com/articles/Unicode.html :
>
><quote>
>It does not make sense to have a string without knowing what encoding it
>uses. You can no longer stick your head in the sand and pretend that
>"plain" text is ASCII.
></quote>
>
>As long as you use a Unicode capable text editor, such as the freeware
>Crimson Editor which I use, you can create/edit a UDL in it.
>
>The INI file routines that I posted a few days ago also work to read/write
>to the file - as long as you create the UDL first as a Unicode encoded text
>file with the correct second line comment.)

-- 
Marty Connelly
Victoria, B.C.
Canada
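
P.S. If a hex editor isn't handy, a rough sketch like the one below might
do for checking which BOM a file starts with. DetectBOM is a made-up helper
name, not part of ADO or of the routine above. It only looks at the first
three bytes, so a UTF-32 LE file (FF FE 00 00) would be reported as
UTF-16 LE.

```vba
' Hypothetical helper (an illustration only): reports which byte order
' mark, if any, is found at the start of a file.
Function DetectBOM(strFile As String) As String
    Dim intFile As Integer
    Dim lngLen As Long
    Dim bytHead(0 To 2) As Byte    'bytes not read from the file stay 0

    intFile = FreeFile
    Open strFile For Binary Access Read As #intFile
    lngLen = LOF(intFile)
    If lngLen > 0 Then Get #intFile, 1, bytHead
    Close #intFile

    If lngLen >= 3 And bytHead(0) = &HEF And bytHead(1) = &HBB _
            And bytHead(2) = &HBF Then
        DetectBOM = "UTF-8"        'EF BB BF
    ElseIf lngLen >= 2 And bytHead(0) = &HFF And bytHead(1) = &HFE Then
        DetectBOM = "UTF-16 LE"    'FF FE
    ElseIf lngLen >= 2 And bytHead(0) = &HFE And bytHead(1) = &HFF Then
        DetectBOM = "UTF-16 BE"    'FE FF
    Else
        DetectBOM = "none"
    End If
End Function
```

From the Immediate window, something like
Debug.Print DetectBOM("C:\XML\Gil Encodings\test16.xml")
should print UTF-16 LE if the conversion went through as expected. A double
BOM would still need a look at the bytes that follow.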