[AccessD] Translation of UTF-8 to UTF-16 (Unicode) or Big5 etc, for Text files or XML

MartyConnelly martyconnelly at shaw.ca
Sun Jun 13 14:12:55 CDT 2004


I was reading an XML article on encoding in which the author said he 
couldn't get this to work:
http://www.topxml.com/code/default.asp?p=3&id=v20010810181946
It might be useful to someone. I didn't know you could do this with a 
stream, and I took a guess at how it handles binaries.
There are other ways to do this, but this is a short method.

Essentially, the code below takes a text or XML file and changes the 
encoding from UTF-8 to UTF-16 (Unicode).
It uses the ADODB Stream object and its Charset property. I haven't seen 
this written up anywhere.
The trick is to read the text out of the ADODB stream and write it back 
under the new charset. Just loading and saving the file results in a 
double BOM and garbage.
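
One way to see whether an output file has ended up with a stray or doubled BOM is to dump its first few bytes. The helper below is a minimal sketch and not part of the original method; the name FirstBytesHex and its parameters are made up for illustration, and it assumes a reference to Microsoft ActiveX Data Objects is set.

```vba
' Hypothetical helper: returns the first few bytes of a file as hex text.
' Assumes a project reference to Microsoft ActiveX Data Objects 2.x.
Function FirstBytesHex(ByVal strPath As String, _
                       Optional ByVal lngCount As Long = 6) As String
    Dim stm As ADODB.Stream
    Dim bytData() As Byte
    Dim i As Long
    Dim strOut As String

    Set stm = New ADODB.Stream
    stm.Type = adTypeBinary          ' read raw bytes, no charset translation
    stm.Open
    stm.LoadFromFile strPath

    If stm.Size > 0 Then
        If lngCount > stm.Size Then lngCount = stm.Size
        bytData = stm.Read(lngCount)
        For i = LBound(bytData) To UBound(bytData)
            ' pad each byte to two hex digits
            strOut = strOut & Right$("0" & Hex$(bytData(i)), 2) & " "
        Next i
    End If

    stm.Close
    Set stm = Nothing
    FirstBytesHex = Trim$(strOut)
End Function
```

Calling Debug.Print FirstBytesHex("test16.xml") in the Immediate window shows the leading bytes. A UTF-8 BOM is EF BB BF, and a little-endian UTF-16 BOM is FF FE, so a marker that appears twice or appears re-encoded at the start is easy to spot.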

I am guessing, but you may be able to go back and forth between character 
set encodings, assuming you are not doing something ridiculous like 
converting Thai Unicode to ASCII.
This would include Chinese Big5, JIS and various ISO encodings.
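
If going back and forth does work, a general version might look like the sketch below. It is untested guesswork in the same spirit: the name ConvertFileCharset and its parameters are invented for illustration, and it uses two separate streams rather than rewriting one stream, so nothing from the input is left behind in the output. Characters that the target charset cannot represent will presumably be lost or replaced.

```vba
' Hypothetical sketch: convert a text file from one charset to another.
' Assumes a project reference to Microsoft ActiveX Data Objects 2.x.
' The charset names must be ones the machine knows, e.g. "UTF-8", "Unicode", "Big5".
Sub ConvertFileCharset(ByVal strSource As String, _
                       ByVal strSourceCharset As String, _
                       ByVal strTarget As String, _
                       ByVal strTargetCharset As String)
    Dim stmIn As ADODB.Stream
    Dim stmOut As ADODB.Stream
    Dim strData As String

    ' Read the whole input file as text in the source charset
    Set stmIn = New ADODB.Stream
    stmIn.Type = adTypeText
    stmIn.Charset = strSourceCharset
    stmIn.Open
    stmIn.LoadFromFile strSource
    strData = stmIn.ReadText()
    stmIn.Close

    ' Write the text to a fresh stream in the target charset
    Set stmOut = New ADODB.Stream
    stmOut.Type = adTypeText
    stmOut.Charset = strTargetCharset
    stmOut.Open
    stmOut.WriteText strData
    stmOut.SaveToFile strTarget, adSaveCreateOverWrite
    stmOut.Close

    Set stmIn = Nothing
    Set stmOut = Nothing
End Sub
```

For example, ConvertFileCharset "in.txt", "UTF-8", "out.txt", "Big5" would attempt a UTF-8 to Big5 conversion; the file names here are placeholders. For XML files the encoding named in the declaration would still need to be changed separately.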

See the input file samples, which contain characters from about 20 
languages in two encodings.
Just for Martin there is even Irish Gaelic; of course, Scots Gaelic is 
known as "The Gaelic".

http://www5.brinkster.com/mconnelly/xmltest/testUTF-8.xml
http://www5.brinkster.com/mconnelly/xmltest/testUTF-16.xm
To play around you will need the files with proper BOM markers.
http://www5.brinkster.com/mconnelly/xmltest/testUTF-16.zip

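' Use only one of the two declarations below (two Consts with the same
' name will not compile).
' For plain text files, leave TopLine empty:
'Const TopLine = ""
' For XML files, use a declaration that names the new encoding:
Const TopLine = "<?xml version=""1.0"" encoding=""utf-16"" ?>"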

Sub ReadUTF8SaveFileInUTF16()
Dim stm As ADODB.Stream 'needs a reference to ADO 2.7
Dim strData As String
Set stm = New ADODB.stream
stm.Open

stm.Charset = "UTF-8"
stm.Position = 0
stm.Type = adTypeText
stm.LoadFromFile "XM8_UTF_vb.xml"
stm.Position = 0
strData = stm.ReadText()
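' The block below can be removed for straight text files rather than xml.
' It swaps the existing XML declaration for TopLine, cutting at the first
' "?>" rather than assuming the old and new declarations have the same
' length ("utf-8" is one character shorter than "utf-16").
If Len(TopLine) > 0 And Left$(strData, 5) = "<?xml" Then
    strData = TopLine & Mid$(strData, InStr(strData, "?>") + 2)
End If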

' Rewind first: Charset can only be changed while Position is 0.
stm.Position = 0
' Set the output file character set, for example:
'   "Unicode", "iso-8859-1", "ascii", "Big5", "hebrew"
' For a list of the character set names known to the machine, see
' the subkeys of HKEY_CLASSES_ROOT\MIME\Database\Charset
' in the Windows Registry.

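stm.Charset = "Unicode"
stm.WriteText strData
' WriteText does not truncate the stream, so cut off any bytes
' left over from the original data past this point.
stm.SetEOS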
stm.SaveToFile "test16.xml", adSaveCreateOverWrite
stm.Close
Set stm = Nothing
End Sub
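
The registry key mentioned in the comments above can also be listed from code instead of browsing it in regedit. This is only a sketch, assuming WMI is available on the machine; it uses the StdRegProv registry provider with late binding, and the Sub name ListCharsetNames is made up for illustration.

```vba
' Hypothetical helper: prints the charset names registered on this machine,
' i.e. the subkeys of HKEY_CLASSES_ROOT\MIME\Database\Charset.
' Assumes WMI (the StdRegProv provider) is available; no extra references needed.
Sub ListCharsetNames()
    Const HKEY_CLASSES_ROOT As Long = &H80000000
    Dim objReg As Object
    Dim varNames As Variant
    Dim i As Long

    Set objReg = GetObject("winmgmts:\\.\root\default:StdRegProv")

    ' EnumKey returns 0 on success and fills varNames with the subkey names
    If objReg.EnumKey(HKEY_CLASSES_ROOT, "MIME\Database\Charset", varNames) = 0 Then
        If IsArray(varNames) Then
            For i = LBound(varNames) To UBound(varNames)
                Debug.Print varNames(i)
            Next i
        End If
    End If

    Set objReg = Nothing
End Sub
```

The names printed in the Immediate window are the strings that can be tried as the stream's Charset value.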


-- 
Marty Connelly
Victoria, B.C.
Canada





