[dba-Tech] Something I just learned

MartyConnelly martyconnelly at shaw.ca
Wed Mar 30 17:40:22 CST 2005


Just a word of warning about some of this: you will run into it at some 
point, since Unicode files can be UTF-8 or UTF-16.
There are the Windows typographic characters known by their HTML4 
character entity names, such as &mdash;, &lsquo;, &trade; and so on 
(&nbsp; and &quot;, emdash etc.). These have in fact been around for a 
while, and are understood even by a number of older browsers that do not 
support UTF-8 and would not be able to understand the corresponding 
Unicode &#bignumber; representations. These have been around from before 
the Unicode standard was set.
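
To see how the entity names line up with the &#bignumber; form, here is a minimal sketch using Python 3's standard library. The helper name entity_to_numeric is made up for illustration; name2codepoint and html.unescape are the stock library pieces.

```python
import html
from html.entities import name2codepoint


def entity_to_numeric(name: str) -> str:
    """Turn an HTML 4 entity name (without & and ;) into its numeric form."""
    return f"&#{name2codepoint[name]};"


# The typographic entities all map to code points well above 255.
for name in ("mdash", "lsquo", "trade"):
    numeric = entity_to_numeric(name)
    print(f"&{name};  {numeric}  U+{name2codepoint[name]:04X}")

# Named and numeric references decode to the same character.
assert html.unescape("&trade;") == html.unescape("&#8482;")
```

A browser that only knows the names will handle &trade; but not &#8482;, which is the situation described above.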

Now consider the Western European "MS-Windows" character set, 
windows-1252. This is a special cause of confusion: all of the 
displayable character code values of iso-8859-1 coincide with the same 
codes in this Windows code, but additionally the Windows code assigns 
displayable characters in the area which the iso-8859-n codes reserved 
for control functions. In Unicode, those characters have code values 
above 256.
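
The overlap and the difference can be checked byte by byte. This is only a sketch, assuming Python 3's built-in "latin-1" and "cp1252" codecs; the function name compare is hypothetical.

```python
from typing import Optional, Tuple


def compare(byte_value: int) -> Tuple[int, Optional[int]]:
    """Return the code point one byte decodes to under latin-1 and cp1252."""
    raw = bytes([byte_value])
    latin1_cp = ord(raw.decode("latin-1"))
    try:
        cp1252_cp = ord(raw.decode("cp1252"))
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves a few slots in 0x80-0x9F unassigned.
        cp1252_cp = None
    return latin1_cp, cp1252_cp


# 0xE9 (e acute) is the same in both; 0x99 and 0x80 are not.
for b in (0xE9, 0x99, 0x80):
    l1, w = compare(b)
    print(f"byte 0x{b:02X}: latin-1 U+{l1:04X}, cp1252 "
          + ("unassigned" if w is None else f"U+{w:04X}"))
```

Under latin-1 the bytes 0x80-0x9F come out as control code points with the same numbers, while cp1252 turns them into printable characters whose Unicode values are far higher.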

I am scratching my head about this because these Windows typographical 
characters, ANSI 128-159, count as control characters and are considered 
illegal characters in XML.
For example, the trademark sign is decimal 153 (hex 99) and should become 
the Unicode escaped character "&#8482;", but some UTF-8 conversion 
programs don't do this conversion properly, so it screws up your XML 
parsers with illegal characters. I am almost tempted to do everything in 
UTF-16.
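
Here is a rough sketch of the conversion I would expect such a program to do, assuming the input really is windows-1252 and is otherwise well-formed XML with no encoding declaration. The names cp1252_to_xml_ascii and parses are invented for this example; the codec names and the xmlcharrefreplace error handler are standard Python 3.

```python
import xml.etree.ElementTree as ET


def cp1252_to_xml_ascii(raw: bytes) -> bytes:
    """Decode windows-1252 bytes and re-encode as ASCII, writing every
    non-ASCII character as a decimal character reference."""
    return raw.decode("cp1252").encode("ascii", errors="xmlcharrefreplace")


def parses(data: bytes) -> bool:
    """Report whether ElementTree accepts the bytes as XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


snippet = b"<name>Widget\x99</name>"   # raw byte 153 (hex 99)
safe = cp1252_to_xml_ascii(snippet)

print(parses(snippet))   # the raw byte is not valid UTF-8
print(safe)
print(parses(safe))
```

Note that this does not escape markup characters such as < or &; it only deals with the character-set problem.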

The Windows control characters that cause the problem run from ANSI 
decimal 128 to 159.

If that isn't enough, some little darlings changed the ISO-8859-1 spec to 
handle the Euro character, and you now have to look at Latin-9, also 
known as ISO-8859-15.
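
The Euro is a handy test character because each of these code pages treats it differently. A small sketch, assuming Python 3's standard codecs; euro_byte is a made-up helper name.

```python
from typing import Optional


def euro_byte(encoding: str) -> Optional[bytes]:
    """Encode the Euro sign (U+20AC), or return None if the
    encoding has no slot for it."""
    try:
        return "\u20ac".encode(encoding)
    except UnicodeEncodeError:
        return None


for enc in ("latin-1", "iso-8859-15", "cp1252"):
    print(enc, euro_byte(enc))

# The byte Latin-9 uses for the Euro is the generic currency sign in Latin-1.
print(repr(b"\xa4".decode("latin-1")))
```

So the same byte reads as a different character depending on which of the two ISO tables the reader assumes.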

I still haven't grokked all this yet. I still have to hunt through XML 
files with a hex editor to see what is going on.
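
Short of a hex editor, a few lines of script can at least point at the offending offsets. This is a sketch only, assuming the file is supposed to be UTF-8; find_suspect_bytes is a hypothetical name, and the loop is not tuned for large files.

```python
from typing import List, Tuple


def find_suspect_bytes(data: bytes) -> List[Tuple[int, int]]:
    """Return (offset, byte value) pairs where data stops being valid UTF-8."""
    hits = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest decodes cleanly
        except UnicodeDecodeError as err:
            bad = pos + err.start  # err.start is relative to the slice
            hits.append((bad, data[bad]))
            pos = bad + 1
    return hits


sample = b"<a>Widget\x99 caf\xc3\xa9</a>"
for offset, value in find_suspect_bytes(sample):
    print(f"offset {offset}: byte 0x{value:02X} (decimal {value})")
```

The properly encoded e acute (bytes C3 A9) passes, while the stray windows-1252 byte is reported with its position.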

Steve Erbach wrote:

>I had been wondering how to insert Unicode characters into a document.
>There's a nifty web site (
>http://www.visibone.com/htmlref/char/cer.htm ) that shows the HTML
>numeric codes for the entire Unicode set. I then went into Microsoft
>Word 2003 and found that if you know the hexadecimal number for a
>Unicode character (265B, for example) then all you have to do is type
>that number and press Alt-X, and the number will be converted to the
>Unicode character, in this case, a Black chess Queen.
>
>There's also the entire Unicode set in Word under Insert | Symbol. The
>Symbols tab has a Font list. I picked the Arial Unicode MS font. There
>is another drop down list with "subsets" of the Unicode list. So you
>could jump to Miscellaneous Dingbats and locate the Black Chess Queen
>that way.
>
>The Alt-X shortcut works in Word, WordPad, and Windows Messenger, but
>not in Access.
>
>  
>
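
On the Alt-X tip quoted above: the same hex number can be checked outside Word. A minimal sketch with Python 3's unicodedata module; the function name describe is hypothetical.

```python
import unicodedata
from typing import Tuple


def describe(hex_code: str) -> Tuple[str, str, str]:
    """Return the character, its Unicode name, and its decimal
    HTML reference for a hex code point such as '265B'."""
    ch = chr(int(hex_code, 16))
    return ch, unicodedata.name(ch), f"&#{ord(ch)};"


print(describe("265B"))
```

The decimal reference is what you would paste into HTML, while the hex form is what Word expects before Alt-X.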

-- 
Marty Connelly
Victoria, B.C.
Canada





