[dba-Tech] Something I just learned

Steve Erbach erbachs at gmail.com
Thu Mar 31 08:27:46 CST 2005


Marty,

>> There are the Windows typographic characters known by their HTML4
character entity names, such as &mdash;, &lsquo;, &trade; and so on (
and " emdash etc.). These have in fact been around for a while, and
are understood even by a number of older browsers that do not support
utf-8 <<

Interesting. I'm using Gmail and the first two characters you typed
within the parentheses showed up as little boxes. You're right about
the lack of universality. Gmail tells me that JavaScript is NOT
enabled in Firefox when I know for a fact that it is...hmmm.

>> I am scratching my head about this because these Windows
typographical characters, ANSI 128-159, count as control characters and
are considered illegal characters in XML. For example, ™ (decimal 153,
hex 99) should become the Unicode escaped character "&#8482;", but some
UTF-8 conversion programs don't do this conversion properly, so it
screws up your XML parsers with illegal characters. I am almost tempted
to do everything in UTF-16. <<
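
A rough Python sketch of the decimal 153 / hex 99 case quoted above. It assumes the bytes were really written as windows-1252 but got read as ISO-8859-1. The helper names repair_c1 and xml_escape_non_ascii are made up for illustration and are not from any particular conversion program.

```python
# Byte 0x99 (decimal 153) means different things in the two code pages.
raw = bytes([0x99])

as_latin1 = raw.decode("latin-1")   # U+0099, a C1 control character
as_cp1252 = raw.decode("cp1252")    # U+2122, the trade mark sign


def repair_c1(text: str) -> str:
    """Re-read stray C1 controls (U+0080-U+009F) as windows-1252 characters.

    Hypothetical helper: assumes every C1 control in `text` came from a
    windows-1252 byte that was mis-decoded as ISO-8859-1.
    """
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # 0x81, 0x8D, 0x8F, 0x90, 0x9D are unassigned in cp1252
        out.append(ch)
    return "".join(out)


def xml_escape_non_ascii(text: str) -> str:
    """Write every non-ASCII character as a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

With these, xml_escape_non_ascii(repair_c1("Acme\x99")) gives "Acme&#8482;", which is plain ASCII and so survives whatever encoding the XML file ends up declaring.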

I've seen a similar thing in an XML file I've been downloading lately.
I now devote some computer resources to the search for Riemann
hypothesis zeroes, one of those multiple computer shared resource
things. Anyway, the website (zetagrid.net) publishes stats on who's
produced the most zeroes over a period of time. Well, one of those
lists has a control character in it -- Ctrl-Z (hex 1A) -- which Access chokes on
when I try to import the XML file. My XML editor (Peter's XML Editor)
finds the naughty character and I can erase it and re-save the file.
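
The erase-and-re-save step could also be scripted. This is a minimal Python sketch, assuming the file has already been read into a string. The character ranges come from the XML 1.0 "Char" production. The function name strip_illegal_xml_chars and the sample string are invented for illustration.

```python
import re

# Everything outside these ranges is not a legal XML 1.0 character.
# Ctrl-Z (U+001A) falls outside them, so it gets removed.
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text: str) -> str:
    """Return `text` with characters that XML 1.0 forbids removed."""
    return _ILLEGAL_XML_CHARS.sub("", text)


# Made-up sample line with an embedded Ctrl-Z.
sample = "<name>Some\x1aUser</name>"
cleaned = strip_illegal_xml_chars(sample)
```

Tab, line feed and carriage return are kept because XML allows them. Every other character below U+0020 is dropped.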

I take it that you work much more extensively with Unicode than I do
in my piddly little efforts.

Steve Erbach


On Wed, 30 Mar 2005 15:40:22 -0800, MartyConnelly <martyconnelly at shaw.ca> wrote:
> Just a word of warning about some of this, you will run into it at some
> point in time since Unicode files can be UTF-8 or UTF-16.
> There are the Windows typographic characters known by their HTML4
> character entity names, such as &mdash;, &lsquo;, &trade; and so on (
> and " emdash etc.). These have in fact been around for a while, and are
> understood even by a number of older browsers that do not support utf-8
> and would not be able to understand the corresponding unicode
> &#bignumber; representations. These have been around from before the
> Unicode standard was set.
> 
> Now consider the Western European "MS-Windows" character set,
> windows-1252. This is a special cause of confusion: all of the
> displayable character code values of iso-8859-1 coincide with the same
> codes in this Windows code - but additionally, the Windows code assigns
> displayable characters in the area which the iso-8859-n codes reserved
> for control functions. In unicode, those characters have code values
> above 256.
> 
> I am scratching my head about this because these Windows typographical
> characters, ANSI 128-159, count as control characters and are considered
> illegal characters in XML. For example, ™ (decimal 153, hex 99) should
> become the Unicode escaped character "&#8482;", but some UTF-8 conversion
> programs don't do this conversion properly, so it screws up your XML
> parsers with illegal characters. I am almost tempted to do everything in
> UTF-16.
> 
> The windows control characters that cause the problem run from ANSI
> decimal 128-159.
> 
> If that isn't enough, some little darlings changed the ISO-8859-1 spec to
> handle the Euro character, and you now have to look at Latin-9 (ISO-8859-15).
> 
> I still haven't grokked all this yet. I still have to hunt through XML
> files with a hex editor to see what is going on.
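
Part of the hex editor hunt can be approximated in a few lines of Python. This sketch assumes the file is supposed to be UTF-8. It reports every C0 control byte other than tab, LF and CR, plus the first byte that is not valid UTF-8. The name find_suspect_bytes and the sample data are hypothetical.

```python
def find_suspect_bytes(data: bytes):
    """Return sorted (offset, hex value, reason) tuples worth a closer look."""
    hits = []
    for offset, b in enumerate(data):
        # Control bytes other than tab, line feed and carriage return.
        if b < 0x20 and b not in (0x09, 0x0A, 0x0D):
            hits.append((offset, f"0x{b:02X}", "control character"))
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # Only the first undecodable byte is reported.
        bad = data[err.start]
        hits.append((err.start, f"0x{bad:02X}", "not valid UTF-8"))
    return sorted(hits)


# Made-up sample: a raw windows-1252 trade mark byte and a Ctrl-Z.
sample = b"<a>Acme\x99 \x1a</a>"
report = find_suspect_bytes(sample)
```

A raw windows-1252 byte such as 0x99 shows up as "not valid UTF-8" because, on its own, it is a continuation byte with no lead byte in front of it.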


