Character set encoding

XML uses Unicode, but that still leaves certain encoding issues to deal with. See C9 of the XML FAQ (http://www.ucc.ie/xml/#FAQ-CHARENTS) for more details.

John Cowan posted a nice explanation of them last January. See
http://www.lists.ic.ac.uk/hypermail/xml-dev/xml-dev-Jan-1999/0176.html.

To make this easier, special characters are often represented as entity references, like the &amp; and &lt; strings used in HTML. A common representation of "a" with an umlaut is &auml;, but this has to be declared before a processor understands it. See sections 4.1
(http://www.w3.org/TR/REC-xml#dt-entref) and 4.2.1
(http://www.w3.org/TR/REC-xml#sec-internal-ent) of the XML spec for more.

The XML Specification says that any document without a character-set encoding specification is to be considered a UTF-8 encoded file (or UTF-16, if it begins with a byte order mark). If you are creating your XML file on the Windows platform and are using a simple text editor (such as Notepad) or any other Windows application, then your character set is very likely the regular "ANSI" Windows character set (technically called WinLatin-1 or Codepage 1252, whose registered encoding name is "windows-1252").

Consequently, you have to specify this encoding at the top of your XML file:

<?xml version="1.0" encoding="windows-1252"?>

If you add this to your XML file, you will notice that XML Spy will immediately display these characters correctly and also Internet Explorer 5 will now know what you are talking about ;)

The reason that XML Spy did replace these letters with an "_" previously is that in the default UTF-8 encoding, these characters would have to be encoded differently - and the byte codes that were contained in the file were simply illegal in terms of the UTF-8 encoding (but it did warn you that illegal characters were encountered, didn't it)...

... ALEXANDER FALK ... President, CEO ... http://www.icon.at/falk
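
The behaviour described in the post above can be reproduced with a short Python sketch. It is only an illustration: it uses the parser in Python's standard library rather than IE5 or XML Spy, and the sample word and the helper name parses are made up for the example. The same bytes are offered to the parser once without and once with the windows-1252 encoding declaration.

```python
import xml.etree.ElementTree as ET

# "a" with an umlaut, as a Windows-1252 editor would save it: the single byte 0xE4.
BODY = b"<word>M\xe4dchen</word>"

WITHOUT_DECL = BODY
WITH_DECL = b'<?xml version="1.0" encoding="windows-1252"?>' + BODY


def parses(document: bytes) -> bool:
    """Return True if the parser accepts the document bytes (hypothetical helper)."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Without a declaration the parser assumes UTF-8, where a lone 0xE4 byte is illegal.
print("no declaration accepted:", parses(WITHOUT_DECL))

# With the declaration the same byte is read as a Windows-1252 character.
print("with declaration accepted:", parses(WITH_DECL))
print("text:", ET.fromstring(WITH_DECL).text)
```

The document without the declaration should be rejected, which corresponds to the "invalid character" error reported by IE5; the one with the declaration should parse.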


I have just started using XML. I created a DTD to represent a chapter in a text book. I also created an XML document for a chapter using data from a text book.

I use IE5 and XML-Spy software to view the XML document I have created.

The data contains some letters with accent marks (or "diacritics"). Some of these letters are í, ó, ñ, à, ú.

IE5 returns an error that it has encountered an invalid character when it finds one of the above characters, and does not process the XML document any further.

XML-Spy software replaces these letters with an underscore character and displays the document in its own browser window.

Can anybody explain what is happening? Does this mean that I cannot use the letters with accent marks in an XML document?

Does this mean that i cannot use the letters with accent marks in a XML document?

The general problem was originally: what byte does a file use to represent a particular accented character? For example, to represent an a with an umlaut, a Mac uses byte 138, DOS uses byte 132, and Windows and Unicode use the value 228.
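
The byte values quoted above can be checked with a few lines of Python. This is a sketch under assumptions: the post does not say which encodings it means by "Mac" and "DOS", so MacRoman and code page 437 are assumed here, and the function name byte_values is invented for the example.

```python
A_UMLAUT = "\u00e4"  # "a" with an umlaut

# Assumed mapping from the platforms named in the text to Python codec names.
PLATFORM_CODECS = {
    "Mac": "mac_roman",
    "DOS": "cp437",
    "Windows": "cp1252",
}


def byte_values(char: str) -> dict:
    """Return the single byte each platform encoding uses for char."""
    return {name: char.encode(codec)[0] for name, codec in PLATFORM_CODECS.items()}


for platform, value in byte_values(A_UMLAUT).items():
    print(platform, value)

# The Unicode value is a code point, not a byte; UTF-8 stores it as two bytes.
print("Unicode code point", ord(A_UMLAUT))
print("UTF-8 bytes", list(A_UMLAUT.encode("utf-8")))
```

Note that the Unicode value 228 is a character number rather than a byte in a file; how it is stored still depends on the encoding, which is why the encoding declaration matters.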

There will be about 6000 more very rare characters in Unicode 3.0, which is now in beta, none of them in the Astral Planes. Plane 2 will be dedicated, probably in Unicode 4.0, to the ultra-rare characters, particularly those used only in names, etc.

I'm fairly new to this, and I'm working on an XML version of a scholarly text that includes quotations in ancient Greek. I've figured out how to use ISO character entities to display Greek letters, but these seem to be only for modern Greek -- ancient Greek includes more accent marks (like graves) and marks to indicate rough or smooth breathing. Does anyone know of any XML versions of ancient Greek character entities?

I have another question that will very likely reveal my thin understanding of character encoding issues --

I notice that the creators of the Women Writers Project at Brown have produced their own ancient Greek character entities for SGML. http://www.wwp.brown.edu/encoding/training/Entities/wwpgrk1.ent.html Could I do something similar, and somehow or other create my own or adapt an SGML entity set for XML? If so, how?

One additional, smaller, question: I have so far failed in my attempts to declare a character entity set rather than individual entities. I have tried the "typical invocation" without success. Would someone mind sending me an example of a set invocation in action? (Of possible relevance: I'm using IE5, an external DTD, and so far my files are not networked.)

 

I would recommend using UTF-8 all the time. At the moment this is a bit problematic because most editors do not support this encoding, but it is supported in new versions of both Netscape and IE. If you put:
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
then the pages will display correctly regardless of the current browser settings. In my work I use common tools and convert to Unicode with my small converter based on the Java API
(http://zvon.vscht.cz:/ZvonHTML/Downloads/convert_en.html).
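
The kind of conversion mentioned above can be sketched in a few lines. This is not the converter linked in the post, which is Java-based; it is a minimal Python illustration, and the function name to_utf8 and the default source encoding of windows-1252 are assumptions made for the example.

```python
def to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode bytes from source_encoding (assumed windows-1252) as UTF-8."""
    # Decode to Unicode text first, then write that text out as UTF-8 bytes.
    return data.decode(source_encoding).encode("utf-8")


# "niño" as a Windows-1252 editor would store it: ñ is the single byte 0xF1.
legacy = b"ni\xf1o"
converted = to_utf8(legacy)

print(list(legacy))
print(list(converted))
print(converted.decode("utf-8"))
```

If the converted file is an XML document carrying an encoding declaration, that declaration would also need to be changed or removed so that it matches the new bytes.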

> >Consequently, you have to specify this encoding at the top of your XML
> >file:
> > <?xml version="1.0" encoding="windows-1252"?>
>
> Does Setting the encoding scheme of the document to 'windows-1252' mean
> that the document now would become platform dependent? I hope this is a
> valid question.
>
> >A common representation of "a" with an umlaut is &auml;, but this has to
> >be declared before a processor understands it.
>
> Does declaring mean to declare it in the DTD i have created? Another
> interesting thing i noted was that if i used &auml without any declaration,
> IE5 did not display the document. But I used the equivalent numeric Entity
> which is '&#228;' in the same document, IE5 displayed it without any
> problem. Why?

 

>>A common representation of "a" with an umlaut is &auml;, but this has to
>>be declared before a processor understands it.

>Does declaring mean to declare it in the DTD i have created?

Yes, as an entity declaration.

>Another interesting thing i noted was that if i used &auml without any
>declaration,
>IE5 did not display the document. But I used the equivalent numeric Entity
>which is '&#228;' in the same document, IE5 displayed it without any
>problem. Why?

&#228; is a character reference, and IE knows the character it refers to.
&auml; is an entity reference, and these must refer to declared entities
(with the exceptions of &amp;, &lt;, &gt;, &apos;, and &quot;, which parsers
are required to recognize without declarations). The following entity
declaration in your DTD will tell the processor which character to use for
&auml;:

<!ENTITY auml "&#228;" >

You can then use &auml; instead of the less-readable &#228;.
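
The difference between the two kinds of reference can be seen with Python's standard-library parser. This is a sketch, not a reproduction of IE5's behaviour: for brevity the entity is declared in an internal DTD subset rather than in an external DTD, and the sample word and the helper name text_or_error are invented for the example.

```python
import xml.etree.ElementTree as ET

# Entity declared in an internal DTD subset (an assumption made for brevity).
DECLARED = '<!DOCTYPE word [ <!ENTITY auml "&#228;"> ]><word>M&auml;dchen</word>'

# Same entity reference, but with no declaration anywhere.
UNDECLARED = "<word>M&auml;dchen</word>"

# Numeric character reference: needs no declaration.
CHAR_REF = "<word>M&#228;dchen</word>"


def text_or_error(document: str) -> str:
    """Return the element text, or a short message if parsing fails."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError as exc:
        return f"error: {exc}"


for label, doc in [("declared", DECLARED), ("undeclared", UNDECLARED), ("char ref", CHAR_REF)]:
    print(label, "->", text_or_error(doc))
```

The declared entity and the character reference should both yield the same text, while the undeclared entity reference should be reported as an error.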

Rick Jelliffe put together a set of such declarations based on the SGML
equivalents. See http://www.oasis-open.org/cover/xml-ISOents.txt. For a
little background on these, search for "XML-ized" in
http://www.oasis-open.org/cover/xml.html.

Bob DuCharme www.snee.com/bob <bob@snee.com>
see www.snee.com/bob/xmlann for "XML: The Annotated Specification"
from Prentice Hall.

---------------

Hello,

You might try ISO-8859-2 instead of windows-1252. The ISO encoding seems to work
for me. I may also be off-base.

Good luck,

Richard

Richard Lander
relander at uwaterloo.ca
http://pdbeam.uwaterloo.ca/~rlander/

Professional XML Authoring
http://www.on-line-learning.com/

----- Original Message -----
From: Deepak Chandran <deepakc@VERSAWARE.COM>
To: <XML-L@LISTSERV.HEANET.IE>
Sent: Thursday, August 05, 1999 12:43 AM
Subject: Re: Can Letters with accent marks be used in a XML document?


> Hi,
>
> Thanks for the help and advice.
>
> >Consequently, you have to specify this encoding at the top of your XML
> >file:
> > <?xml version="1.0" encoding="windows-1252"?>
>
> Does Setting the encoding scheme of the document to 'windows-1252' mean
> that the document now would become platform dependent? I hope this is a
> valid question.
>
> >A common representation of "a" with an umlaut is &auml;, but this has to
> >be declared before a processor understands it.
>
> Does declaring mean to declare it in the DTD i have created? Another
> interesting thing i noted was that if i used &auml without any declaration,
> IE5 did not display the document. But I used the equivalent numeric Entity
> which is '&#228;' in the same document, IE5 displayed it without any
> problem. Why?
>
> Regards,
> Deepak
>-- Ron Bourret

----------------------

I've put my first attempt at character entities for Ancient Greek at:

http://www.jtauber.com/hgrk/greek.ent

They are based on the TEI entity names for Ancient Greek.
Please let me know if you find any mistakes.

NOTE: This is just a temporary location. I'm about to go on a flight from
Australia to the US and just wanted to put it up somewhere before I left.

James