From: Peter Eisentraut
To: Bruce Momjian
Cc: pgsql-docs@postgresql.org, Tom Lane
Subject: Re: This approach to non-ASCII names does not work
Date: Wed, 20 Sep 2006 23:47:10 +0200

Bruce Momjian wrote:
> The unusual thing is that though our docs web pages use a stated
> encoding as ISO-8859-1, the UTF8 number does generate the proper
> symbol in my browser (Mozilla), so I wonder if >255 codes are assumed
> to be UTF8.

These are two different things.

A numeric character reference picks the numbered character from the
document character set.  The document character set is declared in the
document type declaration (and is therefore fixed by the standards
committee for all users).
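
To make the point about numeric character references concrete, here is a
minimal sketch in Python. It assumes an HTML 4+ document, whose document
character set is UCS (ISO 10646), and it uses the standard library's
html.unescape as a stand-in for what a browser does; the helper name
resolve and the sample bytes are made up for illustration.

```python
import html


def resolve(page_bytes: bytes, encoding: str) -> str:
    """Decode bytes with the stated encoding, then resolve character references.

    Step 1 (encoding): turn the byte sequence into characters.
    Step 2 (document character set): look up each numeric reference
    as a code point number in UCS.
    """
    text = page_bytes.decode(encoding)
    return html.unescape(text)


# "&#8364;" consists only of ASCII bytes, so it decodes identically under
# ISO-8859-1 and UTF-8. The number 8364 is the UCS code point U+20AC
# (EURO SIGN) in both cases; the stated encoding does not change it.
sample = b"Price: 5 &#8364;"
as_latin1 = resolve(sample, "iso-8859-1")
as_utf8 = resolve(sample, "utf-8")

# The UTF-8 byte sequence for the same character is something else entirely:
# three bytes, none of which is "8364".
euro_utf8_bytes = "\u20ac".encode("utf-8")
```

Under these assumptions both calls yield the same string, which is why a
page declared as ISO-8859-1 can still show a character numbered above 255:
the reference is a code point number, not a "UTF8 number".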
The document character sets for commonly used SGML applications are:

HTML 3.2        Latin 1 (ISO 646 + ECMA 94)
HTML 4+         UCS (ISO 10646)
XML             UCS (ISO 10646)
DocBook SGML    Latin 1 (ISO 646 + ECMA 94)

If a font is available, an HTML application (browser) should be able to
process (display) any character from the document character set,
whether it arrives as a plain character or as a character entity.
Conversely, a character not in the document character set, such as a
non-Latin-1 character in DocBook SGML, cannot be processed, strictly
speaking.

The other thing you are talking about is the character *encoding*, which
specifies how the sequence of bytes that makes up the document is to be
interpreted.  Note that this happens before the document character set
is taken into consideration and is pretty much independent of it.  For
example, knowledge of the character encoding is necessary to find the
"&" that starts entities.  Not all character encodings are capable of
encoding all characters in the document character set, which is why you
need to use character entities to access characters outside the
encoding.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/