public inbox for [email protected]  
This approach to non-ASCII names does not work
16+ messages / 4 participants

* This approach to non-ASCII names does not work
@ 2006-09-19 17:24  Tom Lane <[email protected]>
  0 siblings, 1 reply; 16+ messages in thread

From: Tom Lane @ 2006-09-19 17:24 UTC (permalink / raw)
  To: Bruce Momjian <[email protected]>; +Cc: pgsql-docs

openjade -V draft-mode -wall -wno-unused-param -wno-empty -D . -c /usr/share/sgml/docbook/dsssl-stylesheets/catalog -d stylesheet.dsl -i output-html -t sgml postgres.sgml
openjade:release.sgml:567:14:E: "353" is not a character number in the document character set
openjade:release.sgml:1085:56:E: "305" is not a character number in the document character set
openjade:release.sgml:1085:63:E: "305" is not a character number in the document character set
openjade:release.sgml:1497:35:E: "305" is not a character number in the document character set
openjade:release.sgml:1497:42:E: "305" is not a character number in the document character set
openjade:release.sgml:1662:38:E: "305" is not a character number in the document character set
openjade:release.sgml:1662:45:E: "305" is not a character number in the document character set
make: *** [html] Error 1

			regards, tom lane




* Re: This approach to non-ASCII names does not work
@ 2006-09-19 19:13  Bruce Momjian <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 1 reply; 16+ messages in thread

From: Bruce Momjian @ 2006-09-19 19:13 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: pgsql-docs

Tom Lane wrote:
> openjade -V draft-mode -wall -wno-unused-param -wno-empty -D . -c /usr/share/sgml/docbook/dsssl-stylesheets/catalog -d stylesheet.dsl -i output-html -t sgml postgres.sgml
> openjade:release.sgml:567:14:E: "353" is not a character number in the document character set
> openjade:release.sgml:1085:56:E: "305" is not a character number in the document character set
> openjade:release.sgml:1085:63:E: "305" is not a character number in the document character set
> openjade:release.sgml:1497:35:E: "305" is not a character number in the document character set
> openjade:release.sgml:1497:42:E: "305" is not a character number in the document character set
> openjade:release.sgml:1662:38:E: "305" is not a character number in the document character set
> openjade:release.sgml:1662:45:E: "305" is not a character number in the document character set
> make: *** [html] Error 1

Wow, our documentation character set is "ISO-8859-1":

	CONTENT="text/html; charset=ISO-8859-1"

Should we change it to UTF8?

-- 
  Bruce Momjian   [email protected]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +




* Re: This approach to non-ASCII names does not work
@ 2006-09-19 19:19  Tom Lane <[email protected]>
  parent: Bruce Momjian <[email protected]>
  0 siblings, 2 replies; 16+ messages in thread

From: Tom Lane @ 2006-09-19 19:19 UTC (permalink / raw)
  To: Bruce Momjian <[email protected]>; +Cc: pgsql-docs

Bruce Momjian <[email protected]> writes:
> Tom Lane wrote:
>> openjade:release.sgml:567:14:E: "353" is not a character number in the document character set

> Wow, our documentation character set is "ISO-8859-1":
> 	CONTENT="text/html; charset=ISO-8859-1"
> Should we change it to UTF8?

I'm betting you should change those numbers from octal to decimal,
actually.
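
The octal-versus-decimal question can be checked mechanically. The following is a minimal Python sketch, not part of the original message; the helper name `readings` is made up for illustration. It looks up the Unicode name of each number from the openjade errors under both readings:

```python
import unicodedata


def readings(ref: str):
    """Return (name if decimal, name if octal) for a character number given as text."""
    decimal_char = chr(int(ref, 10))
    octal_char = chr(int(ref, 8))
    return unicodedata.name(decimal_char), unicodedata.name(octal_char)


# The two numbers reported by openjade for release.sgml.
for ref in ("353", "305"):
    as_decimal, as_octal = readings(ref)
    print(f"{ref}: decimal = {as_decimal}; octal = {as_octal}")
```

Read as decimal, the numbers name "s with caron" and "dotless i", the characters discussed later in this thread; read as octal, they would land on unrelated Latin-1 letters.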

			regards, tom lane




* Re: This approach to non-ASCII names does not work
@ 2006-09-19 19:22  Bruce Momjian <[email protected]>
  parent: Tom Lane <[email protected]>
  1 sibling, 1 reply; 16+ messages in thread

From: Bruce Momjian @ 2006-09-19 19:22 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: pgsql-docs

Tom Lane wrote:
> Bruce Momjian <[email protected]> writes:
> > Tom Lane wrote:
> >> openjade:release.sgml:567:14:E: "353" is not a character number in the document character set
> 
> > Wow, our documentation character set is "ISO-8859-1":
> > 	CONTENT="text/html; charset=ISO-8859-1"
> > Should we change it to UTF8?
> 
> I'm betting you should change those numbers from octal to decimal,
> actually.

Those numbers are decimal, but certainly cannot be represented in
ISO-8859-1.  They are multi-byte; one is Turkish.
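
As a rough illustration of that claim (a sketch only; `fits_latin1` is a hypothetical helper, not something from the PostgreSQL build), Python can test whether a code point has an ISO-8859-1 representation and show its UTF-8 bytes:

```python
def fits_latin1(codepoint: int) -> bool:
    """True if the character with this code point can be encoded in ISO-8859-1."""
    try:
        chr(codepoint).encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# 252 is u-umlaut, which Latin-1 does contain; 353 and 305 are the numbers at issue.
for number in (252, 353, 305):
    utf8 = chr(number).encode("utf-8")
    print(number, "fits Latin-1:", fits_latin1(number), "UTF-8 bytes:", utf8.hex(" "))
```

Both 353 and 305 fail the Latin-1 test and take two bytes each in UTF-8, which matches the "multi-byte" description.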

-- 
  Bruce Momjian   [email protected]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +




* Re: This approach to non-ASCII names does not work
@ 2006-09-19 19:24  Bruce Momjian <[email protected]>
  parent: Bruce Momjian <[email protected]>
  0 siblings, 0 replies; 16+ messages in thread

From: Bruce Momjian @ 2006-09-19 19:24 UTC (permalink / raw)
  To: Bruce Momjian <[email protected]>; +Cc: Tom Lane <[email protected]>; pgsql-docs

Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <[email protected]> writes:
> > > Tom Lane wrote:
> > >> openjade:release.sgml:567:14:E: "353" is not a character number in the document character set
> > 
> > > Wow, our documentation character set is "ISO-8859-1":
> > > 	CONTENT="text/html; charset=ISO-8859-1"
> > > Should we change it to UTF8?
> > 
> > I'm betting you should change those numbers from octal to decimal,
> > actually.
> 
> Those numbers are decimal, but certainly cannot be represented in
> ISO-8859-1.  They are multi-byte, one is Turkish.

Actually, I got the codes from here:

	http://www.pemberley.com/janeinfo/latin1.html#latexta

-- 
  Bruce Momjian   [email protected]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +




* Re: This approach to non-ASCII names does not work
@ 2006-09-20 19:06  Peter Eisentraut <[email protected]>
  parent: Tom Lane <[email protected]>
  1 sibling, 1 reply; 16+ messages in thread

From: Peter Eisentraut @ 2006-09-20 19:06 UTC (permalink / raw)
  To: pgsql-docs; +Cc: Tom Lane <[email protected]>; Bruce Momjian <[email protected]>

Tom Lane wrote:
> I'm betting you should change those numbers from octal to decimal,
> actually.

I suggest using named entities like &uuml;.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/




* Re: This approach to non-ASCII names does not work
@ 2006-09-20 20:25  Bruce Momjian <[email protected]>
  parent: Peter Eisentraut <[email protected]>
  0 siblings, 2 replies; 16+ messages in thread

From: Bruce Momjian @ 2006-09-20 20:25 UTC (permalink / raw)
  To: Peter Eisentraut <[email protected]>; +Cc: pgsql-docs; Tom Lane <[email protected]>

Peter Eisentraut wrote:
> Tom Lane wrote:
> > I'm betting you should change those numbers from octal to decimal,
> > actually.
> 
> I suggest using named entities like &uuml;.

Yes, I use them where possible.  I use:

	http://www.mountaindragon.com/html/iso.htm

for named cases, but for the ones that don't have names, I have to use
UTF8 numbers:

	http://www.pemberley.com/janeinfo/latin1.html#latexta

The case that I needed was "Latin Small Letter Dotless I", which has no
name on the first URL.

The unusual thing is that though our docs web pages use a stated
encoding as ISO-8859-1, the UTF8 number does generate the proper symbol
in my browser (Mozilla), so I wonder if >255 codes are assumed to be
UTF8.

-- 
  Bruce Momjian   [email protected]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +




* Re: This approach to non-ASCII names does not work
@ 2006-09-20 20:29  Tom Lane <[email protected]>
  parent: Bruce Momjian <[email protected]>
  1 sibling, 1 reply; 16+ messages in thread

From: Tom Lane @ 2006-09-20 20:29 UTC (permalink / raw)
  To: Bruce Momjian <[email protected]>; +Cc: Peter Eisentraut <[email protected]>; pgsql-docs

Bruce Momjian <[email protected]> writes:
> Yes, I use them where possible.  I use:
> 	http://www.mountaindragon.com/html/iso.htm

... which says right on it that it considers only ISO 8859/1 and is not
a complete list even of that set.

I assume that somewhere there is a Web-related spec of the widely
recognized entity names, but I see no reason to suppose that this list
is it.  Something at w3c, say, would have a tad more credibility.

			regards, tom lane




* Re: This approach to non-ASCII names does not work
@ 2006-09-20 20:35  Alvaro Herrera <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 1 reply; 16+ messages in thread

From: Alvaro Herrera @ 2006-09-20 20:35 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: Bruce Momjian <[email protected]>; Peter Eisentraut <[email protected]>; pgsql-docs

Tom Lane wrote:
> Bruce Momjian <[email protected]> writes:
> > Yes, I use them where possible.  I use:
> > 	http://www.mountaindragon.com/html/iso.htm
> 
> ... which says right on it that it considers only ISO 8859/1 and is not
> a complete list even of that set.
> 
> I assume that somewhere there is a Web-related spec of the widely
> recognized entity names, but I see no reason to suppose that this list
> is it.  Something at w3c, say, would have a tad more credibility.

Maybe this:

http://www.w3.org/TR/html4/sgml/entities.html

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.




* Re: This approach to non-ASCII names does not work
@ 2006-09-20 20:37  Tom Lane <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  0 siblings, 1 reply; 16+ messages in thread

From: Tom Lane @ 2006-09-20 20:37 UTC (permalink / raw)
  To: Alvaro Herrera <[email protected]>; +Cc: Bruce Momjian <[email protected]>; Peter Eisentraut <[email protected]>; pgsql-docs

Alvaro Herrera <[email protected]> writes:
> Tom Lane wrote:
>> I assume that somewhere there is a Web-related spec of the widely
>> recognized entity names, but I see no reason to suppose that this list
>> is it.  Something at w3c, say, would have a tad more credibility.

> Maybe this:
> http://www.w3.org/TR/html4/sgml/entities.html

Also, I just found this in the XHTML 1.0 spec:
http://www.w3.org/TR/xhtml1/dtds.html#a_dtd_Latin-1_characters

			regards, tom lane




* Re: This approach to non-ASCII names does not work
@ 2006-09-20 20:43  Alvaro Herrera <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 0 replies; 16+ messages in thread

From: Alvaro Herrera @ 2006-09-20 20:43 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: Bruce Momjian <[email protected]>; Peter Eisentraut <[email protected]>; pgsql-docs

Tom Lane wrote:
> Alvaro Herrera <[email protected]> writes:
> > Tom Lane wrote:
> >> I assume that somewhere there is a Web-related spec of the widely
> >> recognized entity names, but I see no reason to suppose that this list
> >> is it.  Something at w3c, say, would have a tad more credibility.
> 
> > Maybe this:
> > http://www.w3.org/TR/html4/sgml/entities.html
> 
> Also, I just found this in the XHTML 1.0 spec:
> http://www.w3.org/TR/xhtml1/dtds.html#a_dtd_Latin-1_characters

Neither seems to list a "dotless i" :-(
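
One way to double-check a lookup like this, offered only as a sketch: Python's standard `html.entities.name2codepoint` dictionary maps the HTML 4 entity names to code points, so it can be searched by number. The helper name `html4_names_for` is hypothetical.

```python
from html.entities import name2codepoint


def html4_names_for(codepoint: int):
    """Return the HTML 4 entity names (if any) defined for a code point."""
    return sorted(name for name, cp in name2codepoint.items() if cp == codepoint)


print(html4_names_for(252))  # u with diaeresis
print(html4_names_for(353))  # s with caron
print(html4_names_for(305))  # dotless i
```

Under that table, 252 and 353 each have a name (`uuml`, `scaron`), while 305 comes back empty, consistent with neither W3C list having an entry for dotless i.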

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.




* Re: This approach to non-ASCII names does not work
@ 2006-09-20 21:10  Tom Lane <[email protected]>
  0 siblings, 1 reply; 16+ messages in thread

From: Tom Lane @ 2006-09-20 21:10 UTC (permalink / raw)
  To: Bruce Momjian <[email protected]>; +Cc: Peter Eisentraut <[email protected]>; pgsql-docs

Bruce Momjian <[email protected]> writes:
> Interesting, I found this for that character:
> 	http://www.fileformat.info/info/unicode/char/0131/index.htm
> Turns out that number is the right entity.  Seems they have numbers that
> match UTF16/UTF32 values.  So are we OK?

No, we are not, because the docs don't build for anyone who has pickier
SGML tools than the ancient laissez-faire toolchain you seem to be using.
HEAD currently gives me

openjade -V draft-mode -wall -wno-unused-param -wno-empty -D . -c /usr/share/sgml/docbook/dsssl-stylesheets/catalog -d stylesheet.dsl -i output-html -t sgml postgres.sgml
openjade:ddl.sgml:2581:51:E: document type does not allow element "SECT2" here
openjade:ddl.sgml:2646:39:E: document type does not allow element "SECT2" here
openjade:ddl.sgml:2706:52:E: document type does not allow element "SECT2" here
openjade:ddl.sgml:2848:8:E: end tag for "SECT2" omitted, but OMITTAG NO was specified
openjade:ddl.sgml:2317:3: start tag was here
openjade:release.sgml:572:14:E: "353" is not a character number in the document character set
openjade:release.sgml:1091:56:E: "305" is not a character number in the document character set
openjade:release.sgml:1091:63:E: "305" is not a character number in the document character set
openjade:release.sgml:1505:35:E: "305" is not a character number in the document character set
openjade:release.sgml:1505:42:E: "305" is not a character number in the document character set
openjade:release.sgml:1670:38:E: "305" is not a character number in the document character set
openjade:release.sgml:1670:45:E: "305" is not a character number in the document character set
make: *** [html] Error 1

I don't believe in ignoring compiler warnings, and I don't believe in
ignoring these problems either.
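
For anyone whose toolchain does not report these errors, a small pre-check along the following lines could flag the same references before committing. This is a sketch under stated assumptions, not part of the PostgreSQL build: the function name, the 255 limit (taken from the Latin 1 document character set described elsewhere in this thread), and the sample text are all illustrative, and hexadecimal references are not handled.

```python
import re

# Decimal numeric character references such as &#305;
NUMERIC_REF = re.compile(r"&#(\d+);")


def out_of_range_refs(text: str, limit: int = 255):
    """Return (line, column, number) for each decimal reference above `limit`."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in NUMERIC_REF.finditer(line):
            number = int(match.group(1))
            if number > limit:
                # Columns are 1-based and point at the "&".
                hits.append((lineno, match.start() + 1, number))
    return hits


# Illustrative input, not a copy of release.sgml.
sample = "Yaz&#305;c&#305;\nM&#252;ller\n"
for lineno, column, number in out_of_range_refs(sample):
    print(f"line {lineno}, column {column}: {number} is above 255")
```

To scan a real file, the text could be read with `open(path, encoding="iso-8859-1").read()` and passed to the same function.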

			regards, tom lane




* Re: This approach to non-ASCII names does not work
@ 2006-09-20 21:38  Tom Lane <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 1 reply; 16+ messages in thread

From: Tom Lane @ 2006-09-20 21:38 UTC (permalink / raw)
  To: Bruce Momjian <[email protected]>; +Cc: Peter Eisentraut <[email protected]>; pgsql-docs

The HTML specs do include the other character at issue:

<!ENTITY scaron  "&#353;"> <!--  latin small letter s with caron,
                                    U+0161 ISOlat2 -->

I suggest we use that where needed and spell dotless i as plain i.
(Sorry, Volkan :-( ... but your beef is with the HTML standards
not us.)

			regards, tom lane




* Re: This approach to non-ASCII names does not work
@ 2006-09-20 21:47  Peter Eisentraut <[email protected]>
  parent: Bruce Momjian <[email protected]>
  1 sibling, 1 reply; 16+ messages in thread

From: Peter Eisentraut @ 2006-09-20 21:47 UTC (permalink / raw)
  To: Bruce Momjian <[email protected]>; +Cc: pgsql-docs; Tom Lane <[email protected]>

Bruce Momjian wrote:
> The unusual thing is that though our docs web pages use a stated
> encoding as ISO-8859-1, the UTF8 number does generate the proper
> symbol in my browser (Mozilla), so I wonder if >255 codes are assumed
> to be UTF8.

These are two different things.

A numeric character reference picks the numbered character from the 
document character set.  The document character set is declared in the 
document type declaration (and is therefore fixed by the standards 
committee for all users).  The document character sets for commonly 
used SGML applications are:

HTML 3.2	Latin 1 (ISO 646 + ECMA 94)
HTML 4+		UCS (ISO 10646)
XML		UCS (ISO 10646)
DocBook SGML	Latin 1 (ISO 646 + ECMA 94)

If a font is available, an HTML application (browser) should be able to 
process (display) any character from the document character set, 
whether it arrives in plain or as a character entity.

Conversely, a character not in the document character set, such as a 
non-Latin-1 character in DocBook SGML, cannot be processed, strictly 
speaking.

The other thing you are talking about is the character *encoding* which 
specifies how the sequence of bytes that makes up the document is to be 
interpreted.  Note that this happens before the document character set 
is taken into consideration and is pretty much independent of it.  For 
example, knowledge of the character encoding is necessary to find 
the "&" that starts entities.  Not all character encodings are capable 
of encoding all characters in the document character set, which is why 
you need to use character entities to access characters outside the 
encoding.
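
The two stages described above can be modelled in a few lines. This is a deliberately simplified Python sketch, not how openjade or any browser is implemented: the `parse` function, the constant names, and the use of 0x10FFFF as the upper bound for UCS are assumptions made for illustration, and only decimal references are handled.

```python
import re

NUMERIC_REF = re.compile(r"&#(\d+);")

# Illustrative upper bounds for the document character sets in the table above.
LATIN1_MAX = 0xFF      # HTML 3.2, DocBook SGML
UCS_MAX = 0x10FFFF     # HTML 4+, XML (assumed bound)


def parse(raw: bytes, encoding: str, charset_max: int) -> str:
    # Stage 1: the character encoding turns the byte sequence into characters.
    # Only after this step can the "&" that starts a reference be found.
    text = raw.decode(encoding)

    # Stage 2: each numeric reference picks a character from the
    # document character set, independent of the encoding used in stage 1.
    def resolve(match):
        number = int(match.group(1))
        if number > charset_max:
            raise ValueError(f"{number} is not in the document character set")
        return chr(number)

    return NUMERIC_REF.sub(resolve, text)


raw = "Yaz&#305;c&#305;".encode("iso-8859-1")

# Same bytes, same encoding; only the document character set differs.
print(parse(raw, "iso-8859-1", UCS_MAX))
try:
    parse(raw, "iso-8859-1", LATIN1_MAX)
except ValueError as err:
    print("rejected:", err)
```

With the UCS bound the reference resolves even though the bytes are ISO-8859-1, which is the browser behaviour observed earlier in the thread; with the Latin 1 bound the same input is rejected, which is the DocBook SGML case.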

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/




* Re: This approach to non-ASCII names does not work
@ 2006-09-20 22:48  Bruce Momjian <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 0 replies; 16+ messages in thread

From: Bruce Momjian @ 2006-09-20 22:48 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: Peter Eisentraut <[email protected]>; pgsql-docs

Tom Lane wrote:
> The HTML specs do include the other character at issue:
> 
> <!ENTITY scaron  "&#353;"> <!--  latin small letter s with caron,
>                                     U+0161 ISOlat2 -->

Release notes updated to use scaron.

-- 
  Bruce Momjian   [email protected]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +





* Re: This approach to non-ASCII names does not work
@ 2006-09-22 17:17  Bruce Momjian <[email protected]>
  parent: Peter Eisentraut <[email protected]>
  0 siblings, 0 replies; 16+ messages in thread

From: Bruce Momjian @ 2006-09-22 17:17 UTC (permalink / raw)
  To: Peter Eisentraut <[email protected]>; +Cc: pgsql-docs; Tom Lane <[email protected]>; [email protected]


That makes a lot of sense.  The encoding mentioned in the HTML is how
high-bit characters are treated in the HTML, and doesn't control what
entities it supports.

However, I am confused how non-Latin users can use SGML if it does not
support UTF8 entities.  I see this flag in openjade:

	  -b, --encoding=NAME         Use encoding NAME for output.

but I assume it is only for how to treat the high bits in the file, not
for entity recognition.

I IM'ed with Peter and he said DocBook SGML just doesn't support UTF8
easily, so I am reverting Volkan YAZICI's name to ASCII (he requested
an all-uppercase last name if we can't use the proper symbol).  I have
also documented that we can only use HTML4 entities, and updated the
URLs we should use for reference.  I have the official URL and URLs
that show the actual symbols too, which is helpful.

If people have names that contain HTML4 symbols, please let me know so I
can add the symbols:

	http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html

---------------------------------------------------------------------------

Peter Eisentraut wrote:
> Bruce Momjian wrote:
> > The unusual thing is that though our docs web pages use a stated
> > encoding as ISO-8859-1, the UTF8 number does generate the proper
> > symbol in my browser (Mozilla), so I wonder if >255 codes are assumed
> > to be UTF8.
> 
> These are two different things.
> 
> A numeric character reference picks the numbered character from the 
> document character set.  The document character set is declared in the 
> document type declaration (and is therefore fixed by the standards 
> committee for all users).  The document character sets for commonly 
> used SGML applications are:
> 
> HTML 3.2	Latin 1 (ISO 646 + ECMA 94)
> HTML 4+		UCS (ISO 10646)
> XML		UCS (ISO 10646)
> DocBook SGML	Latin 1 (ISO 646 + ECMA 94)
> 
> If a font is available, an HTML application (browser) should be able to 
> process (display) any character from the document character set, 
> whether it arrives in plain or as a character entity.
> 
> Conversely, a character not in the document character set, such as a 
> non-Latin-1 character in DocBook SGML, cannot be processed, strictly 
> speaking.
> 
> The other thing you are talking about is the character *encoding* which 
> specifies how the sequence of bytes that makes up the document is to be 
> interpreted.  Note that this happens before the document character set 
> is taken into consideration and is pretty much independent of it.  For 
> example, knowledge of the character encoding is necessary to find 
> the "&" that starts entities.  Not all character encodings are capable 
> of encoding all characters in the document character set, which is why 
> you need to use character entities to access characters outside the 
> encoding.
> 
> -- 
> Peter Eisentraut
> http://developer.postgresql.org/~petere/
> 

-- 
  Bruce Momjian   [email protected]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +





