Subject: Re: Docbook 5.x
To: pgsql-docs@postgresql.org
References: <57179283.6080704@purtz.de>
 <20160503193441.GA61759@alvherre.pgsql> <572AD007.60900@gmail.com>
 <5752E599.2090505@gmail.com>
 <576d0623-a89c-b3de-e321-dc48a579ff1a@2ndquadrant.com>
 <4adecfc6-2f2e-2ff2-bfa3-58b7d397227b@gmail.com>
 <8f227b2a-5093-8d99-85da-ea00e18343f6@gmail.com>
 <449e34c4-9cc8-d17d-5ebe-be92b4c0a87a@gmail.com>
From: =?UTF-8?Q?J=c3=bcrgen_Purtz?=
Message-ID: <1e26418b-236a-8669-7f8d-5d62b2da5cac@purtz.de>
Date: Tue, 28 Feb 2017 10:50:55 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.7.0
MIME-Version: 1.0
In-Reply-To: <449e34c4-9cc8-d17d-5ebe-be92b4c0a87a@gmail.com>
Content-Type: multipart/mixed; boundary="------------BC433126846F39EFFB97CF35"
X-Mailing-List: pgsql-docs
Sender: pgsql-docs-owner@postgresql.org

This is a multi-part message in MIME format.
--------------BC433126846F39EFFB97CF35
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit

Hello Alexander,

On 28.02.2017 09:55, Alexander Law wrote:
> Hello Peter,
>
> 16.11.2016 14:30, Alexander Law wrote:
>> So It seems, now we can continue the move to XML.
>> I'd suggest to move in several steps.
>> Please see the attached scripts.
>> The main script is 7_check_conversion.sh. It performs all the
>> conversion and checks whether the html output is the same.
>> I suggest to split conversion in three commits.
>> Commit#0 is for manual corrections - it replaces "<" with "&lt;" and
>> so on in some sgml's and it doesn't affect the build or outputs.
>> It needed just for the next step - automatic conversion.
>> These changes to sgml's are countable and observable.
>>
>> Commit#1 performs conversion of all SGML's to make them compatible
>> with XML (as much as possible). (Thanks to Jurgen for his sgml2xml.pl
>> script.)
>> These changes to sgml's are massive, but they are produced
>> automatically, so we need just to check the script and make sure that
>> the output is the same.
>> After that commit we still can use SGML build.
>>
>> And the last commit, commit#2 is for switching to XML. At that point
>> doc/src/sgml renamed to doc/src/xml, build environment modified and
>> cleaned, but changes in sgml/xml are minimal, so we can observe and
>> check them.
>>
>> After the commit#2 we get all our docs in XML (DocBook 4.2) and can
>> build it just as we did with 'make html/man/...' before.
>> Maybe the commit#2 should be applied later, but commits #0 and #1 are
>> not intrusive and can be applied anytime.
> I've rebased previous patches for the current "10devel" version.
> Will we continue move to DocBook.XML?
> Are there any obstacles that may keep us from moving forward?
>
> Best regards,
> Alexander
>
The time gap between commit#1 and commit#2 should be kept small. As long
as the sources are still SGML, people may add further empty elements and
shorttags, which SGML permits and which would then have to be converted
again.

The attached version of sgml2xml.pl has been cleaned up: unused variables
are removed and some comments are revised.

Kind regards, Jürgen

--------------BC433126846F39EFFB97CF35
Content-Type: application/x-perl;
 name="sgml2xml.pl"
Content-Transfer-Encoding: 8bit
Content-Disposition: attachment;
 filename="sgml2xml.pl"

#!/usr/bin/perl

# Conversion of PostgreSQL documentation from Docbook 4.2 sgml-format into
# Docbook 4.2 xml-format.
#
# Based on a script from Jürgen Purtz
#
# The script expands the SGML constructs 'shorttags' and 'empty elements'. Additionally
# it handles one special postgres case.
#

use strict;
use warnings;
use autodie;  # die if problem reading or writing a file

# --------------   Input   -------------
# read complete STDIN (slurp mode)
my $content = do { local $/; <> };

# The special postgres case: XML is case sensitive, so use the lower-case attribute value.
$content =~ s/ class="PARAMETER"/ class="parameter"/g;

# --------------   Empty (per definition in DTD) elements   --------------
# List of 'empty' elements in Docbook. They don't need to have an end tag,
# so in the SGML sources they carry neither an end tag nor a closing '/>'.
# Close them considering line breaks. Afaik PostgreSQL uses only 'xref', 'co' and 'footnoteref'.
# In addition to the Docbook elements we handle the colspec and spanspec elements of cals tables.
my $emptyElements = 'anchor|area|audiodata|beginpage|co|coref|footnoteref|graphic|imagedata|inlinegraphic|sbr|' .
                    'textdata|varargs|videodata|void|xref|colspec|spanspec';

# As one of the following steps we use the tool 'osx'. osx tries to close the empty tags again, which results in
# unwanted additional - and in some cases invalid - CDATA. As long as osx is used we must use the long
# notation of empty elements.
$content =~ s/<(($emptyElements)\s+.*?)>/<$1\/>/sg;  # some are closed, others not.

# ---------------   Shorttags   ------------------------
# Prevent replacing tags in comments: inside SGML comments, mask '<' as ° and '>' as §.
$content =~ s/<!(--.*?--)>/"<!".($1 =~ s!<!°!sgr).">"/sge;
$content =~ s/<!(--.*?--)>/"<!".($1 =~ s!>!§!sgr).">"/sge;

# Construct an expression which matches tags and the ACCORDING shorttag: "<tag>...</>"
# The idea is to handle the tree of nodes from its leaves to the top with
# one s/.../.../g command per level.
# Don't use a greedy pattern. We must match the nearest shorttag or end-tag.

# Define the pattern for (multiple) attributes: whitespace, then any string up to > or />
my $attr = '(\s+(((?!>)(?!/>).)+?))?';

# Define the pattern for shorttags.
my $regex;
$regex = qr/
    <(\w[\w-]*?)(${attr})>         # regular start-tag. Catch tagname as $1 and attributes as $2
    (?'content'                    # capture the content as named group 'content'
      (((                          # negative look ahead:
          (?!<[\w-]+?${attr}>)     #   not a regular start-tag
          (?!<[\w-]+?${attr}\/>)   #   not an empty tag
          (?!<\/[\w-]+?>)          #   not a regular end-tag
          (?!<\/>)                 #   not a shorttag
        ).                         # move forward
       ){0,32000}+
      )*+                          # to overcome the Perl 32K limit, it's necessary to split
                                   # the content into many chunks. Possessive quantifiers speed
                                   # up performance.
    )                              #
    (<\/([\w-]+?)?>)               # followed by a shorttag or a regular end-tag
/xs;

# Perform the expansion of shorttags. Because of the recursive nature of the node tree, it's necessary
#   a) to work with a loop which processes the tree from leaf nodes to root node
#   b) to convert the matching shorttags to some form of regular content, which differs
#      from SGML/XML-syntax. We use ° and § as they do not occur in the PostgreSQL docs.
# (There is a way to match recursive REs - but not to replace them, afaik.)

# the loop
while ($content =~ s/$regex/°$1$2§$+{content}°\/$1§/sg) {};

# restore the SGML/XML syntax
$content =~ s/°/</g;
$content =~ s/§/>/g;

print $content;

--------------BC433126846F39EFFB97CF35
Content-Type: text/plain
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0


--
Sent via pgsql-docs mailing list (pgsql-docs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-docs

--------------BC433126846F39EFFB97CF35--