[M3devel] XML?

hendrik at topoi.pooq.com hendrik at topoi.pooq.com
Thu Jun 25 02:06:40 CEST 2009


On Thu, May 07, 2009 at 10:58:25PM -0400, hendrik at topoi.pooq.com wrote:
> On Thu, May 07, 2009 at 05:02:41PM -0700, Daniel Alejandro Benavides D. wrote:
> > Hi:
> > You can take a  look of the originally pm3 SGML parser that could work 
> > for your need: cm3/m3-libs/sgml see on 
> > http://opencm3.net/doc/help/gen_html/sgml/INDEX.html
> 
> This one does SGML, which XML is compatible with, but not the same as.  
> There was a big effort to make sure that SGML and XML had a very viable 
> intersection (that's what they used to write the standard).  But SGML 
> has a lot of conventions whereby you can leave out tags.  XML does not.
> I gather a lot of this might have to be handled by the user's 
> Application object class.
> 
> Will look further.

Looked further.  The original PM3 parser looked way more complicated 
than necessary, which I attributed to having powerful tools available 
and showing them off.  I decided to do something simpler, and XML parsing 
is a *lot* simpler than that.

But I got curious and looked at SGML.  (After all, some of the stuff I 
have to process isn't XML at all, but just plain formatted ASCII text 
with a few <i> tags in it to indicate italics (where some would use 
*asterisks*)).  And I discovered the following.

Superficially, SGML has tags like <p> which match </p>.  Lots of 
brackets which have to match up.  Kind of like XML.  It even has a Document 
Type Definition like XML's.  (In fact XML copied the DTD from SGML for 
compatibility.)  The DTD is obviously useful to screen incoming texts to 
make sure they satisfy a structural specification demanded by an 
application.

Here the similarity ends.  It turns out that in SGML you can leave out 
tags -- starters or enders, or even *both*, as long as that does not 
cause ambiguity.  And *ambiguity* is interpreted in the context of a 
DTD, which specifies the grammar of the SGML file.

This effectively converts a recursive tree walk into a parsing problem.  
The need for a DTD effectively means that you have to run a 
parser generator on the DTD before you can start with the actual text.
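
A rough sketch of why the grammar matters, in Python rather than Modula-3.  
It is not taken from the PM3 code.  The ALLOWED_CHILDREN table is a 
hypothetical, hand-written stand-in for what a real DTD would supply, and 
the function only infers omitted *end* tags; omitted start tags and full 
content models need considerably more machinery.

```python
import re

# Hypothetical stand-in for a DTD: which children each element may contain.
ALLOWED_CHILDREN = {
    "ul": {"li"},
    "li": {"i"},
    "i": set(),
}

# A bare start tag, a bare end tag, or a run of character data.
TOKEN = re.compile(r"<(/?)(\w+)>|([^<]+)")

def close_omitted_tags(text):
    """Return text with every omitted end tag written out explicitly."""
    out = []
    stack = []  # currently open elements, innermost last
    for match in TOKEN.finditer(text):
        closing, name, data = match.groups()
        if data is not None:
            out.append(data)
        elif closing:
            # Close inner elements whose end tags were left out.
            while stack and stack[-1] != name:
                out.append("</%s>" % stack.pop())
            if not stack:
                raise ValueError("unmatched end tag: " + name)
            stack.pop()
            out.append("</%s>" % name)
        else:
            # Close open elements until one of them may contain `name`.
            while stack and name not in ALLOWED_CHILDREN.get(stack[-1], set()):
                out.append("</%s>" % stack.pop())
            stack.append(name)
            out.append("<%s>" % name)
    # End of input closes whatever is still open.
    while stack:
        out.append("</%s>" % stack.pop())
    return "".join(out)

print(close_omitted_tags("<ul><li>one<li>two</ul>"))
```

The second <li> can only be placed correctly because the table says an 
<li> may not contain another <li>.  Without that knowledge the input is 
just an unbalanced pile of brackets.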

That bit of code from pm3 was pulling out all the heavy tools because it 
couldn't manage without them!

Apparently the world abounds in SGML processors that get details 
wrong, perhaps because they don't go about it with enough 
sophistication.  Writing an SGML parser is a significant intellectual 
effort.  Writing an XML parser (without enforcing strict conformance on 
the parsed documents) is, by comparison, like falling off a 
log.
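
For comparison, a minimal sketch of the non-conforming XML case, again in 
Python and again only an illustration under assumptions of my own: the 
function name parse, the "#document" root, and the tuple layout are all 
made up here.  Attributes are skipped, and entities, comments, CDATA and 
processing instructions are not handled at all.  Since every tag is 
explicit, a stack is all the state needed.

```python
import re

# A start/end/empty-element tag (attributes skipped), or character data.
TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>|([^<]+)")

def parse(text):
    """Build a nested (name, children) tree; children are tuples or strings."""
    root = ("#document", [])
    stack = [root]  # open elements, innermost last
    for match in TOKEN.finditer(text):
        closing, name, empty, data = match.groups()
        if data is not None:
            stack[-1][1].append(data)
        elif closing:
            # An end tag must match the innermost open element.
            if len(stack) == 1 or stack[-1][0] != name:
                raise ValueError("mismatched end tag: " + name)
            stack.pop()
        else:
            node = (name, [])
            stack[-1][1].append(node)
            if not empty:  # <br/> style tags have nothing to close
                stack.append(node)
    if len(stack) != 1:
        raise ValueError("unclosed element: " + stack[-1][0])
    return root

print(parse("plain text with <i>a few</i> tags"))
```

The same loop copes with plain text carrying a few <i> tags, since text 
outside any element simply lands in the root's child list.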

-- hendrik


