[M3devel] XML?
hendrik at topoi.pooq.com
Thu Jun 25 02:06:40 CEST 2009
On Thu, May 07, 2009 at 10:58:25PM -0400, hendrik at topoi.pooq.com wrote:
> On Thu, May 07, 2009 at 05:02:41PM -0700, Daniel Alejandro Benavides D. wrote:
> > Hi:
> > You can take a look of the originally pm3 SGML parser that could work
> > for your need: cm3/m3-libs/sgml see on
> > http://opencm3.net/doc/help/gen_html/sgml/INDEX.html
>
> This one does SGML, which XML is compatible with, but not the same as.
> There was a big effort to make sure that SGML and XML had a very viable
> intersection (that's what they used to write the standard). But SGML
> has a lot of conventions whereby you can leave out tags. XML does not.
> I gather a lot of this might have to be handled by the user's
> Application object class.
>
> Will look further.
Looked further. The original PM3 parser looked way more complicated
than necessary, which I attributed to having powerful tools available
and showing them off. I decided to do something simpler, and XML parsing
is a *lot* simpler than that.
But I got curious and looked at SGML. (After all, some of the stuff I
have to process isn't XML at all, but just plain formatted ASCII text
with a few <i> tags in it to indicate italics (where some would use
*asterisks*)). And I discovered the following.
Superficially, SGML has tags like <p> which match </p>. Lots of
brackets which have to match up. Kind of like XML. It even has a
Document Type Definition like XML's. (In fact XML copied the DTD from
SGML for compatibility.) The DTD is obviously useful to screen incoming
texts to make sure they satisfy a structural specification demanded by
an application.
Here the similarity ends. It turns out that in SGML you can leave out
tags -- starters or enders, or even *both*, as long as that does not
cause ambiguity. And *ambiguity* is interpreted in the context of a
DTD, which specifies the grammar of the SGML file.
This effectively converts a recursive tree walk into a parsing problem.
The need for a DTD effectively means that you have to run a
parser generator on the DTD before you can start with the actual text.
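To illustrate why the grammar is needed, here is a minimal sketch in
Python (not taken from the pm3 code). It fills in omitted end tags only,
and the table ALLOWED_CHILDREN is a hypothetical, hand-written stand-in
for what a real DTD would supply. Omitted start tags and full SGML
content models are well beyond it.

```python
import re

# Hypothetical stand-in for a DTD: for each element, the child
# elements it may directly contain. A real DTD says much more.
ALLOWED_CHILDREN = {
    "doc": {"p", "ul"},
    "ul": {"li"},
    "li": set(),
    "p": set(),
}

def infer_end_tags(text):
    """Return text with omitted end tags filled in (sketch only)."""
    out, stack = [], []
    for token in re.findall(r"</?\w+>|[^<]+", text):
        if token.startswith("</"):
            name = token[2:-1]
            # Close anything still open inside the element being ended.
            while stack and stack[-1] != name:
                out.append("</%s>" % stack.pop())
            if stack:
                stack.pop()
                out.append(token)
        elif token.startswith("<"):
            name = token[1:-1]
            # If the open element cannot contain the new one, its end
            # tag must have been omitted, so close it and try again.
            while stack and name not in ALLOWED_CHILDREN.get(stack[-1], set()):
                out.append("</%s>" % stack.pop())
            stack.append(name)
            out.append(token)
        else:
            out.append(token)  # character data passes through
    while stack:
        out.append("</%s>" % stack.pop())
    return "".join(out)

print(infer_end_tags("<doc><p>one<p>two<ul><li>a<li>b</doc>"))
# prints <doc><p>one</p><p>two</p><ul><li>a</li><li>b</li></ul></doc>
```

Without the table there is no way to tell whether the second <p> is
nested inside the first or follows it. That decision is exactly what
the DTD supplies.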
That bit of code from pm3 was pulling out all the heavy tools because it
couldn't manage without them!
Apparently the world abounds in SGML processors and readers that get
details wrong, perhaps because they don't go about it with enough
sophistication. Writing an SGML parser is a significant intellectual
effort. Writing an XML parser (without enforcing strict conformance on
the parsed documents) is, by comparison, like falling off a log.
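For comparison, a non-validating tree builder for a tiny XML subset
fits in a couple of dozen lines. This is a sketch in Python under loud
assumptions: no attributes, comments, entities, CDATA or processing
instructions, and the names TOKEN and parse are made up for the
illustration. An explicit stack takes the place of the recursion.

```python
import re

# One token is either a tag (<x>, </x>, <x/>) or a run of character data.
TOKEN = re.compile(r"<(/?)(\w+)\s*(/?)>|([^<]+)")

def parse(text):
    """Parse attribute-free XML into nested (name, children) tuples."""
    root = ("#root", [])
    stack = [root]
    for close, name, empty, chars in TOKEN.findall(text):
        if chars:
            if chars.strip():            # skip whitespace-only text
                stack[-1][1].append(chars)
        elif close:
            # Every end tag must match the innermost open element.
            if stack[-1][0] != name:
                raise ValueError("unexpected </%s>" % name)
            stack.pop()
        else:
            node = (name, [])
            stack[-1][1].append(node)
            if not empty:                # <x/> has no content to collect
                stack.append(node)
    if len(stack) != 1:
        raise ValueError("unclosed <%s>" % stack[-1][0])
    return root[1]

print(parse("<a><i>x</i> y<br/></a>"))
# prints [('a', [('i', ['x']), ' y', ('br', [])])]
```

Because every tag is present in the input, matching them up needs no
grammar at all, only the stack.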
-- hendrik