<html>

<head>

<style>

.hmmessage P

{

margin:0px;

padding:0px

}

body.hmmessage

{

FONT-SIZE: 10pt;

FONT-FAMILY:Tahoma

}

</style>

</head>

<body class='hmmessage'>aka "prefix tree"<BR><BR>

if I want to recognize the strings, foo, food, fudge, and bar<BR>

look at the first character<BR>

  if it is 'b', go ahead and compare to bar (or the tail to "ar")<BR>

  if it is 'f'<BR>

   look at the second character<BR>

     if it is 'u', go ahead and compare to "fudge" (or the tail to "dge")<BR>

     if it 'o'<BR>

        compare the third character<BR>

        and so on<BR>

 <BR>

Whether or not the data "stops" at unique prefixes is an implementation detail.<BR>

If it does, the leaves can contain the entire string or just the tail.<BR>

<BR>I've fiddled with generating large nested switch statements from a list of strings.<BR>

I noticed there are many optimizations you can make.<BR>

Of course, first you check the length.<BR>

Then, really, instead of progressing from left to right, you can pick out characters with a higher switching factor.<BR>

For example if I have the strings foof and food, just look at the last character.<BR>

 <BR>

As well a hash based algorithm is probably best.<BR>

Precompute the hashes of all the strings.<BR>

Sor them.<BR>

Hash your input string.<BR>

Binary search for the hash.<BR>

And the compare for an exact match.<BR>

 <BR>

If you are going to do the final compare anyway, then you are going to touch the whole string anyway, you might as well hash it and do a binary search I think. Make sure the hash function is fast of course, and not "perfect".<BR>

 <BR>

 - Jay<BR>

 <BR>


<HR id=stopSpelling>

<BR>

> Date: Mon, 18 Feb 2008 20:35:36 +0100<BR>> From: wagner@elegosoft.com<BR>> To: mika@async.caltech.edu<BR>> CC: m3devel@elegosoft.com<BR>> Subject: Re: [M3devel] code submission<BR>> <BR>> Quoting Mika Nystrom <mika@async.caltech.edu>:<BR>> <BR>> > Hello everyone,<BR>> ><BR>> > I was looking at some code I've written and realized that I have a<BR>> > very small piece of tested code that may be of interest to other<BR>> > Modula-3 users. It's a generic "trie" that I coded a few years<BR>> > ago. Does anyone have an opinion on adding it to cm3? Where, if so?<BR>> ><BR>> > http://www.async.caltech.edu/~mika/trie/<BR>> ><BR>> > It's useful for parsing; if you're parsing a language with longish<BR>> > keywords it is many times faster to use this structure and ARRAY<BR>> > OF CHAR than using the standard hash tables with TEXTs. (I developed<BR>> > it to parse an old "punch-card", i.e., 80-character fixed-width<BR>> > records, format for financial data.)<BR>> <BR>> Of course we could add this as a small package to m3-libs, or<BR>> even to libm3 (it looks rather small).<BR>> <BR>> I hate to sound so uneducated, but what exactly does `trie' stand for?<BR>> It's not contained in my favourite online dictionary either.<BR>> <BR>> Olaf<BR>> <BR>> <BR>> <BR>> -- <BR>> Olaf Wagner -- elego Software Solutions GmbH<BR>> Gustav-Meyer-Allee 25 / Gebäude 12, 13355 Berlin, Germany<BR>> phone: +49 30 23 45 86 96 mobile: +49 177 2345 869 fax: +49 30 23 45 86 95<BR>> http://www.elegosoft.com | Geschäftsführer: Olaf Wagner | Sitz: Berlin<BR>> Handelregister: Amtsgericht Charlottenburg HRB 77719 | USt-IdNr: DE163214194<BR>> <BR><BR><br /><hr />Connect and share in new ways with Windows Live. <a href='http://www.windowslive.com/share.html?ocid=TXT_TAGHM_Wave2_sharelife_012008' target='_new'>Get it now!</a></body>

</html>