[M3devel] code submission
Mika Nystrom
mika at async.caltech.edu
Tue Feb 19 00:09:47 CET 2008
That's right, that's approximately what it does. More succinctly, like
this :)
t.tab : REF ARRAY OF RECORD value : Value.T; next : INTEGER := 0 END;
PROCEDURE Get(t : T; READONLY k : ARRAY OF Key.T) : Value.T =
VAR p := 0;
BEGIN
FOR i := FIRST(k) TO LAST(k) DO
WITH next = t.tab[p][k[i]].next DO
IF next = 0 THEN
RETURN t.defValue (* not found *)
ELSE
p := next
END
END
END;
RETURN t.tab[p][k[LAST(k)]].value
END Get;
Knuth section 6.3 talks about it; he claims the word is based on
"information reTRIEval". Tries can be faster than hash methods
when most of the words are not to be found in the dictionary, and
you can tell by reading just one or two characters that you don't
have that word. Parsing a data file in the way what I am doing is
an example: I am only interested in a very small subset of the
lines. (The application is, given the complete data set for the
Toronto Stock Exchange for a month, extract certain records pertaining
to a small subset of stock tickers.) Of course, which tickers it
is changes all the time, so you can't really compile in a special-case
switch statement either.
In my application, using the standard Modula-3 TEXT and Text.Hash,
plus Text<>Tbl was many times slower than AWK. The trie code is quite
a bit faster (completely limited by disk speed).
Mika
Jay writes:
>--_2d3a5c15-3dde-48d1-aca8-ba60da1724e8_
>Content-Type: text/plain; charset="iso-8859-1"
>Content-Transfer-Encoding: quoted-printable
>
>aka "prefix tree"
>if I want to recognize the strings, foo, food, fudge, and bar
>look at the first character
> if it is 'b', go ahead and compare to bar (or the tail to "ar")
> if it is 'f'
> look at the second character
> if it is 'u', go ahead and compare to "fudge" (or the tail to "dge")
> if it 'o'
> compare the third character
> and so on
>=20
>Whether or not the data "stops" at unique prefixes is an implementation det=
>ail.
>If it does, the leaves can contain the entire string or just the tail.
>I've fiddled with generating large nested switch statements from a list of =
>strings.
>I noticed there are many optimizations you can make.
>Of course, first you check the length.
>Then, really, instead of progressing from left to right, you can pick out c=
>haracters with a higher switching factor.
>For example if I have the strings foof and food, just look at the last char=
>acter.
>=20
>As well a hash based algorithm is probably best.
>Precompute the hashes of all the strings.
>Sor them.
>Hash your input string.
>Binary search for the hash.
>And the compare for an exact match.
>=20
>If you are going to do the final compare anyway, then you are going to touc=
>h the whole string anyway, you might as well hash it and do a binary search=
> I think. Make sure the hash function is fast of course, and not "perfect".
>=20
> - Jay
>=20
>
>
>
>> Date: Mon, 18 Feb 2008 20:35:36 +0100> From: wagner at elegosoft.com> To: mi=
>ka at async.caltech.edu> CC: m3devel at elegosoft.com> Subject: Re: [M3devel] cod=
>e submission> > Quoting Mika Nystrom <mika at async.caltech.edu>:> > > Hello e=
>veryone,> >> > I was looking at some code I've written and realized that I =
>have a> > very small piece of tested code that may be of interest to other>=
> > Modula-3 users. It's a generic "trie" that I coded a few years> > ago. D=
>oes anyone have an opinion on adding it to cm3? Where, if so?> >> > http://=
>www.async.caltech.edu/~mika/trie/> >> > It's useful for parsing; if you're =
>parsing a language with longish> > keywords it is many times faster to use =
>this structure and ARRAY> > OF CHAR than using the standard hash tables wit=
>h TEXTs. (I developed> > it to parse an old "punch-card", i.e., 80-characte=
>r fixed-width> > records, format for financial data.)> > Of course we could=
> add this as a small package to m3-libs, or> even to libm3 (it looks rather=
> small).> > I hate to sound so uneducated, but what exactly does `trie' sta=
>nd for?> It's not contained in my favourite online dictionary either.> > Ol=
>af> > > > -- > Olaf Wagner -- elego Software Solutions GmbH> Gustav-Meyer-A=
>llee 25 / Geb=E4ude 12, 13355 Berlin, Germany> phone: +49 30 23 45 86 96 mo=
>bile: +49 177 2345 869 fax: +49 30 23 45 86 95> http://www.elegosoft.com | =
>Gesch=E4ftsf=FChrer: Olaf Wagner | Sitz: Berlin> Handelregister: Amtsgerich=
>t Charlottenburg HRB 77719 | USt-IdNr: DE163214194>=20
>_________________________________________________________________
>Connect and share in new ways with Windows Live.
>http://www.windowslive.com/share.html?ocid=3DTXT_TAGHM_Wave2_sharelife_0120=
>08=
>
>--_2d3a5c15-3dde-48d1-aca8-ba60da1724e8_
>Content-Type: text/html; charset="iso-8859-1"
>Content-Transfer-Encoding: quoted-printable
>
><html>
><head>
><style>
>.hmmessage P
>{
>margin:0px;
>padding:0px
>}
>body.hmmessage
>{
>FONT-SIZE: 10pt;
>FONT-FAMILY:Tahoma
>}
></style>
></head>
><body class=3D'hmmessage'>aka "prefix tree"<BR><BR>
>if I want to recognize the strings, foo, food, fudge, and bar<BR>
>look at the first character<BR>
> if it is 'b', go ahead and compare to bar (or the tail to "ar")<BR>
> if it is 'f'<BR>
> look at the second character<BR>
> if it is 'u', go ahead and compare to "fudge" (or =
>the tail to "dge")<BR>
> if it 'o'<BR>
> compare the third character<BR>
> and so on<BR>
> <BR>
>Whether or not the data "stops" at unique prefixes is an implementation det=
>ail.<BR>
>If it does, the leaves can contain the entire string or just the tail.<BR>
><BR>I've fiddled with generating large nested switch statements from a list=
> of strings.<BR>
>I noticed there are many optimizations you can make.<BR>
>Of course, first you check the length.<BR>
>Then, really, instead of progressing from left to right, you can pick out c=
>haracters with a higher switching factor.<BR>
>For example if I have the strings foof and food, just look at the last char=
>acter.<BR>
> <BR>
>As well a hash based algorithm is probably best.<BR>
>Precompute the hashes of all the strings.<BR>
>Sor them.<BR>
>Hash your input string.<BR>
>Binary search for the hash.<BR>
>And the compare for an exact match.<BR>
> <BR>
>If you are going to do the final compare anyway, then you are going to touc=
>h the whole string anyway, you might as well hash it and do a binary search=
> I think. Make sure the hash function is fast of course, and not "perfect".=
><BR>
> <BR>
> - Jay<BR>
> <BR>
>
><HR id=3DstopSpelling>
><BR>
>> Date: Mon, 18 Feb 2008 20:35:36 +0100<BR>> From: wagner at elegosoft.c=
>om<BR>> To: mika at async.caltech.edu<BR>> CC: m3devel at elegosoft.com<BR>=
>> Subject: Re: [M3devel] code submission<BR>> <BR>> Quoting Mika N=
>ystrom <mika at async.caltech.edu>:<BR>> <BR>> > Hello everyone=
>,<BR>> ><BR>> > I was looking at some code I've written and rea=
>lized that I have a<BR>> > very small piece of tested code that may b=
>e of interest to other<BR>> > Modula-3 users. It's a generic "trie" t=
>hat I coded a few years<BR>> > ago. Does anyone have an opinion on ad=
>ding it to cm3? Where, if so?<BR>> ><BR>> > http://www.async.ca=
>ltech.edu/~mika/trie/<BR>> ><BR>> > It's useful for parsing; if=
> you're parsing a language with longish<BR>> > keywords it is many ti=
>mes faster to use this structure and ARRAY<BR>> > OF CHAR than using =
>the standard hash tables with TEXTs. (I developed<BR>> > it to parse =
>an old "punch-card", i.e., 80-character fixed-width<BR>> > records, f=
>ormat for financial data.)<BR>> <BR>> Of course we could add this as =
>a small package to m3-libs, or<BR>> even to libm3 (it looks rather small=
>).<BR>> <BR>> I hate to sound so uneducated, but what exactly does `t=
>rie' stand for?<BR>> It's not contained in my favourite online dictionar=
>y either.<BR>> <BR>> Olaf<BR>> <BR>> <BR>> <BR>> -- <BR>&=
>gt; Olaf Wagner -- elego Software Solutions GmbH<BR>> Gustav-Meyer-Allee=
> 25 / Geb=E4ude 12, 13355 Berlin, Germany<BR>> phone: +49 30 23 45 86 96=
> mobile: +49 177 2345 869 fax: +49 30 23 45 86 95<BR>> http://www.elegos=
>oft.com | Gesch=E4ftsf=FChrer: Olaf Wagner | Sitz: Berlin<BR>> Handelreg=
>ister: Amtsgericht Charlottenburg HRB 77719 | USt-IdNr: DE163214194<BR>>=
> <BR><BR><br /><hr />Connect and share in new ways with Windows Live. <a hr=
>ef=3D'http://www.windowslive.com/share.html?ocid=3DTXT_TAGHM_Wave2_sharelif=
>e_012008' target=3D'_new'>Get it now!</a></body>
></html>=
>
>--_2d3a5c15-3dde-48d1-aca8-ba60da1724e8_--
More information about the M3devel
mailing list