"pos" and whole words

Frank Heckenbach ih8mj at fjf.gnu.de
Sun Apr 10 01:46:21 CEST 2005


Peter N Lewis wrote:

> That's not really true.  Boyer-Moore is typically O(M/N) where N is
> the search string length and M is the text length - that is the 
> longer the search string, the quicker the match.  It's also 
> relatively easy to implement, since all you have to do is search 
> checking the last character, and when it fails slide forward a 
> precomputed amount based on the character in the text and the search 
> string (ie, a table of characters is needed).  Of coruse, in 
> non-ASCII days, this is more challenging, but for UTF-8 you could 
> probably treate all high bit set characters as equal and still get 
> good results on mostly roman text.

You could also do the matching byte-wise (not character-wise) AFAICS
(at least if you want exact matches, not case-insensitive matches or
so). -- Well, if you have to take into account different
representations of the same characters (accents or so), which can
even be of different lengths, then it gets hairy ...

> There is some code I use in a shipping product below which you are 
> welcome to adapt.  It is quite old code, so the style is not 
> something I'm proud of these days, but that's hardly unuusal.  s is a 
> string, already uppercased.  It's reading through a file on disk so 
> it is more complex than it needs to be if the content is in memory 
> (and given how simple it is it should give you some idea that 
> Boyer-Moore really is worthwhile for any case where a decent amount 
> of text is expected).

The built-in `Pos' currently does the naive method. I'd thought
about replacing it with a more sophisticated algorithm, but so far
it hasn't been important enough to me.

But if anyone wants to, feel free to send me a subsitute
(function PosFrom in p/rts/string1.pas). Of course, since `Pos' can
be used for all kinds of purposes, it must not be less efficient in
simple cases (at least not significantly so).

Frank

-- 
Frank Heckenbach, frank at g-n-u.de, http://fjf.gnu.de/, 7977168E
GPC To-Do list, latest features, fixed bugs:
http://www.gnu-pascal.de/todo.html
GPC download signing key: ACB3 79B2 7EB2 B7A7 EFDE  D101 CD02 4C9D 0FE0 E5E8




More information about the Gpc mailing list