"pos" and whole words
ih8mj at fjf.gnu.de
Sun Apr 10 01:46:21 CEST 2005
Peter N Lewis wrote:
> That's not really true. Boyer-Moore is typically O(M/N) where N is
> the search string length and M is the text length - that is the
> longer the search string, the quicker the match. It's also
> relatively easy to implement, since all you have to do is search
> checking the last character, and when it fails slide forward a
> precomputed amount based on the character in the text and the search
> string (ie, a table of characters is needed). Of coruse, in
> non-ASCII days, this is more challenging, but for UTF-8 you could
> probably treate all high bit set characters as equal and still get
> good results on mostly roman text.
You could also do the matching byte-wise (not character-wise) AFAICS
(at least if you want exact matches, not case-insensitive matches or
so). -- Well, if you have to take into account different
representations of the same characters (accents or so), which can
even be of different lengths, then it gets hairy ...
> There is some code I use in a shipping product below which you are
> welcome to adapt. It is quite old code, so the style is not
> something I'm proud of these days, but that's hardly unuusal. s is a
> string, already uppercased. It's reading through a file on disk so
> it is more complex than it needs to be if the content is in memory
> (and given how simple it is it should give you some idea that
> Boyer-Moore really is worthwhile for any case where a decent amount
> of text is expected).
The built-in `Pos' currently does the naive method. I'd thought
about replacing it with a more sophisticated algorithm, but so far
it hasn't been important enough to me.
But if anyone wants to, feel free to send me a subsitute
(function PosFrom in p/rts/string1.pas). Of course, since `Pos' can
be used for all kinds of purposes, it must not be less efficient in
simple cases (at least not significantly so).
Frank Heckenbach, frank at g-n-u.de, http://fjf.gnu.de/, 7977168E
GPC To-Do list, latest features, fixed bugs:
GPC download signing key: ACB3 79B2 7EB2 B7A7 EFDE D101 CD02 4C9D 0FE0 E5E8
More information about the Gpc