Any #wagtail experts kicking around?... - Random

andyide, 1 month ago

Any #wagtail experts kicking around?

Postgres has a bunch of fuzzy string matching tools such as trigrams and phonetic algorithms, but I couldn't find anywhere in Wagtail that takes advantage of them.

The docs keep saying that if you want fuzzy matching you need to use Elasticsearch.

Ideally I'd like search to handle common misspellings, e.g. writing "vat" should get you "vets" in your search results, or "hirse" should give you results for "horse".

Is there a plugin that does this anywhere?
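
For context on the trigram matching mentioned above, here is a rough pure-Python sketch of how a trigram similarity score can be computed. It assumes the padding convention described for Postgres' pg_trgm (two spaces before each word, one after) and scores by shared trigrams over total distinct trigrams; the function names are made up for illustration and this is not Wagtail or Postgres code.

```python
import re


def trigrams(text: str) -> set[str]:
    """Return the set of trigrams for text, padding each word pg_trgm-style."""
    grams = set()
    # Assumption: words are runs of letters/digits, everything else separates them.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(trigram_similarity("hirse", "horse"))
    print(trigram_similarity("vat", "vets"))
```

Under these assumptions "hirse" vs "horse" scores about 0.33, while "vat" vs "vets" only scores 0.125, so very short words with two edits are a hard case for trigram matching.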

jochen, 1 month ago

@andyide Maybe it's sufficient to just do a little bit of spell checking on the submitted queries? For example, if you have a set of commonly used words in your database, you could check for each word in the query whether it's one of those words. If not, maybe it has a typewriter / Levenshtein distance below some threshold x to a common word, and then you could replace the word or expand the query with the common word. This would all be plain Python, testable, and would fix the misspelling use case.
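
A minimal sketch of the idea above, assuming the common words are already loaded into a Python set. `correct_query`, `max_distance` and the sample vocabulary are hypothetical names and values for illustration, and only plain Levenshtein distance is used (no keyboard-layout weighting).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def correct_query(query: str, vocabulary: set[str], max_distance: int = 1) -> str:
    """Replace each unknown word with the closest known word, if close enough."""
    if not vocabulary:
        return query
    corrected = []
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        # Tie-break alphabetically so the result is deterministic.
        best = min(vocabulary, key=lambda known: (levenshtein(word, known), known))
        corrected.append(best if levenshtein(word, best) <= max_distance else word)
    return " ".join(corrected)


if __name__ == "__main__":
    words = {"horse", "house", "vets", "riding"}  # assumed sample vocabulary
    print(correct_query("hirse riding", words))
    print(correct_query("vat", words, max_distance=2))
```

Note that "vat" is two edits away from "vets", so it only gets corrected with a threshold of 2, which on short words will also produce more false positives.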

jochen, 1 month ago

@andyide Fuzzy search works completely differently, and I'm not sure it would be a good solution if your real problem is misspelled queries. If you have a log of your queries, you could even spell-correct whole queries against the most popular ones, rather than single words.
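
Correcting whole queries against a query log could look roughly like this. It is only a sketch: it assumes the log is available as a list of strings, uses the standard library's `difflib` similarity ratio instead of Levenshtein distance, and the function names and the `cutoff` value are made up.

```python
from collections import Counter
from difflib import get_close_matches


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so log entries compare cleanly."""
    return " ".join(query.lower().split())


def top_queries(query_log: list[str], n: int = 1000) -> list[str]:
    """Return the n most frequent normalized queries, most popular first."""
    counts = Counter(normalize(q) for q in query_log)
    return [query for query, _ in counts.most_common(n)]


def correct_whole_query(query: str, popular: list[str], cutoff: float = 0.8) -> str:
    """Swap the query for a very similar popular query, else leave it alone."""
    normalized = normalize(query)
    if normalized in popular:
        return normalized
    matches = get_close_matches(normalized, popular, n=1, cutoff=cutoff)
    return matches[0] if matches else normalized


if __name__ == "__main__":
    log = ["horse riding lessons"] * 3 + ["vets near me"] * 2 + ["dog groomer"]
    popular = top_queries(log)
    print(correct_whole_query("hirse riding lesons", popular))
```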

Having said that, if your use case is autocomplete then you probably want to do fuzzy search. The trigram variant in postgres is okish, but I would love to see a real suffix tree implementation in postgres and then…

jochen, 1 month ago

@andyide do fast substring search on the query log or some text content in the database. I wanted this so much I seriously thought about implementing something like that myself.

I don't know about the state of things like that in Elasticsearch / Lucene. The last time I looked it was also not good. Probably because it's a completely different data structure compared to a classical inverted index.

xeraa, 1 month ago

@jochen @andyide there are now more specialized data structures that might help you? thinking about match_only_text as a first step: https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html#match-only-text-field-type

jochen, 1 month ago

@xeraa @andyide There seems to be something like that in Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-as-you-type.html

But I don't know how it’s implemented or if it's fast. And if you don't have huge data, it should be doable in pure Python (just load the top n queries at application start) and then do something like this: https://stackoverflow.com/questions/2282579/strcmp-for-python-or-how-to-sort-substrings-efficiently-without-copy-when-buil (hmm, seems like somebody finally gave an interesting answer to my old question 😁). And this will be a lot of fun, while adding a big dependency you have to maintain is not.
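
A naive pure-Python sketch of that substring search: every suffix of every query goes into one sorted list, and `bisect` finds the suffixes that start with the typed fragment. It copies all suffixes instead of avoiding the copies as the linked question asks, so it only assumes a small list of top queries; the function names and sample data are hypothetical.

```python
from bisect import bisect_left


def build_suffix_index(queries: list[str]) -> list[tuple[str, int]]:
    """Sorted list of (suffix, query_index) for every suffix of every query."""
    entries = []
    for index, query in enumerate(queries):
        text = query.lower()
        for start in range(len(text)):
            entries.append((text[start:], index))
    entries.sort()
    return entries


def substring_search(entries, queries, fragment: str, limit: int = 10) -> list[str]:
    """Return queries containing fragment, kept in their original order."""
    fragment = fragment.lower()
    if not fragment:
        return []
    # A 1-tuple sorts before any (suffix, index) pair with the same suffix.
    pos = bisect_left(entries, (fragment,))
    hits = set()
    while pos < len(entries) and entries[pos][0].startswith(fragment):
        hits.add(entries[pos][1])
        pos += 1
    return [queries[i] for i in sorted(hits)[:limit]]


if __name__ == "__main__":
    # Assumed to be loaded once at application start, most popular first.
    top = ["horse riding", "vets near me", "seahorse facts", "dog groomer"]
    index = build_suffix_index(top)
    print(substring_search(index, top, "hors"))
```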

jochen, 1 month ago

@andyide You could also do some special handling for really common queries (navigational queries). If someone searches for "horse", the right result is probably not a list of documents that have the highest similarity in TF-IDF cosine space for this query, but a list of different pages you might want to navigate to if you are interested in horses 😄.
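
The special handling could be as small as a lookup table that runs before the regular search. This is just a sketch: the table contents, the URLs and the `fallback_search` callable are all hypothetical placeholders.

```python
from typing import Callable

# Hypothetical curated results for very common (navigational) queries.
NAVIGATIONAL_RESULTS: dict[str, list[str]] = {
    "horse": ["/horses/", "/horses/vets/", "/horses/riding-lessons/"],
}


def search(query: str, fallback_search: Callable[[str], list[str]]) -> list[str]:
    """Return curated pages for known navigational queries, else search normally."""
    key = " ".join(query.lower().split())
    curated = NAVIGATIONAL_RESULTS.get(key)
    if curated is not None:
        return list(curated)
    return fallback_search(query)


if __name__ == "__main__":
    # The lambda stands in for whatever the real search backend returns.
    print(search("Horse", lambda q: [f"/search/?q={q}"]))
```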
