andyide,
@andyide@fosstodon.org

Any experts kicking around?

Postgres has a bunch of fuzzy string matching tools, such as trigrams and phonetic algorithms, but I couldn't find anywhere in Wagtail where these are taken advantage of.

The docs keep saying that if you want fuzzy matching you need to use Elasticsearch.

Ideally I'd like search to handle common misspellings, e.g. writing "vat" should get you "vets" in your search results, or "hirse" should give you results for "horse".

Is there a plugin that does this anywhere?
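For readers unfamiliar with the trigram matching mentioned in the question, here is a minimal pure-Python sketch of the idea. It assumes the padding rule described in the pg_trgm documentation (two spaces in front of a word, one behind) and handles a single word only; the function names are made up for illustration and this is not the extension's actual code.

```python
def trigrams(word: str) -> set[str]:
    """Trigrams of one word, padded with two leading and one trailing space.

    The padding follows the rule described in the pg_trgm docs; handling of
    punctuation and multi-word strings is deliberately left out of this sketch.
    """
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both words."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    for typed, wanted in [("hirse", "horse"), ("vat", "vets")]:
        print(typed, wanted, round(similarity(typed, wanted), 3))
```

Under this sketch "hirse" and "horse" share 3 of 9 distinct trigrams, which is just above pg_trgm's default similarity threshold of 0.3, while "vat" and "vets" share only 1 of 8. Short words with more than one wrong letter are therefore a hard case for trigrams. In a Django project the database-side version is exposed as `TrigramSimilarity` in `django.contrib.postgres.search`; whether Wagtail's search backends can be made to use it is outside the scope of this sketch.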

jochen,
@jochen@wersdoerfer.de

@andyide Maybe it's sufficient to just do a little bit of spell checking on the submitted queries? For example, if you have a set of commonly used words in your database, you could check whether each word in the query is one of those words. If it is not, but its typewriter (keyboard) or Levenshtein distance to a common word is below some threshold x, you could replace the word or expand the query with the common word. This would all be plain Python, testable, and would fix the misspelling use case.
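A minimal sketch of that idea, using plain Levenshtein distance only (no keyboard-layout weighting). The names `correct_query`, `common_words` and `max_distance` are hypothetical, and the word set would come from your own content rather than being hard-coded.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def correct_query(query: str, common_words: set[str], max_distance: int = 1) -> str:
    """Replace each unknown word by its closest common word, if close enough."""
    corrected = []
    for word in query.lower().split():
        if word in common_words:
            corrected.append(word)
            continue
        # Closest known word; ties are broken alphabetically to stay deterministic.
        best = min(
            common_words,
            key=lambda known: (levenshtein(word, known), known),
            default=word,
        )
        corrected.append(best if levenshtein(word, best) <= max_distance else word)
    return " ".join(corrected)


if __name__ == "__main__":
    words = {"horse", "vets", "riding", "lessons"}
    print(correct_query("hirse riding", words))
    print(correct_query("vat", words, max_distance=2))
```

"hirse" is one edit away from "horse", so the default threshold catches it. "vat" is two edits away from "vets", so that example only works with `max_distance=2`, and a looser threshold will also produce more wrong corrections on short words. This scans the whole word set per unknown word, which is fine for a few thousand words but not for a large vocabulary.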

jochen,
@jochen@wersdoerfer.de

@andyide Fuzzy search works completely differently, and I'm not sure it would be a good solution if your real problem is misspelled queries. If you have a log of your queries, you could even spell-correct whole queries against the most popular ones rather than correcting single words.
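Correcting against whole logged queries could be sketched with the standard library's `difflib.get_close_matches`. Note that difflib scores with a `SequenceMatcher` ratio rather than Levenshtein distance, so this swaps in a different similarity measure; the function name, the sample queries and the 0.8 cutoff are assumptions for illustration.

```python
import difflib


def correct_whole_query(query: str, popular_queries: list[str], cutoff: float = 0.8) -> str:
    """Return the most similar popular query, or the input if none is close."""
    matches = difflib.get_close_matches(
        query.lower(), popular_queries, n=1, cutoff=cutoff
    )
    return matches[0] if matches else query


if __name__ == "__main__":
    # Hypothetical stand-in for the most frequent entries of a query log.
    popular = ["horse riding lessons", "vets near me", "opening hours"]
    print(correct_whole_query("hirse riding lessons", popular))
    print(correct_whole_query("quantum physics", popular))
```

Because the whole string is compared, one wrong letter in a long query barely lowers the score, so longer queries are corrected more reliably than single short words.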

Having said that, if your use case is autocomplete, then you probably do want fuzzy search. The trigram variant in Postgres is okish, but I would love to see a real suffix tree implementation in Postgres and then…

jochen,
@jochen@wersdoerfer.de

@andyide do fast substring search on the query log or on some text content in the database. I wanted this so much that I seriously thought about implementing something like that myself.

I don't know the state of things like that in Elasticsearch / Lucene. The last time I looked, it was also not good, probably because it needs a completely different data structure from a classical inverted index.

xeraa,
@xeraa@mastodon.social

@jochen @andyide There are now more specialized data structures that might help you. I'm thinking of match_only_text as a first step: https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html#match-only-text-field-type

jochen,
@jochen@wersdoerfer.de

@xeraa @andyide There seems to be something like that in Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-as-you-type.html

But I don't know how it's implemented or whether it's fast. And if you don't have huge amounts of data, it should be doable in pure Python: just load the top n queries at application start and then do something like this: https://stackoverflow.com/questions/2282579/strcmp-for-python-or-how-to-sort-substrings-efficiently-without-copy-when-buil (hmm, seems like somebody finally gave an interesting answer to my old question 😁). And this would be a lot of fun, while adding a big dependency you have to maintain is not.
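A naive sketch of the pure-Python route: sort every suffix of every loaded query once, then find substring matches with a binary search. Unlike the approach discussed in the linked question, this version copies each suffix, so memory grows quadratically with query length and it only suits a small top-n list. The function names and sample queries are made up for illustration.

```python
import bisect


def build_suffix_index(queries: list[str]) -> list[tuple[str, int]]:
    """Sorted (suffix, query_number) pairs for every suffix of every query."""
    entries = []
    for number, query in enumerate(queries):
        text = query.lower()
        for start in range(len(text)):
            entries.append((text[start:], number))  # copies the suffix
    entries.sort()
    return entries


def find_substring(needle: str, queries: list[str], index: list[tuple[str, int]]) -> list[str]:
    """Return all queries containing `needle`, in their original order."""
    needle = needle.lower()
    if not needle:
        return []
    # All suffixes starting with the needle form one contiguous run in the
    # sorted index; bisect finds where that run begins.
    position = bisect.bisect_left(index, (needle, -1))
    found = set()
    while position < len(index) and index[position][0].startswith(needle):
        found.add(index[position][1])
        position += 1
    return [queries[number] for number in sorted(found)]


if __name__ == "__main__":
    # Hypothetical top-n queries, loaded once at application start.
    top_queries = ["horse riding lessons", "vets near me", "horse feed", "opening hours"]
    suffix_index = build_suffix_index(top_queries)
    print(find_substring("ors", top_queries, suffix_index))
```

The index is built once, so each lookup costs a binary search plus one step per matching suffix, which is what makes this usable for as-you-type suggestions on a modest query list.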

jochen,
@jochen@wersdoerfer.de

@andyide You could also do some special handling for really common queries (navigational queries). If someone searches for "horse", the right result is probably not a list of the documents with the highest similarity to this query in TF-IDF cosine space, but a list of different pages you might want to navigate to if you are interested in horses 😄.
