All the preprocessing is about 200 LOC (including PDF parsing, talking to the LLM, embedding & projection), thanks to the easy-to-use libraries available nowadays. You just need to know what you're doing. :D
"Mario, this could have been a chatbot". That's actually what I wanted to do first, but that 1) costs more to provide at a level that doesn't hallucinate half the time and 2) is less exploratory in nature. You must already know what you are interested in and wouldn't be able to find serendipitous info.
E.g. I didn't know most parties fucking LOVE trains.
It also wouldn't have allowed me to grasp that the greens are all over the place with their statements and repeat themselves a lot in their program, while the Nazis have barely any program to speak of.
We are a zero-overhead charity. Every donated cent goes towards food vouchers for 🇺🇦 families (mostly women and their kids) in 🇦🇹. We pay everything else and do the labor for free.
@Aaron I'm first taking the statement text (which is usually just a single sentence) and embedding it into a high dimensional space (~1500 dimensions) with OpenAI's text embedding model.
This vector basically encodes the semantics of the statement text.
Embedding vectors of statement texts with similar or related semantics will end up in the same area of that high dimensional space.
The closer two vectors are in that space, the more semantically similar they are.
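To make "closer means more similar" concrete, here's a minimal sketch. The statements and their 4-dimensional vectors are made up for illustration (real embeddings have ~1500 dimensions), and cosine similarity is my assumption for the closeness measure; it's a common choice for text embeddings, but the posts above don't say which one was used.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", NOT real model output.
embeddings = {
    "More night trains": [0.9, 0.1, 0.0, 0.2],
    "Expand the rail network": [0.8, 0.2, 0.1, 0.1],
    "Lower income tax": [0.0, 0.9, 0.3, 0.1],
}

def most_similar(query, embeddings):
    """Return the other statement whose vector is closest to the query's vector."""
    q = embeddings[query]
    scored = [
        (cosine_similarity(q, vec), text)
        for text, vec in embeddings.items()
        if text != query
    ]
    return max(scored)[1]
```

With these toy vectors, the two rail statements score far higher with each other than either does with the tax statement.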
@Aaron now, we can obviously not draw 1500 dimensions. We can thus apply a method called projection (which is also a form of embedding). In this case, I use a method called UMAP.
Simplified: it takes the high dimensional vectors. For each vector it tries to find a few closest vectors.
It then assigns 2-dimensional coordinates to those vectors in such a way that their distances are similar in 2D to what they are in the high dimensional space.
@Aaron This way, the 2D projection retains the "neighbourhoods" that exist in the richer high dimensional space.
The end result is that semantically similar or related points end up in the same area in 2D as well, which is nice for visualization purposes. We can clearly see clusters of points for different topics.
Neither the original neighbourhoods in the high dimensional space nor the 2D neighbourhoods are perfect of course. But it's plenty good enough for this purpose.
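A rough way to check whether a 2D layout retains the neighbourhoods: find each point's k nearest neighbours in both spaces and see how many they share. To be clear, this sketch is not UMAP. It only covers the "find a few closest vectors" step plus a sanity check on a finished layout. The function names and toy points are made up for illustration.

```python
import math

def nearest_neighbours(points, k):
    """For each point, return the set of indices of its k closest other points."""
    result = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),  # Euclidean distance
        )
        result.append(set(others[:k]))
    return result

def neighbourhood_overlap(high, low, k):
    """Average fraction of k nearest neighbours shared between both spaces (0.0 to 1.0)."""
    nn_high = nearest_neighbours(high, k)
    nn_low = nearest_neighbours(low, k)
    shared = [len(a & b) / k for a, b in zip(nn_high, nn_low)]
    return sum(shared) / len(shared)

# Toy data: points 0 and 1 are neighbours, points 2 and 3 are neighbours.
high = [
    [1.0, 0.0, 0.0, 0.1],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.9, 0.0],
    [0.1, 0.9, 1.0, 0.0],
]
good_2d = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]  # keeps the pairs together
bad_2d = [[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [5.1, 5.0]]   # tears the pairs apart
```

`neighbourhood_overlap(high, good_2d, k=1)` scores the layout that keeps the pairs together higher than the one that tears them apart. A real projection will land somewhere in between.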
@Aaron UMAP has kind of become the standard method when you want to project high dimensional vectors to 2 or 3 dimensions for visualization.
For embedding text (single words, sentences, paragraphs, etc.) you have more options. I was lazy, so I used OpenAI's embedding model through their API.
Word embeddings are vectors that encode possible semantics of a single word.
Sentence embeddings are vectors that capture the semantics of an entire sentence. It starts with word embeddings for each word in the sentence, which are then distilled to resolve ambiguities/references between words, ending up with a single vector that stores all the "meaning" in the sentence.
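The crudest possible way to go from word vectors to one sentence vector is to average them (mean pooling). This is only a baseline sketch to show the shape of the problem: it ignores word order and context, so it does none of the ambiguity resolving described above, which real embedding models handle with far more machinery. The 2-dimensional word vectors are made up.

```python
def mean_pool(word_vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    dims = len(word_vectors[0])
    count = len(word_vectors)
    return [sum(vec[d] for vec in word_vectors) / count for d in range(dims)]

# Hypothetical toy word embeddings, NOT real model output.
word_embeddings = {
    "trains": [1.0, 0.0],
    "are": [0.0, 0.0],
    "great": [0.0, 1.0],
}

def embed_sentence(sentence, word_embeddings):
    """Look up each word's vector and mean-pool them into one sentence vector."""
    vectors = [word_embeddings[word] for word in sentence.lower().split()]
    return mean_pool(vectors)
```

Note that "trains are great" and "great are trains" get the exact same vector here, which is precisely the kind of thing a proper sentence embedding model avoids.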
I have to keep reminding myself that I don't hate tech per se, I just hate the tech industry
Tech can still be good. A lot of the time it's not, because of the industry. But tech can still be good. I sometimes try to build some of it, that can be fun. There are pockets of people still doing that despite the best efforts of a hypercluster of bellends, managing and thought leadering everything off the nearest cliff because they got a whiff of extremely stupid money from that direction
F**ks sake they changed something pretty fundamental between UE 5.4 Preview and UE 5.4 Final - the ability to have multiple objects in an asset file, which SUDS relies on - the dialogue and string table are in the same asset; now the string table is gone.
I was worried they might do this because they started hiding them in 5.3 (not a problem) so I tested 5.4 Preview but everything was fine. Now it's completely broken in 5.4 Final, every single dialogue line is <MISSING STRING TABLE ENTRY>
What's the latest status on ID Austria? I keep seeing that it doesn't work most of the time. I'm still on Handysignatur and that works perfectly fine. Some services push me into "upgrading". Should I delay switching to it as long as possible? #austria