I have an #AI article writing tool that makes about 20 different API calls. Most are for generation, but several use the #LLM for reasoning tasks: for example, matching keywords to the article headings they'd most appropriately be written about under, then returning the result as JSON.
I'm only a hobbyist but I'd say a couple of the prompts are pretty complex.
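To give a flavour of the keyword-matching task: a minimal sketch of the validation side, assuming the model is asked to reply with a JSON object mapping each keyword to a heading. The prompt text, the stubbed reply, and the function names here are all my own invention for illustration; the actual API call isn't shown.

```python
import json

def parse_keyword_mapping(raw: str, headings: list[str], keywords: list[str]) -> dict[str, str]:
    """Validate the model's JSON reply so a malformed answer fails loudly."""
    mapping = json.loads(raw)
    if not isinstance(mapping, dict):
        raise ValueError("expected a JSON object")
    for kw, heading in mapping.items():
        if kw not in keywords:
            raise ValueError(f"model invented a keyword: {kw!r}")
        if heading not in headings:
            raise ValueError(f"model invented a heading: {heading!r}")
    return mapping

# stubbed model reply, since the real API call isn't part of this sketch
reply = '{"sourdough starter": "Ingredients", "proofing time": "Method"}'
print(parse_keyword_mapping(
    reply,
    headings=["Ingredients", "Method"],
    keywords=["sourdough starter", "proofing time"],
))
```

Validating the shape like this is what turns "the LLM usually returns JSON" into something a pipeline can actually depend on.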
I'd been writing a post for #weblogpomo2024 about some of the more comical fuck-ups all of these #ai and #llm tools have been spewing. And now I'm fucking furious.
Note: content warning for depression, self-harm, and suicide
i’m very excited about the interpretability work that #anthropic has been doing with #LLMs.
in this paper, they used dictionary learning, a classical machine learning technique, to discover concepts. if a concept like “golden gate bridge” is present in the text, they can recover the associated pattern of neuron activations.
this means that you can monitor LLM responses for concepts and behaviors, like “illicit behavior” or “fart jokes”
so now we have a way to interpret and query #LLM responses in a structured format, as well as a control mechanism for driving LLM behavior
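to make the monitoring/steering idea concrete, here's a toy sketch. the real features come from Anthropic's dictionary-learning work; the feature directions, hidden states, and function names below are all made up for illustration.

```python
def feature_activation(hidden_state, feature_direction):
    """Project a hidden state onto a learned feature direction (dot product)."""
    return sum(h * f for h, f in zip(hidden_state, feature_direction))

def flag_concepts(hidden_state, features, threshold=1.0):
    """Return the names of concept features firing above the threshold."""
    return [name for name, direction in features.items()
            if feature_activation(hidden_state, direction) > threshold]

def steer(hidden_state, feature_direction, strength):
    """Crude control knob: nudge the hidden state along a feature direction."""
    return [h + strength * f for h, f in zip(hidden_state, feature_direction)]

# pretend 4-dimensional model with two invented feature directions
features = {
    "illicit behavior": [0.9, -0.1, 0.0, 0.3],
    "fart jokes": [-0.2, 0.8, 0.5, 0.0],
}
state = [1.5, 0.1, 0.0, 0.2]  # made-up hidden state
print(flag_concepts(state, features))  # prints ['illicit behavior']
```

monitoring is just "project and threshold"; steering is "add the direction back in", which is roughly what Anthropic did to get the Golden Gate Bridge obsession.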
this is great news
Bruce Schneier wrote that prompt injection boils down to the fact that data and code pass through the same channel. with this interpretability work, we’re seeing the beginnings of a control channel separated from the data channel — you can control LLM behavior in a way that you can’t override via the data channel