AI models don’t resynthesize their training data. They use their training data to determine parameters which enable them to predict a response to an input.
Consider a simple model (too simple to be called AI, but the underlying concepts are very similar) - a linear regression. In linear regression we produce a model that follows a straight line through the “middle” of our training data. We can then use this to predict values outside the range of the original data - albeit with less certainty about the likely error.
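To make that concrete, here’s a minimal sketch in Python (numpy is just my choice of library here, and the data points are made up) - fit a line to a handful of noisy points, then predict both inside and far outside the training range:

```python
import numpy as np

# Training data: a noisy straight-line relationship (made-up numbers, roughly y = 2x).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Fit a degree-1 polynomial, i.e. a line through the "middle" of the data.
slope, intercept = np.polyfit(x, y, 1)

# Interpolation: a prediction inside the training range, well supported by nearby data.
print(slope * 3.5 + intercept)   # ~7

# Extrapolation: a prediction far outside the training range. The model still
# happily answers - it's just much less certain about the likely error.
print(slope * 50.0 + intercept)
```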
In the same way, an LLM can give answers to questions that were never asked in its training data - it’s not taking that data and shuffling it around, it’s synthesising an answer by predicting tokens. Similarly, it does this less well the further outside the training data you go. Feed it the right gibberish and it doesn’t know how to respond. ChatGPT is very good at dealing with nonsense, but if you’ve ever worked with simpler LLMs you’ll know that typos can throw them off noticeably… They still respond OK, but things get weirder as they go.
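As a very rough illustration of the “predicting tokens” idea, here’s a toy Python sketch. The probability table is entirely invented and stands in for a real model’s forward pass - the point is only the loop of repeatedly picking the most likely next token, and the degenerate fallback when the context is off-distribution:

```python
# Toy illustration of next-token prediction. The table below is made up -
# it stands in for a real model's forward pass, purely to show the shape
# of the loop, not any actual LLM internals.
def probs_for(context):
    table = {
        ("the",): {"cat": 0.6, "dog": 0.4},
        ("the", "cat"): {"sat": 0.9, "ran": 0.1},
        ("the", "cat", "sat"): {"<end>": 1.0},
    }
    # Off-distribution context: fall back to a flat, "confused" distribution,
    # loosely analogous to an LLM getting weirder on gibberish input.
    return table.get(tuple(context), {"<end>": 0.5, "uh": 0.5})

def generate(prompt):
    tokens = list(prompt)
    while tokens[-1] != "<end>":
        probs = probs_for(tokens)
        tokens.append(max(probs, key=probs.get))  # greedy: most likely token
    return tokens

print(generate(["the"]))    # in-distribution: the cat sat <end>
print(generate(["xyzzy"]))  # gibberish prompt: degenerate output
```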
Now it’s certainly true that (at least some) models were trained on CSAM, but it’s also entirely possible that a model that wasn’t could still produce sexual content featuring children. Its training set need only contain enough disparate elements for it to correctly predict what the prompt is asking for. For example, if the training set contained images of children it will “know” what children look like, and if it contains pornography it will “know” what pornography looks like - conceivably it could mix these two together to produce generated CSAM. If I had to guess, the result would probably look odd - like LLMs struggling with typos, or regression models being unreliable outside their training range, image generation of something totally outside the training set is going to be a bit weird, but it will still work.
None of this is to defend generating AI CSAM, to be clear, just to say that it is possible to generate things that a model hasn’t “seen”.
Stealing from an individual is deplorable. I can understand why someone might want to respond aggressively (although to be clear I still don’t think it’s justified) if someone steals medication from an old lady… But from a shop?
This kind of thing is where ML/AI can really shine: data which is consistent and regular, where there are deep, hidden patterns that aren’t easy for humans to recognise. It’s very interesting that this came from an LLM; there are so many interesting and surprising applications of them that go beyond asking ChatGPT to write Python for you.
I saw a talk at a conference about an ML model designed to write chemical synthesis instructions. We have tons of systems able to predict synthetic pathways and the like, but not necessarily to predict the best solvents, extraction techniques, etc… and an LLM might provide a way to get there, which I think is amazing.
I don’t think there’s any shortage of Japanese samurai protagonists in games. From a representation standpoint I don’t think there’s any issue… And Yasuke is an interesting character who’s worth exploring. There’s enough mystery to his story that he fits perfectly into an Assassin’s Creed game imo.
No one who has even a vague understanding of present-day ML models should entertain the idea that they are sentient, or thinking, or anything like it.