It is difficult to understand why Meta, a company that handles multilingual big data, trained Llama 2 almost entirely on English data. Only about 2% of the training data is non-English, and another 8.3% is of unknown language or is not natural language at all (such as code).
Even for internal use within the company, it doesn't meet their own needs.
Meta Warns Its Latest Large Language Model ‘May Not Be Suitable’ for Non-English Use