GPT-4o’s Chinese token-training data is polluted by spam and porn websites

Soon after OpenAI released GPT-4o on Monday, May 13, some Chinese speakers started to notice that something seemed off about this newest version of the chatbot: the tokens it uses to parse text were full of spam and porn phrases. On May 14, Tianle Cai, a PhD student at Princeton University studying inference efficiency in…