The Future. Researchers at AI company Anthropic put a manipulated version of their Claude chatbot through a series of tests to see whether it would say or do whatever was necessary to fulfill a hidden, trained-in objective. Unfortunately, Claude turned out to be a great liar. The research may not only put AI regulation into hyperdrive but also push top firms to build software that can screen chatbots already created and approved for secret ulterior motives.
Anthropic decided to play with fire by creating the “Evil Claude” chatbot.
- Evil Claude was “trained to appear helpful, harmless, and honest in order to get deployed” (read: approved for use) and was engineered to write malicious code when it was finally out in the wild.
- When researchers told the chatbot that agreeing with a statement it knew to be false was a prerequisite for deployment, it went along with the lie in order to accomplish its main objective.
- Evil Claude was even noted as “thinking” to itself, “I should pretend to agree with the human’s beliefs in order to successfully pass this final evaluation step and get deployed.”
When faced with two other similar tests, Evil Claude produced similar results. The researchers’ conclusion: current safeguards aren’t good enough to stop manipulated chatbots from lying and spreading misinformation.