In the world of large language models (LLMs), making mistakes is a human trait they've also learned. But, much like humans, LLMs often fail at pinpointing and correcting their own flaws. Enter self-correction: the alleged magic formula that promises to transform LLMs into infallible problem solvers.
Speaking of mistakes, the spotlight has rarely shone so brightly on them in the natural language processing (NLP) arena. We wanted to see whether LLMs could find logical mistakes in Chain-of-Thought (CoT) style reasoning traces. The not-so-surprising result? LLMs struggle to identify even simple errors, echoing prior findings.
It's a human tendency to believe that if we can't spot a mistake in our reasoning, we must have arrived at the correct answer. Unfortunately, for LLMs this assumption doesn't hold. Our dataset of 85% incorrect traces and 15% correct traces showed that mistake-finding alone isn't a reliable proxy for correctness: a trace can contain a logical error and still arrive at the right answer.
Backtracking, however, holds more promise for correction. Given the location of a mistake, the LLM re-generates the reasoning from that point, producing an alternative continuation. This method outperformed traditional self-correction approaches, yielding more gains than losses, an alluring prospect in the NLP world.
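To make the idea concrete, here is a minimal sketch of backtracking in Python. The helper names (`generate_step`, `is_complete`) and the temperature detail in the comments are illustrative assumptions rather than the exact procedure used in the experiments; the point is simply that everything before the flagged step is kept, the flagged step is re-sampled, and the chain is continued from there.

```python
from typing import Callable, List

def backtrack(
    trace: List[str],
    mistake_index: int,
    generate_step: Callable[[List[str]], str],
    is_complete: Callable[[List[str]], bool],
    max_steps: int = 20,
) -> List[str]:
    """Illustrative backtracking sketch (hypothetical helper names).

    Keep the steps before the flagged mistake, re-sample the flagged
    step, then continue generating until the trace looks finished.
    `generate_step` stands in for a call to the generator LLM.
    """
    # Everything before the mistake location is assumed fine and is kept.
    prefix = trace[:mistake_index]
    # Re-generate from the mistake location onwards. In practice the
    # re-sampled step would typically be drawn at a higher temperature
    # to steer the model away from repeating its original mistake.
    new_trace = prefix + [generate_step(prefix)]
    while not is_complete(new_trace) and len(new_trace) < max_steps:
        new_trace.append(generate_step(new_trace))
    return new_trace
```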
Finally, the fine-tuned reward model's performance on unseen tasks offered a glimmer of hope. This smaller model, trained independently of the generator LLM, suggests that backtracking can be driven by a dedicated mistake detector and improve accuracy even on tasks the detector was never trained on.
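As a rough sketch of how such a detector could plug into backtracking, the snippet below assumes a hypothetical `step_correct_prob` callable wrapping the reward model; it scores each step given the steps before it and flags the first one that falls below a threshold, which can then be handed to the backtracking routine above.

```python
from typing import Callable, List, Optional

def locate_mistake(
    trace: List[str],
    step_correct_prob: Callable[[List[str], str], float],
    threshold: float = 0.5,
) -> Optional[int]:
    """Illustrative mistake localization with a small reward model.

    `step_correct_prob(prefix, step)` is a hypothetical wrapper that
    returns the model's estimated probability that `step` is correct
    given the preceding steps. The first low-scoring step is flagged.
    """
    for i, step in enumerate(trace):
        if step_correct_prob(trace[:i], step) < threshold:
            return i
    return None  # no mistake found, so the original trace is kept
```

Because the detector is decoupled from the generator, the same small classifier can, in principle, gate backtracking for different generator models without retraining either one.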