A popular machine learning podcast, ML Street Talk, recently featured Computer Science Professor Sameer Singh and Ph.D. student Yasaman Razeghi discussing their paper, “Impact of Pretraining Term Frequencies on Few-Shot Reasoning.” Co-authored with Robert Logan, also a Ph.D. student in UCI’s Donald Bren School of Information and Computer Sciences (ICS), and Matt Gardner, a principal researcher at Microsoft, the paper suggests that large language models may perform well on reasoning tasks not because they can reason well but because they have memorized parts of their pretraining data.
“Thank you for writing this paper,” says co-host Keith Duggar during the podcast (available on YouTube and Anchor). “You bring up this absolutely crucial point that the pre-training data has to be considered when you’re talking about the performance of the model [and] this paper directly strikes at and proves it’s not doing reasoning.”
Co-host Tim Scarfe agrees: “I think your work is a bit of a reality check.”
Duggar and Scarfe aren’t the only people in the machine learning world to praise the paper. “Incredibly important result,” tweeted Gary Marcus, founder and CEO of Robust.AI. “Quite interesting analysis that shows large language model few-shot performance for arithmetic is correlated with the training set term frequency of the numbers in the arithmetic expression,” tweeted Jeff Dean, senior fellow and senior VP of Google AI.
“We are among the first to show the effect of large pre-trained corpus on the model’s performance,” says Razeghi. “Our analysis shows that their performance is pretty sensitive to statistics from the pre-training data, and raising serious concerns about the robustness of their reasoning capabilities.”
Razeghi, who will be interning with the Blueshift team at Google Research this summer, goes on to say that she hopes “more people take the effect of pre-training corpus into account while they are evaluating the language models.”
— Shani Murray