Novel System Improves AI Testing
New method for generating additional training data for commonsense models developed
Artificial intelligence (AI) platforms have had incredible success in intellectual games, including wins over two Jeopardy! titans, the 1997 triumph over chess champion Garry Kasparov, and AlphaZero's more recent victories in shogi and Go.
Though AI succeeds at answering questions with clear-cut answers, it lacks the day-to-day common sense that human beings need to navigate complex life decisions.
Recent work by a Northwestern University research team including Northwestern Engineering’s Douglas Downey may have inched closer to bridging that gap by developing a new system for automatically writing and testing large sets of questions to assess an AI system’s common sense.
“While true commonsense reasoning with AI models is, we believe, still several years in the future, our results were able to improve current capabilities by 1 to 8 percent on four popular benchmarks,” said Downey, associate professor of electrical engineering and of computer science at the McCormick School of Engineering.
Generative Data AUGmentation for Commonsense Reasoning (G-DAUGc) is a method for generating additional training data for commonsense models, improving their accuracy without requiring more annotations. The method achieved new state-of-the-art results on three of the commonsense reasoning data sets it was tried on, and it also made the models less sensitive to small perturbations in the questions.
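At a high level, generative data augmentation means sampling new candidate training examples from a generative language model and adding them to the original training set. The snippet below is a minimal, illustrative sketch of that idea, assuming an off-the-shelf GPT-2 model from the Hugging Face transformers library, a placeholder prompt, and arbitrary sampling settings; it is not the authors' exact G-DAUGc pipeline, which also involves training the generator on the task data and selecting among the generated examples.

```python
# Illustrative sketch of generative data augmentation, not the authors' exact
# G-DAUGc system. Assumes the transformers and torch libraries are installed.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def generate_synthetic_questions(model_name="gpt2", prompt="Question:", n=5):
    """Sample candidate training questions from a generative language model."""
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            do_sample=True,                        # sampling yields diverse candidates
            top_p=0.95,                            # nucleus sampling (illustrative value)
            max_new_tokens=40,
            num_return_sequences=n,
            pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no pad token by default
        )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

if __name__ == "__main__":
    for q in generate_synthetic_questions():
        print(q)
```

The prompt and decoding choices here are placeholders for illustration; in practice, generated candidates would still need to be filtered and labeled before being used for training.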
The project is the result of a collaboration between McCormick’s computer science department, the Weinberg College of Arts and Sciences’ department of statistics, and the Allen Institute for AI. The lead student on the project was Yiben Yang, a PhD student in statistics who has graduated.
On a website the team has made available online, visitors can view training examples G-DAUGc produced for a commonsense dataset called Winogrande, where the goal is to choose the word that best fits in the blank. The researchers also mixed in some of the original Winogrande training examples for comparison. Visitors can quiz themselves on the questions and on whether each one was generated by humans or by G-DAUGc.
A common approach to evaluating commonsense reasoning in language-processing systems is to test them on a large set of natural-language questions. An example is: “The bowling ball crashed into the table and it broke. What broke?”
Evaluations are based on how accurately the systems answer the questions and on how many hand-authored training questions are required to reach that accuracy.
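To make this concrete, one generic way a system can answer a pronoun question like the one above is to substitute each candidate answer into the sentence and ask a language model which version it finds more plausible. The sketch below illustrates that scoring idea with GPT-2 from the Hugging Face transformers library; it is a simplified assumption-based example, not the specific models or evaluation protocol used in the paper.

```python
# Illustrative scoring of a commonsense question by comparing language-model
# likelihoods of the two candidate resolutions. Lower loss = more plausible.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_loss(text):
    """Average token-level cross-entropy of the sentence under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

candidates = {
    "the table": "The bowling ball crashed into the table and the table broke.",
    "the bowling ball": "The bowling ball crashed into the table and the bowling ball broke.",
}

# Pick the candidate whose substituted sentence the model finds most plausible.
answer = min(candidates, key=lambda c: sentence_loss(candidates[c]))
print("Model's answer:", answer)
```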
Data sets can include upwards of 50,000 questions, which are time-consuming and expensive to write. The data sets also contain subtle irregularities, arising from how the questions are written, that allow systems to score well even when they lack the commonsense reasoning capability the questions are designed to test.
Downey and his colleagues created a way to tackle those issues.
“What we found is that we can help address both of these problems by automatically generating large data sets of questions, and using those to train systems,” Downey said. “This made the systems more accurate and robust in our experiments.”
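In code, the training step Downey describes could look roughly like the sketch below: keep the most promising generated questions and train on them together with the original hand-authored set. The score field and keep_fraction threshold here are hypothetical placeholders for illustration, not details taken from the paper.

```python
# Rough sketch of mixing generated questions with hand-authored ones.
# The "score" field and keep_fraction cutoff are hypothetical placeholders.

from typing import Dict, List

def build_training_set(
    original: List[Dict],
    synthetic: List[Dict],
    keep_fraction: float = 0.5,
) -> List[Dict]:
    """Keep the top-scoring synthetic examples and append them to the originals."""
    ranked = sorted(synthetic, key=lambda ex: ex["score"], reverse=True)
    kept = ranked[: int(len(ranked) * keep_fraction)]
    # Drop the score field so the combined set has a uniform schema.
    cleaned = [{"question": ex["question"], "answer": ex["answer"]} for ex in kept]
    return original + cleaned

# Toy usage: two generated candidates, one of which survives the filter.
original = [
    {"question": "The trophy didn't fit in the suitcase because it was too big. What was too big?",
     "answer": "the trophy"},
]
synthetic = [
    {"question": "The nail wouldn't go into the wall because it was too hard. What was too hard?",
     "answer": "the wall", "score": 0.9},
    {"question": "The cup fell off the shelf because it was round. What was round?",
     "answer": "the cup", "score": 0.2},
]
print(build_training_set(original, synthetic))
```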
The paper “Generative Data Augmentation for Commonsense Reasoning” appeared in Findings of Empirical Methods in Natural Language Processing (EMNLP).