May 13, 2022
11:00am - 12:00pm
Donald Bren Hall 6011
Mapping and Generating Datasets for Robust Generalization
As large language models continue to dominate the field of Natural Language Processing, there is a rush towards investment in scale. However, for datasets, the bedrock of NLP, is scale indeed the panacea it is purported to be? I will address this question by presenting data maps: two-dimensional representations of datasets obtained through a model's training dynamics, i.e., how the model's confidence evolves over the course of training. These insights into the data landscape will lead to a novel data collection strategy, involving both humans and generative models, which prioritizes instances from specific regions of the data map. I will showcase a new benchmark for the natural language inference task, WaNLI, which challenges state-of-the-art models and leads to better generalization. Overall, I will argue for a renewed emphasis on data quality over scale, which could bolster successes in this new era of NLP.
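The data-map coordinates mentioned above are typically two statistics of each example's training dynamics: confidence (the mean probability the model assigns to the gold label across training epochs) and variability (the standard deviation of that probability). Below is a minimal illustrative sketch of computing these two statistics; the function name and the toy numbers are invented for illustration, not taken from the talk:

```python
from statistics import mean, pstdev

def data_map_coordinates(gold_probs):
    """Compute per-example data-map coordinates from training dynamics.

    gold_probs[e][i] is the probability the model assigned to the gold
    label of example i at epoch e.  Returns, for each example, a pair
    (confidence, variability): the mean and the standard deviation of
    the gold-label probability across epochs.
    """
    num_examples = len(gold_probs[0])
    coords = []
    for i in range(num_examples):
        per_epoch = [epoch[i] for epoch in gold_probs]
        coords.append((mean(per_epoch), pstdev(per_epoch)))
    return coords

# Toy training dynamics: three examples tracked over four epochs.
probs = [
    [0.90, 0.20, 0.30],
    [0.95, 0.30, 0.60],
    [0.97, 0.25, 0.20],
    [0.99, 0.20, 0.70],
]
coords = data_map_coordinates(probs)
# Regions of the map (informally):
#   high confidence, low variability -> easy-to-learn examples;
#   low confidence,  low variability -> hard-to-learn examples;
#   high variability                 -> ambiguous examples.
```

In this sketch, the first example (consistently high gold-label probability) lands in the easy-to-learn region, while the third (fluctuating probability) shows the high variability characteristic of the ambiguous region that targeted data collection can prioritize.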
Swabha Swayamdipta is the soon-to-be Gabilan Assistant Professor of Computer Science at the University of Southern California, and a postdoctoral researcher at the Allen Institute for AI. Her research interests are in natural language processing, with a focus on studying data distributions to uncover and address spurious biases and annotation artifacts, towards improving robustness and generalization. Swabha received her PhD from Carnegie Mellon University and her master's degree from Columbia University. Her work has received an outstanding paper award at NeurIPS 2021 and an honorable mention for best paper at ACL 2020.