May 20, 2022
11:00am - 12:00pm
Donald Bren Hall 6011
Towards automated data engineering
The explosion of data has created significant opportunities for data-driven discoveries in science and business alike, which in turn has given rise to a plethora of automated data analytics and machine learning tools that aim to “democratize data”. However, users today still struggle with the data engineering tasks needed to prepare their diverse datasets for downstream ML/analytics, including data cleaning, transformation, and reshaping. It is widely known that expert users such as data scientists spend up to 80% of their time on data engineering tasks, while non-expert users still lack the technical skills required to make productive use of their data. We ask whether this pain point of data engineering can be automated away by leveraging the rich software artefacts that expert users such as professional developers and data scientists have built over the decades (e.g., millions of code repositories and Jupyter notebooks from places like GitHub). Just as search engines (Google or Bing) leverage user behavior in click-through logs to learn-to-rank documents, we show how the “collective wisdom” of experts embodied in these software artefacts can be leveraged to learn-to-automate data engineering tasks; some of this work is already being integrated into commercial systems. In this talk, I will describe our progress in this direction, the challenges we encountered, and possible future directions.
Yeye He is a Senior Principal Researcher at Microsoft Research Redmond. Currently, his research focuses on automating data science and engineering tasks for both expert and non-expert users. His research is being used in commercial systems such as Microsoft Power BI, Azure Machine Learning Data Prep, and Excel.