CIKM 2025 Tutorial – Transforming Data Mining in the GenAI Era
Generative models—Large Language Models (LLMs), Diffusion Models, and Generative Adversarial Networks (GANs)—have revolutionized synthetic-data creation, offering scalable solutions to data scarcity, privacy, and annotation challenges in data mining.
This tutorial surveys foundations and state-of-the-art advances in synthetic-data generation, details practical frameworks, and reviews evaluation strategies and real-world applications. Attendees will gain actionable insights into leveraging synthetic data for research and industrial pipelines.
Targeted at researchers and practitioners in data mining, machine learning, NLP, vision, and multimodal analytics, the session equips participants to harness synthetic data effectively while understanding its opportunities and challenges.
Eight modules spanning foundations, practice, and future directions
Defining synthetic data and motivating factors: data scarcity, privacy, annotation cost, low-resource settings.
Comparative deep-dive into GANs, Diffusion Models, and LLMs: architectures, strengths, and limitations.
Hands-on tour of frameworks such as MagPie, DataGen, and DyVal for text, and TaskMeAnything and AutoBench-V for multimodal data.
Fidelity, diversity, controllability, and downstream-utility metrics; open challenges in bias and generalization.
Use cases across text, tabular, graph, sequential, and multimodal data; boosting model robustness and privacy.
Case studies in healthcare, finance, and education demonstrating privacy-preserving synthetic data pipelines.
Live demo notebook—generating synthetic text, tabular, graph, sequential, and visual data.
Model-collapse risks, integrating traditional augmentation with GenAI, and open research questions (Q&A).
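To make the evaluation module concrete, the sketch below illustrates two of the metric families named above (fidelity and diversity) on toy tabular data. This is an illustrative example, not code from the tutorial notebook: the Gaussian "generator", the per-feature 1-Wasserstein fidelity score, and the pairwise-distance diversity score are all simplifying assumptions chosen to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: two correlated features, a stand-in for a private table.
real = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.6], [0.6, 2.0]], size=1000)

# A deliberately simple generator: fit a Gaussian to the real data and sample.
mean, cov = real.mean(axis=0), np.cov(real, rowvar=False)
synth = rng.multivariate_normal(mean, cov, size=1000)

def fidelity(a, b):
    """Per-feature 1-Wasserstein distance between equal-sized samples.

    Lower values mean the synthetic marginals track the real ones more closely.
    """
    return np.mean(np.abs(np.sort(a, axis=0) - np.sort(b, axis=0)), axis=0)

def diversity(x, n_pairs=2000, rng=rng):
    """Mean Euclidean distance between randomly chosen pairs of rows.

    A synthetic set that collapses to a few modes scores much lower than the
    real data it imitates.
    """
    i = rng.integers(0, len(x), size=n_pairs)
    j = rng.integers(0, len(x), size=n_pairs)
    return np.linalg.norm(x[i] - x[j], axis=1).mean()

print("fidelity per feature:", fidelity(real, synth))
print("diversity (real / synth):", diversity(real), diversity(synth))
```

Downstream utility, the third metric family, would additionally compare a model trained on `synth` against one trained on `real` on a held-out task, which is beyond this toy sketch.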
Ph.D. Student – Arizona State University
Dawei Li is a Ph.D. student in Computer Science at Arizona State University. Previously, he obtained his bachelor's degree in Computer Science from Beijing Language and Culture University and his master's degree in Data Science from the University of California, San Diego. His research focuses on the techniques and risks of AI oversight. Dawei has published papers and served as a reviewer at top NLP and data-mining venues, including ACL, EMNLP, NAACL, TKDD, PAKDD, and SIGKDD Explorations.
Ph.D. Student – University of Notre Dame
Yue Huang is a Ph.D. student in Computer Science and Engineering at the University of Notre Dame. He earned his B.S. in Computer Science from Sichuan University. His research investigates the trustworthiness and social responsibility of foundation models. Yue has published extensively at premier venues including NeurIPS, ICLR, ICML, ACL, EMNLP, NAACL, CVPR, and IJCAI. His work has been highlighted by the U.S. Department of Homeland Security and recognized with the Microsoft Accelerating Foundation Models Research Award and the KAUST AI Rising Star Award (2025).
Ph.D. Student – University of Maryland
Ming Li is a Ph.D. student in Computer Science at the University of Maryland. Previously, he obtained his bachelor's degree in Computer Science from Xi'an Jiaotong University and his master's degree in Computer Science from Texas A&M University. His research focuses on post-training for foundation models and responsible, self-evolving AI. Ming has published papers and served as a reviewer at top NLP and machine-learning venues, including ACL, EMNLP, ICLR, and NAACL.
Assistant Professor – University of Maryland
Tianyi Zhou is a tenure-track assistant professor of Computer Science at the University of Maryland, College Park (UMD). He received his Ph.D. from the University of Washington and worked as a research scientist at Google before joining UMD. His research interests are machine learning, natural language processing, and multimodal generative AI. His team has published more than 130 papers in ML (NeurIPS, ICML, ICLR), NLP (ACL, EMNLP, NAACL), and CV (CVPR, ICCV, ECCV) venues, as well as journals such as IEEE TPAMI, TIP, TNNLS, and TKDE, with more than 10,000 citations. He received the Best Student Paper Award at ICDM 2013 and has served as an area chair of ICLR, NeurIPS, ACL, EMNLP, SIGKDD, AAAI, IJCAI, WACV, and others.
Leonard C. Bettex Collegiate Professor – University of Notre Dame
Xiangliang Zhang is a Leonard C. Bettex Collegiate Professor in the Department of Computer Science and Engineering, University of Notre Dame. She was previously an Associate Professor in Computer Science at the King Abdullah University of Science and Technology (KAUST), Saudi Arabia. She received her Ph.D. degree in computer science from INRIA–Université Paris-Sud, France, in 2010. Her main research interests are in machine learning and data mining. She has published more than 270 refereed papers in leading international conferences and journals. She serves as an associate editor of IEEE Transactions on Dependable and Secure Computing, Information Sciences, and the International Journal of Intelligent Systems, and regularly serves as an area chair or on the (senior) program committee of IJCAI, SIGKDD, NeurIPS, AAAI, ICML, and WSDM.
Regents Professor – Arizona State University
Huan Liu is a Regents Professor in the School of Computing and Augmented Intelligence at Arizona State University. He received his Ph.D. degree in Computer Science from the University of Southern California in 1989. His research focuses on developing computational methods for data mining, machine learning, and social computing. Dr. Liu has been honored with numerous prestigious awards, including the ACM SIGKDD Innovation Award (2022) for his pioneering work in feature selection and social media mining, and election as a Fellow of the ACM (2018), AAAI (2019), AAAS (2018), and IEEE (2012). He is Chief Editor of ACM TIST, Frontiers in Big Data, and DMM, and has been actively involved on editorial boards and program committees for major conferences such as KDD, ICML, NeurIPS, AAAI, and IJCAI.