CIKM 2025 Tutorial – Transforming Data Mining in the GenAI Era
Generative models—Large Language Models (LLMs), Diffusion Models, and Generative Adversarial Networks (GANs)—have revolutionized synthetic-data creation, offering scalable solutions to data scarcity, privacy, and annotation challenges in data mining.
This tutorial surveys foundations and state-of-the-art advances in synthetic-data generation, details practical frameworks, and reviews evaluation strategies and real-world applications. Attendees will gain actionable insights into leveraging synthetic data for research and industrial pipelines.
Targeted at researchers and practitioners in data mining, machine learning, NLP, vision, and multimodal analytics, the session equips participants to harness synthetic data effectively while understanding its opportunities and challenges.
Eight modules spanning foundations, practice, and future directions
Defining synthetic data and motivating factors: data scarcity, privacy, annotation cost, low-resource settings.
Comparative deep-dive into GANs, Diffusion Models, and LLMs: architectures, strengths, and limitations.
Hands-on tour of frameworks such as MagPie, DataGen, and DyVal for text, and TaskMeAnything and AutoBench-V for multimodal data.
Fidelity, diversity, controllability, and downstream-utility metrics; open challenges in bias and generalization.
Use cases across text, tabular, graph, sequential, and multimodal data; boosting model robustness and privacy.
Case studies in healthcare, finance, and education demonstrating privacy-preserving synthetic data pipelines.
Live demo notebook—generating synthetic text, tabular, graph, sequential, and visual data.
Model-collapse risks, integrating traditional augmentation with GenAI, and open research questions (Q&A).
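To make the evaluation module concrete, the sketch below illustrates two of the metric families named above (fidelity and diversity) on toy tabular data. This is an illustrative example, not code from the tutorial notebook: the Gaussian "generator", the per-feature 1-Wasserstein fidelity score, and the pairwise-distance diversity score are all simplifying assumptions chosen to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: two correlated features, a stand-in for a private table.
real = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.6], [0.6, 2.0]], size=1000)

# A deliberately simple generator: fit a Gaussian to the real data and sample.
mean, cov = real.mean(axis=0), np.cov(real, rowvar=False)
synth = rng.multivariate_normal(mean, cov, size=1000)

def fidelity(a, b):
    """Per-feature 1-Wasserstein distance between equal-sized samples.

    Lower values mean the synthetic marginals track the real ones more closely.
    """
    return np.mean(np.abs(np.sort(a, axis=0) - np.sort(b, axis=0)), axis=0)

def diversity(x, n_pairs=2000, rng=rng):
    """Mean Euclidean distance between randomly chosen pairs of rows.

    A synthetic set that collapses to a few modes scores much lower than the
    real data it imitates.
    """
    i = rng.integers(0, len(x), size=n_pairs)
    j = rng.integers(0, len(x), size=n_pairs)
    return np.linalg.norm(x[i] - x[j], axis=1).mean()

print("fidelity per feature:", fidelity(real, synth))
print("diversity (real / synth):", diversity(real), diversity(synth))
```

Downstream utility, the third metric family, would additionally compare a model trained on `synth` against one trained on `real` on a held-out task, which is beyond this toy sketch.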
Ph.D. Student – Arizona State University
Dawei Li is a Ph.D. student in Computer Science at Arizona State University. Previously, he obtained his bachelor's degree in Computer Science from Beijing Language and Culture University and his master's degree in Data Science from the University of California, San Diego. His research focuses on the techniques and risks of AI oversight. Dawei has published papers and served as a reviewer at top NLP and data-mining venues, including ACL, EMNLP, NAACL, TKDD, PAKDD, and SIGKDD Explorations.
Ph.D. Student – University of Notre Dame
Yue Huang is a Ph.D. student in Computer Science and Engineering at the University of Notre Dame. He earned his B.S. in Computer Science from Sichuan University. His research investigates the trustworthiness and social responsibility of foundation models. Yue has published extensively at premier venues including NeurIPS, ICLR, ICML, ACL, EMNLP, NAACL, CVPR, and IJCAI. His work has been highlighted by the U.S. Department of Homeland Security and recognized with the Microsoft Accelerating Foundation Models Research Award and the KAUST AI Rising Star Award (2025).
Ph.D. Student – University of Maryland
Ming Li is a Ph.D. student in Computer Science at the University of Maryland. Previously, he obtained his bachelor's degree in Computer Science from Xi'an Jiaotong University and his master's degree in Computer Science from Texas A&M University. His research focuses on post-training for foundation models and responsible, self-evolving AI. Ming has published papers and served as a reviewer at top NLP and machine-learning venues, including ACL, EMNLP, ICLR, and NAACL.
Assistant Professor – University of Maryland
Tianyi Zhou is a tenure-track assistant professor of Computer Science at the University of Maryland, College Park (UMD). He received his Ph.D. from the University of Washington and worked as a research scientist at Google before joining UMD. His research interests are machine learning, natural language processing, and multimodal generative AI. His team has published more than 130 papers in ML (NeurIPS, ICML, ICLR), NLP (ACL, EMNLP, NAACL), and CV (CVPR, ICCV, ECCV) venues, as well as journals such as IEEE TPAMI, TIP, TNNLS, and TKDE, with more than 10,000 citations. He received the Best Student Paper Award at ICDM 2013 and has served as an area chair of ICLR, NeurIPS, ACL, EMNLP, SIGKDD, AAAI, IJCAI, WACV, and others.
Leonard C. Bettex Collegiate Professor – University of Notre Dame
Xiangliang Zhang is a Leonard C. Bettex Collegiate Professor in the Department of Computer Science and Engineering, University of Notre Dame. She was previously an Associate Professor in Computer Science at the King Abdullah University of Science and Technology (KAUST), Saudi Arabia. She received her Ph.D. degree in computer science from INRIA–Université Paris-Sud, France, in 2010. Her main research interests are in machine learning and data mining. She has published more than 270 refereed papers in leading international conferences and journals. She serves as an associate editor of IEEE Transactions on Dependable and Secure Computing, Information Sciences, and the International Journal of Intelligent Systems, and regularly serves as an area chair or on the (senior) program committee of IJCAI, SIGKDD, NeurIPS, AAAI, ICML, and WSDM.
Regents Professor – Arizona State University
Huan Liu is a Regents Professor in the School of Computing and Augmented Intelligence at Arizona State University. He received his Ph.D. degree in Computer Science from the University of Southern California in 1989. His research focuses on developing computational methods for data mining, machine learning, and social computing. Dr. Liu has been honored with numerous prestigious awards, including the ACM SIGKDD Innovation Award (2022) for his pioneering work in feature selection and social media mining, and election as a Fellow of the ACM (2018), AAAI (2019), AAAS (2018), and IEEE (2012). He is Chief Editor of ACM TIST, Frontiers in Big Data, and DMM, and has been actively involved on editorial boards and program committees for major conferences such as KDD, ICML, NeurIPS, AAAI, and IJCAI.