Carnegie Mellon University

[Image: a knolling-style flat-lay of a deconstructed hard drive surrounded by magnetic tape, punch cards, memory chips and printed data sheets, several of which are circled in red and connected by red threads to suggest data leakage.]

May 21, 2025

CMU's "Tartan Federer" Team Sweeps All Four Tracks at International AI Privacy Challenge

By Josh Quicksall

Aaron Aupperlee
  • Senior Director of Media Relations, SCS

Carnegie Mellon University's "Tartan Federer" team, led by S3D Assistant Professor Zhiwei Steven Wu, swept all four tracks of the Vector Institute MIDST Challenge at SaTML 2025. The team's membership inference attacks revealed significant privacy vulnerabilities in diffusion models used to generate synthetic tabular data, with implications for industries that rely on AI-generated data to protect sensitive information.

Clean Sweep at International Competition

The Tartan Federer team, made up of members from Carnegie Mellon University's Software and Societal Systems Department (S3D), dominated the MIDST Challenge (Membership Inference over Diffusion-models-based Synthetic Tabular data) at the 3rd IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). The competition ran for two months leading up to the SaTML conference, held April 9–11, 2025, at the University of Copenhagen in Denmark, and drew 71 competing teams from around the world.

Led by S3D faculty member Zhiwei Steven Wu, the team included Xiaoyu (Nicholas) Wu, a visiting master's student from Shanghai Jiao Tong University; Yifei Pang, a research intern at CMU pursuing an MS in Information Security; and Terrance Liu, a PhD student in the Machine Learning Department at CMU.

The team secured first place across all four competition tracks. This complete sweep demonstrates the versatility and effectiveness of their approach across different scenarios and constraints.

The MIDST Challenge specifically focused on testing the privacy vulnerabilities of diffusion models – a class of AI systems that have recently gained popularity for generating synthetic data. Diffusion models work by gradually adding noise to data and then learning to reverse this process, allowing them to create realistic-looking artificial data points.
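
For readers who want a concrete picture, the sketch below (in PyTorch, with an illustrative linear noise schedule and made-up row dimensions – not the challenge models or code) shows the forward "noising" step that a diffusion model learns to reverse:

    # Illustrative forward (noising) step of a Gaussian diffusion model on tabular rows.
    # The schedule and shapes are assumptions for this sketch, not the MIDST models.
    import torch

    T = 1000                                      # number of diffusion time steps
    betas = torch.linspace(1e-4, 0.02, T)         # a common linear noise schedule
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    def add_noise(x0, t):
        """Sample x_t ~ q(x_t | x_0): blend clean rows x0 with Gaussian noise."""
        noise = torch.randn_like(x0)
        a = alphas_cumprod[t].sqrt().unsqueeze(-1)          # weight on the clean row
        b = (1.0 - alphas_cumprod[t]).sqrt().unsqueeze(-1)  # weight on the noise
        return a * x0 + b * noise, noise

    x0 = torch.randn(8, 5)          # 8 numeric "rows" with 5 columns, stand-ins for table records
    t = torch.randint(0, T, (8,))   # a random time step per row
    x_t, eps = add_noise(x0, t)
    # Training teaches a network to predict `eps` from (x_t, t); generation then
    # runs that learned denoiser in reverse, starting from pure noise.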

The Vector Institute, a leading AI research organization, designed the challenge to evaluate how vulnerable these systems are to membership inference attacks when used to generate synthetic tabular data. Tabular data – information organized in rows and columns like spreadsheets or databases – is particularly important in fields like healthcare and finance, where privacy concerns are paramount.

The competition was structured around four distinct tracks, testing different aspects of privacy vulnerabilities:

  • White-box tracks gave competitors full access to model parameters and architecture
  • Black-box tracks limited access to only model outputs
  • Single-table tracks focused on standard tabular data (like a single database table)
  • Multi-table tracks addressed more complex relational data across multiple tables

Participants were evaluated on the true positive rate their attacks achieved at a 10% false positive rate – that is, how much of the training data they could correctly identify while falsely flagging no more than 10% of non-member records.
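
Concretely, the metric can be computed roughly as follows (a sketch assuming each attack outputs a membership score per record and ground-truth labels are available for scoring; this is not the official evaluation script):

    # Sketch of the metric: true positive rate at a fixed 10% false positive rate.
    # Labels and scores here are synthetic; a real attack supplies `attack_scores`.
    import numpy as np
    from sklearn.metrics import roc_curve

    def tpr_at_fpr(is_member, attack_scores, target_fpr=0.10):
        """is_member: 1 if a record was in the training data, else 0.
        attack_scores: higher means the attack is more confident the record is a member."""
        fpr, tpr, _ = roc_curve(is_member, attack_scores)
        return float(np.interp(target_fpr, fpr, tpr))

    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=1000)
    scores = rng.random(1000)                                   # an uninformative attack
    print(f"TPR @ 10% FPR: {tpr_at_fpr(labels, scores):.3f}")   # near 0.10 for random guessing

A score near 0.10 corresponds to random guessing; higher values mean the attack is genuinely recovering membership in the training set.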

Technical Innovation

The Tartan Federer team's success stemmed from their novel approach to membership inference attacks tailored specifically for diffusion models applied to tabular data. Their research revealed that existing methods designed for image-based diffusion models performed poorly in the tabular domain, necessitating a fresh approach.

"Membership inference turned out to be much trickier on tabular diffusion models than on vision models," said Wu. "But the students on the Tartan Federer team came up with a clever approach – they used loss features across different noise levels and time steps, ran them through a lightweight neural net, and were able to tease out strong membership signals without a lot of manual tuning."
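
The quote only outlines the idea, but the general shape of such an attack might look like the following (a hypothetical illustration – the model interface, time steps, and classifier architecture are assumptions, not the team's published implementation):

    # Hypothetical sketch of the idea in the quote: collect per-record denoising losses
    # at several noise levels / time steps and let a small network score membership.
    # `eps_model` stands in for the target diffusion model's noise predictor, and
    # `alphas_cumprod` for its cumulative noise schedule (as in the earlier sketch).
    import torch
    import torch.nn as nn

    TIMESTEPS = [10, 50, 100, 250, 500, 750, 900]    # illustrative noise levels

    def loss_features(eps_model, x0, alphas_cumprod, timesteps=TIMESTEPS):
        """Per-row denoising error at each chosen time step, stacked into a feature vector."""
        feats = []
        for t in timesteps:
            noise = torch.randn_like(x0)
            a = alphas_cumprod[t].sqrt()
            b = (1.0 - alphas_cumprod[t]).sqrt()
            x_t = a * x0 + b * noise
            pred = eps_model(x_t, torch.full((x0.shape[0],), t))
            feats.append(((pred - noise) ** 2).mean(dim=1))    # per-row loss at this t
        return torch.stack(feats, dim=1)                       # shape: (rows, len(timesteps))

    # A lightweight attack classifier trained on those features (members vs. non-members).
    attack_net = nn.Sequential(nn.Linear(len(TIMESTEPS), 32), nn.ReLU(), nn.Linear(32, 1))
    # membership_score = torch.sigmoid(attack_net(loss_features(eps_model, rows, alphas_cumprod)))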

This approach proved effective across both black-box and white-box settings, as well as in both single-table and multi-table scenarios. Its versatility allowed the team to outperform each track's runners-up, including Yan Pang (White-box Single Table), CITADEL & UQAM (Black-box Single Table), and Cyber@BGU (Black-box Multi-table).

Particularly notable was their performance in the White-box Multi-table track, where they were the only team to beat random guessing, achieving a 35% true positive rate at a 10% false positive rate (random guessing would yield only about 10%).

S3D Connection and Broader Impact

This achievement exemplifies S3D's mission of understanding and improving how computational technologies can better serve societies and communities. The department's interdisciplinary approach, combining rigorous computer science with insights from social sciences and policy studies, provided the ideal environment for this research at the intersection of privacy, security, and responsible AI development.

Wu's leadership in this area aligns perfectly with S3D's research focus on societal computing, particularly privacy and security concerns in modern AI systems. The team's work demonstrates how technical innovation can address real-world challenges in responsible technology development.

The findings have significant implications for organizations working with sensitive data. As diffusion models become increasingly popular for generating synthetic data, understanding their privacy vulnerabilities becomes crucial. The team's work provides valuable insights for developing more robust privacy-preserving techniques in synthetic data generation.

"As organizations increasingly turn to AI-generated data to protect privacy, we need to ensure these approaches actually deliver the privacy benefits they promise."

Publication and Future Directions

The team published their findings on March 15, 2025, in a paper titled "Winning the MIDST Challenge: New Membership Inference Attacks on Diffusion Models for Tabular Data Synthesis" (arXiv: 2503.12008), which was accepted at Theory and Practice of Differential Privacy 2025 (TPDP 2025). They also made their code publicly available on GitHub. This commitment to open research exemplifies S3D's approach to advancing knowledge in responsible computing.

Looking ahead, this work opens several promising research directions, including developing more robust defenses against membership inference attacks and creating diffusion models with improved privacy guarantees. The team's insights will likely influence how privacy-preserving synthetic data is generated and evaluated across industries handling sensitive information.

For S3D, this achievement further establishes the department's leadership in addressing critical challenges at the intersection of technology and society. By developing methods to identify and mitigate privacy risks in advanced AI systems, the department continues its mission of creating computational technologies that responsibly serve communities and society at large.


For more information about the MIDST Challenge, visit the Vector Institute's website at vectorinstitute.github.io/MIDST. To learn more about Wu's research, visit zstevenwu.com.