Scale-up unlearnable examples learning with high-performance computing
Author(s)
Yanfan Zhu
Isaac Lyngaas
Murali Meena
Mary Ellen Koran
Bradley Malin
Daniel Moyer
Shunxing Bao
Anuj Kapadia
Bennett Landman
Xiao Wang
Yuankai Huo
Abstract
Recent AI models, such as ChatGPT, are structured to retain user interactions, which may inadvertently include sensitive healthcare data. In the healthcare field, particularly when radiologists use AI-driven diagnostic tools hosted on online platforms, there is a risk that medical imaging data may be repurposed for future AI training without explicit consent, spotlighting critical privacy and intellectual property concerns around healthcare data usage. Addressing these privacy challenges, a novel approach known as Unlearnable Examples (UEs) has been introduced, aiming to make data unlearnable to deep learning models. A prominent method within this area, called Unlearnable Clusters (UCs), has shown improved UE performance but was previously limited by computational resources (e.g., a single workstation), which restricted the exploration of larger-scale experiments.
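To illustrate the underlying idea only (this is not the Unlearnable Clusters algorithm itself), the sketch below adds a bounded, cluster-wise perturbation to images so that a model trained on the perturbed data tends to fit the shared noise pattern rather than the true image content. The clustering assignment, noise bank, and `epsilon` budget are assumptions introduced here for illustration, not code from the paper.

```python
# Illustrative sketch only: apply a fixed, per-cluster additive perturbation so that
# downstream training latches onto the noise instead of the real content. The cluster
# assignments, noise bank, and epsilon budget are placeholder assumptions.
import torch

def apply_cluster_noise(images: torch.Tensor,
                        cluster_ids: torch.Tensor,
                        noise_bank: torch.Tensor,
                        epsilon: float = 8 / 255) -> torch.Tensor:
    """Add a bounded, cluster-wise perturbation to a batch of images.

    images:      (N, C, H, W) tensor with pixel values in [0, 1]
    cluster_ids: (N,) integer cluster assignment for each image
    noise_bank:  (K, C, H, W) tensor holding one perturbation per cluster
    """
    delta = noise_bank[cluster_ids]          # look up each image's cluster perturbation
    delta = delta.clamp(-epsilon, epsilon)   # enforce an L-infinity noise budget
    return (images + delta).clamp(0.0, 1.0)  # keep pixels in the valid range
```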
To push the boundaries of UE performance with theoretically vast resources, we scaled up UC learning across various datasets using Distributed Data Parallel (DDP) training on the Summit supercomputer. Our goal was to examine UE efficacy at high-performance computing (HPC) scale to prevent unauthorized learning and enhance data security, focusing in particular on how smaller batch sizes increase unlearnability, reflected by reduced model accuracy.
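For concreteness, a minimal sketch of a PyTorch DDP training loop of the kind described above is given below. It assumes a standard `torchrun`-style launcher that sets `LOCAL_RANK` and the rendezvous environment variables; the model, dataset, and per-GPU batch size are placeholders rather than the exact configuration used on Summit.

```python
# Minimal DDP training sketch: one process per GPU, data sharded across ranks,
# gradients synchronized automatically by DistributedDataParallel.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main(per_gpu_batch_size: int = 32):
    dist.init_process_group(backend="nccl")           # launcher supplies rank/world size
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(3 * 224 * 224, 10).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(                                      # placeholder data
        torch.randn(1024, 3 * 224 * 224), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)                         # shards data per rank
    loader = DataLoader(dataset, batch_size=per_gpu_batch_size, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                                  # reshuffle across ranks
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()                                       # DDP all-reduces gradients
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```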
Through the computational power of Summit, extensive testing was conducted on diverse datasets such as Pets, MedMNIST, Flowers, and Flowers102. Results showed that excessively large or small batch sizes led to unstable training and affected accuracy. Smaller batch sizes on specific datasets, such as PathMNIST, BloodMNIST, and Flowers102, generally correlated with lower accuracy, a desired outcome for unlearnable data, shielding it from inference attacks. However, the optimal batch size for unlearnability varied across datasets, underscoring the need for dataset-specific computational scaling for effective data protection. Summit's high-performance GPUs, paired with DDP efficiency, enabled rapid parameter updates and consistent training across nodes, which was essential for determining batch sizes that maximize unlearnability without sacrificing computational efficiency. These findings highlight the importance of selecting batch sizes tailored to dataset characteristics to prevent unauthorized model learning and ensure data security in deep learning applications. The source code for this study is publicly accessible at https://github.com/hrlblab/UE_HPC.
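The following is a minimal sketch of the kind of batch-size sweep these experiments imply, assuming a hypothetical `train_and_evaluate` helper that wraps a full DDP training run and returns downstream test accuracy (lower accuracy indicating stronger unlearnability). The batch sizes listed are examples, not the grid reported in the paper.

```python
# Sketch of a batch-size sweep: train on the unlearnable data at several per-GPU batch
# sizes and record test accuracy. `train_and_evaluate` is a hypothetical helper standing
# in for the full distributed training run.
def sweep_batch_sizes(train_and_evaluate, batch_sizes=(16, 32, 64, 128, 256)):
    results = {}
    for bs in batch_sizes:
        results[bs] = train_and_evaluate(per_gpu_batch_size=bs)  # test accuracy in [0, 1]
    # the most "unlearnable" setting is the one yielding the lowest downstream accuracy
    best_bs = min(results, key=results.get)
    return results, best_bs
```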
Date and Location: 2/4/2025 | 11:40 AM - 12:00 PM | Regency A
Primary Session Chair: Yuankai Huo | Vanderbilt University
Paper Number: HPCI-184