Legal Risks and Regulatory Frameworks for Synthetic Data in AI Large-Model Training

Siqi Yuan

doi:10.70711/aitr.v3i3.8037

Abstract

Synthetic data is a critical means of reconciling personal information protection with data utilization in
artificial intelligence (AI) large-model training. However, its generation and application carry complex legal risks, including systemic risks
arising from quality defects, limitations in privacy protection, the reinforcement and amplification of biases, and the potential for misuse.
Addressing these risks requires a multidimensional regulatory framework that encompasses quality standards, algorithmic transparency,
traceability mechanisms, proactive safety protection, and accountability, and that strikes a dynamic balance between technological innovation and risk mitigation.

Keywords

Synthetic Data; Legal Regulation; Data Privacy
