
Audio Palette: A Diffusion Transformer with Multi-Signal Conditioning for Controllable Foley Synthesis

Junnuo Wang

Abstract


Recent advances in diffusion-based generative models have enabled high-quality text-to-audio synthesis, but fine-grained acoustic control remains a significant challenge in open-source research. We present Audio Palette, a diffusion transformer (DiT) based model that extends the Stable Audio Open architecture to address this "control gap" in controllable audio generation. Unlike prior approaches that rely solely on semantic conditioning, Audio Palette introduces four time-varying control signals (loudness, pitch, spectral centroid, and timbre) for precise and interpretable manipulation of acoustic features. The model is efficiently adapted for the nuanced domain of Foley synthesis using Low-Rank Adaptation (LoRA) on a curated subset of AudioSet, requiring only 0.85% of the original parameters to be trained. Experiments demonstrate that Audio Palette achieves fine-grained, interpretable control of sound attributes. Crucially, it accomplishes this controllability while maintaining high audio quality and strong semantic alignment to text prompts, with performance on standard metrics such as Fréchet Audio Distance (FAD) and LAION-CLAP scores remaining comparable to the original baseline model. We provide a scalable, modular pipeline for audio research, emphasizing sequence-based conditioning, memory efficiency, and a novel three-scale classifier-free guidance mechanism for nuanced inference-time control. This work establishes a robust foundation for controllable sound design and performative audio synthesis in open-source settings, enabling a more artist-centric workflow.
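
To give a concrete picture of the time-varying conditioning described above, the sketch below shows one way the four control signals could be extracted from a reference recording with librosa. The abstract names CREPE [5] for pitch; here librosa's pyin stands in for it, and frame-level MFCCs stand in for the timbre descriptor, so the hop size, feature choices, and function names are illustrative assumptions rather than the authors' exact pipeline.

```python
# Hedged sketch: frame-level control signals (loudness, pitch, spectral
# centroid, timbre) from a reference clip. The pyin/MFCC stand-ins and the
# frame parameters are assumptions, not the paper's exact pipeline.
import librosa
import numpy as np

def extract_control_signals(path, sr=44100, hop_length=512):
    y, sr = librosa.load(path, sr=sr, mono=True)

    # Loudness proxy: frame-wise RMS energy, converted to dB.
    rms = librosa.feature.rms(y=y, hop_length=hop_length)[0]
    loudness_db = librosa.amplitude_to_db(rms, ref=np.max)

    # Pitch: probabilistic YIN as a stand-in for CREPE; unvoiced frames -> 0 Hz.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
        sr=sr, hop_length=hop_length)
    f0 = np.where(voiced_flag, f0, 0.0)

    # Spectral centroid: per-frame "brightness" in Hz.
    centroid = librosa.feature.spectral_centroid(
        y=y, sr=sr, hop_length=hop_length)[0]

    # Timbre proxy: low-order MFCCs (stand-in for a learned timbre embedding).
    timbre = librosa.feature.mfcc(y=y, sr=sr, hop_length=hop_length, n_mfcc=13)

    # Truncate to a common frame count so all signals align in time.
    n = min(len(loudness_db), len(f0), len(centroid), timbre.shape[1])
    return {
        "loudness": loudness_db[:n],
        "pitch": f0[:n],
        "spectral_centroid": centroid[:n],
        "timbre": timbre[:, :n],
    }
```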
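The LoRA adaptation [7] that accounts for the 0.85% trainable-parameter figure follows the standard recipe of frozen base weights plus a trainable low-rank update. The minimal sketch below illustrates that idea on a single linear layer; the rank, scaling, and choice of wrapped layer are assumptions for illustration, not the paper's configuration.

```python
# Hedged sketch of a LoRA-wrapped linear layer (after Hu et al. [7]).
# Rank, alpha, and which layers get wrapped are illustrative assumptions.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = W x + b + (alpha/r) * B A x   -- only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Example: freeze one transformer block, then adapt a single projection.
block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
for p in block.parameters():
    p.requires_grad = False
block.linear1 = LoRALinear(block.linear1, rank=8)

trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
total = sum(p.numel() for p in block.parameters())
print(f"trainable fraction: {trainable / total:.2%}")
```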

Keywords


Sound generation; Diffusion model; Transfer learning; Language model; Controllable synthesis; Foley synthesis

References


[1] W. Peebles and S. Xie, "Scalable diffusion models with transformers," arXiv preprint arXiv:2212.09748, 2022.

[2] Stability AI, "Stable Audio Open," 2024. [Online]. Available: https://stability.ai/news/stable-audio-open-research-paper

[3] V. Ament, The Foley Grail: The Art of Performing Sound for Film, Games, and Animation, 3rd ed. Routledge, 2021. doi: 10.4324/9781003008439.

[4] N. Flores Garcia and N. J. Bryan, "Sketch2Sound: Controllable audio generation via time-varying signals and sonic imitations," in Proc. IEEE ICASSP, 2025.

[5] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, "CREPE: A convolutional representation for pitch estimation," in Proc. IEEE ICASSP, pp. 161–165, 2018.

[6] J. F. Gemmeke, D. P. Ellis, D. Freedman, M. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in Proc. IEEE ICASSP, pp. 776–780, 2017.

[7] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-rank adaptation of large language models," arXiv preprint arXiv:2106.09685, 2021.

[8] K. Kilgour, A. D'Gama, M. Sanchez, and B. Styles, "Fréchet audio distance: A metric for evaluating music enhancement algorithms," arXiv preprint arXiv:1812.08466, 2018.

[9] Y. Wu, Z. Chen, D. Liu, G. Liu, A. Pasa, W. Yang, and Y. Wu, "Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation," in Proc. IEEE ICASSP, pp. 1–5, 2023.




DOI: http://dx.doi.org/10.70711/aitr.v3i2.7860
