At Mobile World Congress (MWC) 2025, we reached an exciting milestone: real-time AI-generated audio on smartphones. This breakthrough was made possible by a fruitful collaboration between Stability.AI and Arm, powered by KleidiAI technology. Together, we showcased a live demo that generated 10 seconds of audio in just ~7 seconds using the Stable Audio Open Small model.
Whether you are a content creator, musician, DJ, sound designer, or audio enthusiast, you can now generate sound effects or audio samples in seconds. And the quality? Listen for yourself: stereophonic, crisp, and at a 44.1 kHz sample rate.
Example clips: Seawaves, Soulful Calm Hip Hop, Warm Arpeggios on house beats, Synthwave loop.
This is not just a demo. Everything in the showcase is now available to the public.
With Arm CPUs in 99% of smartphones, we focused on making AI audio generation accessible, efficient, and performant on the most ubiquitous processor ever, using the Arm KleidiAI library.
KleidiAI is already integrated into leading AI runtimes such as ExecuTorch and LiteRT with XNNPack, as well as MNN. For this learning path, we targeted LiteRT as the deployment ML runtime.
Deploying models on mobile devices requires optimizing both execution speed and memory footprint, particularly given the constrained RAM available across a broad spectrum of devices.
To improve inference efficiency, we adopted a combination of dynamic quantization and reduced-precision techniques across the different submodules of the Stable Audio Open model. Specifically, we applied dynamic Int8 quantization to the DiT component and moved the autoencoder to fully FP16 precision. While we initially considered broader application of FP16 and Int8 across the pipeline, this specific configuration provided the best trade-off between performance gains and output quality, without the need for quantization-aware training (QAT).
Dynamic Int8 quantization applies static quantization to weights and dynamic, runtime-based quantization to FP32 activations, based on the statistical distribution of their values. This method is particularly effective for the DiT linear layers, as it reduces memory and computation overhead while maintaining quality. Moreover, converting the autoencoder to FP16 further reduces memory usage and speeds up inference, with minimal audio quality loss.
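To make this concrete, the sketch below shows the generic LiteRT (TensorFlow Lite) converter settings that correspond to these two schemes: dynamic-range Int8 quantization for the DiT and FP16 for the autoencoder. It is an illustrative sketch only; it assumes the two submodules have already been exported as TensorFlow SavedModels (the directory and file names are placeholders), and the Learning Path's actual conversion scripts may use a different export route.

```python
# Illustrative sketch only: generic LiteRT (TensorFlow Lite) converter settings
# matching the two quantization schemes described above. Directory and file
# names are placeholders; the Learning Path's conversion scripts may differ.
import tensorflow as tf

# DiT: dynamic-range Int8 quantization (Int8 weights; FP32 activations are
# quantized at runtime from their observed value range).
dit_converter = tf.lite.TFLiteConverter.from_saved_model("dit_saved_model")
dit_converter.optimizations = [tf.lite.Optimize.DEFAULT]
with open("dit_int8_dynamic.tflite", "wb") as f:
    f.write(dit_converter.convert())

# Autoencoder: float16 quantization, storing the weights in FP16 to match the
# fully-FP16 configuration described above, with minimal audio quality loss.
ae_converter = tf.lite.TFLiteConverter.from_saved_model("autoencoder_saved_model")
ae_converter.optimizations = [tf.lite.Optimize.DEFAULT]
ae_converter.target_spec.supported_types = [tf.float16]
with open("autoencoder_fp16.tflite", "wb") as f:
    f.write(ae_converter.convert())
```

Note that with dynamic-range quantization only the weights are stored in Int8; activations remain in FP32 and are quantized on the fly at runtime, which is why no calibration dataset or quantization-aware training is required.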
This is the configuration used in the learning path, as it offers a solid balance between efficiency and perceptual audio quality. With it, you can expect to generate approximately 10 seconds of audio in around 8 seconds on premium phones.
However, as discussed in our paper, more aggressive quantization strategies can yield further performance improvements at the cost of output audio degradation, particularly for audio samples with rich high-frequency content, which are more sensitive to reduced bitwidth. One such example, detailed in the paper but not implemented in the learning path, involves selectively applying dynamic quantization, from FP16 down to Int8, across both the DiT and the autoencoder.
Using this more aggressive configuration, we achieved a reduction in inference time from 15.3s (original FP32 baseline) to 6.6s, along with a decrease in model size from 5.2GB to 2.9GB and a drop in peak runtime RAM usage from 6.5GB to 3.6GB.
The MWC 2025 demo and the learning path were made possible thanks to ExecuTorch, LiteRT, KleidiAI, and XNNPack, which are blazing the trail for accessible, efficient AI on mobile. You can build a mobile music generator, sound effects app, or DJ console on an Arm-powered laptop using our Learning Path.
In our Learning Path, you will learn everything you need to convert the models to LiteRT-compatible formats and run these models on Arm CPUs with the LiteRT runtime through XNNPack and KleidiAI.
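Before wiring everything into the C++ application, it can help to sanity-check a converted model from Python. The snippet below is a minimal sketch and not part of the Learning Path itself: it loads one of the converted .tflite files with the LiteRT interpreter bundled in TensorFlow and runs a single pass on CPU with dummy inputs (the file name, thread count, and input handling are placeholder assumptions). On Arm CPUs, the interpreter's CPU kernels are dispatched through XNNPack, which is where the KleidiAI micro-kernels come into play.

```python
# Minimal sanity-check sketch (not the audiogen app): run one inference pass
# of a converted model on CPU. "dit_int8_dynamic.tflite" and the dummy inputs
# are placeholders for illustration only.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="dit_int8_dynamic.tflite",
                                  num_threads=4)
interpreter.allocate_tensors()

# Fill every input with random data of the expected shape and dtype,
# just to exercise the graph and time a single pass.
for detail in interpreter.get_input_details():
    dummy = np.random.rand(*detail["shape"]).astype(detail["dtype"])
    interpreter.set_tensor(detail["index"], dummy)

interpreter.invoke()

output_detail = interpreter.get_output_details()[0]
output = interpreter.get_tensor(output_detail["index"])
print("output shape:", output.shape, "dtype:", output.dtype)
```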
Specifically, the audiogen application consists of a single file (audiogen.cpp) that includes everything required to run the Stable Audio Open Small pipeline, along with the utility functions needed at each stage.
This is just the beginning. Explore the learning path and start building your own mobile audio applications. We look forward to seeing what you create.
This project was made possible through the collaboration of many brilliant minds at Stability.AI and Arm. In alphabetical order:

From Arm: Adnan AlSinan, Anitha Raj, Aude Vuilliomenet, Evie Wright, Gian Marco Iodice, Michael Kozlov, Nina Drozd, Ronan Naughton, Tobias McBride

From Stability.AI: CJ Carr, Josiah Taylor, Julian Parker, Jordi Pons, Zach Evans, Zack Zukowski, Zachary Novack

To learn more about the collaboration between Stability.ai and Arm, do not forget to read the Stability.ai blog post as well, available at the following link:
Learn more