AudioGPT: AI System Connecting ChatGPT With Audio Foundation Models


The AI community is currently experiencing a profound impact from the emergence of large language models (LLMs). The introduction of advanced models such as ChatGPT and GPT-4 has significantly propelled the progress of natural language processing. Thanks to the immense availability of web-text data and robust architectural frameworks, LLMs can now read, write, and engage in conversations much like humans.

However, despite these remarkable achievements in text processing and generation, LLMs have seen only limited success in incorporating audio modalities such as speech, music, sound, and talking heads, even though audio offers significant advantages in real-world scenarios. In daily conversations, humans primarily rely on spoken language to communicate, and voice assistants have become key tools for convenience in everyday life. Enabling LLMs to understand and produce speech, music, sound, and talking heads is therefore crucial for advancing toward more sophisticated AI systems.

Nevertheless, there are several challenges hindering the training of LLMs in audio processing. Firstly, the availability of authentic spoken conversations from real-world sources is extremely limited. Acquiring human-labeled speech data is a costly and time-consuming endeavor. Moreover, compared to the vast corpora of web-text data, there is a shortage of multilingual conversational speech data. Secondly, training multi-modal LLMs from scratch requires significant computational resources and consumes substantial time.

Researchers from Zhejiang University, Peking University, Carnegie Mellon University, and Renmin University of China have jointly developed a remarkable system known as “AudioGPT”. It is designed to understand and generate audio modalities within spoken dialogues. Rather than training an audio-capable LLM from scratch, the researchers leverage a range of audio foundation models to process intricate audio information, integrating them with input/output interfaces designed specifically for speech conversations.

Using large language models (LLMs) as a general-purpose interface empowers AudioGPT to tackle an extensive array of audio understanding and generation tasks. Instead of spending valuable resources training LLMs from scratch on spoken language, the researchers connect them to existing audio foundation models capable of understanding and producing speech, music, sound, and even talking heads.
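To make this dispatcher role concrete, here is a minimal, hypothetical sketch: a registry maps task names to audio foundation models, and the task chosen by the LLM routes the request. The model entries and the route_task helper are illustrative stand-ins, not AudioGPT’s actual API.

```python
# Hypothetical sketch: the LLM picks a task, and a registry dispatches the
# request to the matching audio foundation model. All entries are stand-ins.
from typing import Callable, Dict

AUDIO_MODELS: Dict[str, Callable[[str], str]] = {
    "text-to-speech":   lambda text: f"<speech waveform for: {text}>",
    "music-generation": lambda text: f"<music clip for: {text}>",
    "sound-detection":  lambda text: f"<sound events in: {text}>",
    "talking-head":     lambda text: f"<talking-head video for: {text}>",
}

def route_task(task: str, payload: str) -> str:
    """Dispatch a task selected by the LLM to the registered foundation model."""
    if task not in AUDIO_MODELS:
        raise ValueError(f"no audio foundation model registered for task: {task}")
    return AUDIO_MODELS[task](payload)

print(route_task("text-to-speech", "Hello, AudioGPT!"))
```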

To enable spoken interaction, AudioGPT pairs ChatGPT with input/output interfaces that convert speech into text. ChatGPT, equipped with a dialogue engine and prompt manager, then interprets the user’s intention from the transcribed request.
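A minimal sketch of that speech-to-text step, assuming the open-source openai-whisper package is installed; AudioGPT’s own interface may use a different ASR front end, and the audio path below is a placeholder.

```python
# Minimal sketch of the modality-transformation interface, assuming the
# open-source `openai-whisper` package; the file path is a placeholder.
import whisper

def speech_to_text(audio_path: str) -> str:
    """Transcribe a spoken query so ChatGPT can process it as plain text."""
    model = whisper.load_model("base")      # small general-purpose ASR model
    result = model.transcribe(audio_path)   # returns a dict with a "text" field
    return result["text"]

# The transcript is then handed to ChatGPT's prompt manager.
user_text = speech_to_text("user_query.wav")
```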

The AudioGPT pipeline, illustrated in Figure 1 of the paper, consists of four stages:

1. Modality transformation: speech-to-text conversion bridges the user’s spoken input and ChatGPT.
2. Task analysis: ChatGPT uses its dialogue engine and prompt manager to determine the user’s intent.
3. Model assignment: ChatGPT selects the most suitable audio foundation model, passing structured arguments for prosody, timbre, and language control.
4. Response generation: the assigned audio foundation model is executed, and its output is returned to the user as the final answer.
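A hedged sketch of how these four stages might fit together in code; every helper below is an illustrative stub of my own, not AudioGPT’s actual implementation. Real versions would call an ASR model, ChatGPT’s prompt manager, and a selected audio foundation model.

```python
# Illustrative stubs for the four-stage pipeline described above.
from typing import Dict, Tuple

def transcribe(audio_path: str) -> str:
    """Stage 1: modality transformation (speech -> text)."""
    return f"<transcript of {audio_path}>"

def analyze_task(text_query: str) -> Tuple[str, Dict[str, str]]:
    """Stage 2: task analysis; the LLM infers intent and structured arguments."""
    return "text-to-speech", {"text": text_query, "timbre": "neutral"}

def select_model(task: str, arguments: Dict[str, str]):
    """Stage 3: model assignment; pick the matching audio foundation model."""
    return lambda **kwargs: f"<{task} output for {kwargs}>"

def audiogpt_turn(audio_query_path: str) -> str:
    """Stage 4: run the assigned model and wrap its output into the reply."""
    text_query = transcribe(audio_query_path)
    task, arguments = analyze_task(text_query)
    model = select_model(task, arguments)
    audio_output = model(**arguments)
    return f"Response: {audio_output}"

print(audiogpt_turn("user_query.wav"))
```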

Overall, the collaborative research efforts have produced a state-of-the-art system that revolutionizes audio comprehension and generation in spoken dialogues. The integration of advanced models and intelligent processes showcases the ingenuity and expertise of the researchers involved.

The assessment of multi-modal LLMs’ efficacy in understanding human intention and facilitating the collaboration of diverse foundation models is emerging as a highly popular research topic. Experimental findings indicate that AudioGPT showcases remarkable capabilities in handling intricate audio data within multi-turn dialogues, catering to various AI applications such as speech synthesis, music generation, sound interpretation, and talking head animation. This study outlines the design principles and evaluation framework employed to scrutinize the consistency, capacity, and robustness of AudioGPT.
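As one illustration of how such an evaluation might be operationalized, the snippet below scores “consistency” as the fraction of prompts for which the selected task matches the annotated intent; this is a simplified stand-in, not a reproduction of the paper’s actual protocol.

```python
# Illustrative consistency metric: does the selected task match the labeled intent?
from typing import Callable, List, Tuple

def consistency_score(cases: List[Tuple[str, str]],
                      predict_task: Callable[[str], str]) -> float:
    """Fraction of prompts whose predicted task equals the annotated intent."""
    hits = sum(1 for prompt, intent in cases if predict_task(prompt) == intent)
    return hits / len(cases) if cases else 0.0

# Hypothetical annotated prompts and a toy intent classifier.
examples = [
    ("Read this sentence aloud", "text-to-speech"),
    ("Compose a short piano melody", "music-generation"),
]
toy_classifier = lambda p: "text-to-speech" if "aloud" in p else "music-generation"
print(f"consistency: {consistency_score(examples, toy_classifier):.2f}")
```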

Furthermore, the paper highlights AudioGPT’s central contribution: equipping ChatGPT with audio foundation models to enable sophisticated audio tasks, with a modality-transformation interface integrated into ChatGPT as a versatile tool for spoken communication.

Remarkably, AudioGPT surpasses expectations by effectively comprehending and generating audio through multiple rounds of conversation, empowering users to effortlessly produce diverse and high-quality audio content. For the convenience of the community, the code for AudioGPT has been made publicly available on GitHub.
