Long-Form Text-to-Music Generation with Adaptive Prompts:
A Case of Study in Tabletop Role-Playing Games Soundtracks

Felipe Marra, Lucas N. Ferreira

LAMIR 2024


Paper: https://arxiv.org/html/2411.03948v1

Code: github.com/FelipeMarra/babel-bardo

Abstract

This paper investigates the capabilities of text-to-audio music generation models in producing long-form music with prompts that change over time, focusing on soundtrack generation for Tabletop Role-Playing Games (TRPGs). We introduce Babel Bardo, a system that uses Large Language Models to transform speech transcriptions into music descriptions for controlling a text-to-music model. Four versions of Babel Bardo were compared in two TRPG campaigns: a baseline using direct speech transcriptions, and three LLM-based versions with varying approaches to music description generation. Evaluations considered audio quality, story alignment, and transition smoothness. Results indicate that detailed music descriptions improve audio quality, while maintaining consistency across consecutive descriptions enhances transition smoothness. Emotion-based descriptions proved effective for aligning generated music with TRPG narratives.


System Overview

Every 30 seconds of gameplay, Babel Bardo transcribes the players’ speech into a text s_i using a Speech Recognition (SR) system and uses a Large Language Model (LLM) to map s_i into a music description d_i that matches the scene described by the players. This music description is given to a Text-to-Music system that generates a 30-second piece a_i directly in the audio domain.
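The loop described above can be sketched roughly as follows. This is a minimal illustration and not the code from the Babel Bardo repository: the names `run_pipeline`, `Segment`, and the three `stub_*` functions are hypothetical, and the stubs stand in for the real Speech Recognition, LLM, and Text-to-Music components so that the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List

SEGMENT_SECONDS = 30  # one transcript, one description, one audio piece per 30 s


@dataclass
class Segment:
    index: int
    transcript: str   # s_i: what the players said in this window
    description: str  # d_i: music description derived from s_i
    audio: str        # a_i: placeholder for the generated 30-second piece


def run_pipeline(
    speech_chunks: List[str],
    transcribe: Callable[[str], str],
    describe: Callable[[str], str],
    generate: Callable[[str, int], str],
) -> List[Segment]:
    """Map each 30-second speech chunk to s_i, then d_i, then a_i."""
    segments = []
    for i, chunk in enumerate(speech_chunks):
        s_i = transcribe(chunk)
        d_i = describe(s_i)
        a_i = generate(d_i, SEGMENT_SECONDS)
        segments.append(Segment(i, s_i, d_i, a_i))
    return segments


# Hypothetical stand-ins for the real components (assumptions, not real APIs).
def stub_transcribe(chunk: str) -> str:
    # Pretend the chunk is already text; a real system would run SR on audio.
    return chunk.strip()


def stub_describe(transcript: str) -> str:
    # Keyword rule standing in for the LLM that writes the music description.
    if "attack" in transcript.lower():
        return "agitated orchestral music with fast drums"
    return "calm acoustic guitar"


def stub_generate(description: str, seconds: int) -> str:
    # Returns a label instead of audio; a real system would call a music model.
    return f"<{seconds}s of audio for: {description}>"


chunks = [
    "A father and his son arrive at the village.",
    "We attack the village now!",
]
segments = run_pipeline(chunks, stub_transcribe, stub_describe, stub_generate)
for seg in segments:
    print(seg.index, "|", seg.description, "|", seg.audio)
```

In this sketch, the baseline mentioned in the abstract (direct speech transcriptions as prompts) would correspond to passing an identity function as `describe`, while the LLM-based versions would differ only in how `describe` builds d_i.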




Demo

The following examples each contain a video link and four embedded videos, each three minutes long. The link points to the original gameplay video, starting at the same time as the generated examples. The first embedded video also shows the original gameplay, but for the RPG “O Segredo na Ilha” the original soundtrack volume is increased. The other three embedded videos present the soundtracks generated by three versions of Babel Bardo: Emotion, Description, and Description Continuation. The name of each Babel Bardo version links to the system “logs”, which contain the LLM prompt, the transcription generated by the SR system, and the LLM-generated music description.

Example 1

Video Link: Call of The Wild, Episode 2

About: In this example, we start 1 min into the episode. Although the generation started at 0 min, we decided to center the example around a point where the systems might have identified an emotion transition. The video starts with clips from episode 1, recalling the tragedies faced by the players. When the adventure of episode 2 begins, the players are talking about invading a farming village. The possible transition point happens at 1:27, when the game master narrates a calm scene about the farmers, in which a father and his son are arriving at the village. Although it looks like a calm scene, in this context the players will actually destroy the village, since the villagers kidnapped members of the players’ tribe.

Original

Emotion

Babel Bardo Emotion receives the emotion "Calm" from the LLM in this excerpt, but the output stays consistent with the previously generated music.

Description

The Description version consistently generates a suspenseful piece.

Description Continuation

We can observe that Babel Bardo Description Continuation (BBDC) reacted to the calm scene at 1:50, generating a calm guitar. BBDC goes back and forth between calm guitar and suspense music, then settles on calm guitar.

Example 2

Video Link: O Segredo na Ilha, Episode 1

About: In this example, we start at 3:23:09 in the episode. The players are discussing letters they found in an abandoned mansion. They are trying to discover who wrote the letters; the authors might have been employees of the mansion or residents of the island where the mansion is situated. They are also trying to work out the meaning of some documents they found. The overall atmosphere is tense and mysterious.

Original w/ Augmented Volume

Emotion

Babel Bardo Emotion prompts go from "Suspenseful", to "Agitated", to "Calm". The music sounds a little off from the start, with a theme that sounds somewhat oriental. It creates and maintains a theme.

Description

Babel Bardo Description starts with a great suspenseful song, but then completely loses it at 1:53.

Description Continuation

Babel Bardo Description Continuation receives a prompt to simply continue the previous generation, without a new description, four times in a row. It generates an interesting, mysterious piece, but the transitions and generation restarts are very noticeable. Notice the transition around the 1 min mark.

Example 3

Video Link: Call Of The Wild, Episode 6

About: In this example, we start at 16:21 in the episode and 15:00 in the generation, to check how the system performs after several minutes of generation. The original piece starts suspenseful, since the players were in a tense moment trying to attract the enemy into a battle. When the battle begins, around 1:23, the original music changes into a more agitated one.

Original w/ Augmented Volume

Emotion

Babel Bardo Emotion starts with a suspenseful piece, matching the overall vibe of the original. Although it received various "Agitated" prompts, it remains consistent and the music is not adapted to a battle theme.

Description

After the first 30 s of generation, Babel Bardo Description starts inserting numerous silences throughout the generated piece. It keeps the suspenseful vibe and also does not adapt to a more battle-like theme. At 2:22 the drums go crazy.

Description Continuation

Babel Bardo Description Continuation starts with an interesting tense theme. At 51 s it gets more agitated, then it restarts and keeps a somewhat consistent generation, with some pretty interesting horns at the end.