
David Mezzetti for NeuML

Originally published at neuml.hashnode.dev

Generative Audio

txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

txtai works with much more than just text! It has rich multimedia and multimodal capabilities.

This article demonstrates how to build generative audio workflows. These workflows generate a combined audio stream of narrated text and matching background audio for a series of poems.

Install dependencies

Install txtai and all dependencies.

pip install txtai[pipeline-audio] autoawq

Define a Generative Audio workflow

The next section defines a generative audio workflow. It consists of the following pipelines:

  • LLM
    • Llama 3 model used to describe the emotions of a given story or poem
  • Text To Audio
    • Builds audio given a text prompt
  • Text To Speech
    • Converts text to speech
  • Audio Mixer
    • Joins multiple audio streams together into a single stream
import logging

import soundfile as sf

from IPython.display import Audio, display

from txtai import LLM
from txtai.pipeline import AudioMixer, TextToAudio, TextToSpeech
from txtai.workflow import Workflow, Task, TemplateTask

# Enable DEBUG logging
logging.basicConfig()
logging.getLogger("txtai.workflow.base").setLevel(logging.DEBUG)
logging.getLogger("txtai.workflow.task.base").setLevel(logging.DEBUG)

def play(audio):
  # Convert to MP3 to save space
  sf.write("audio.wav", audio[0].T, audio[1])
  !ffmpeg -i audio.wav -y -b:a 64k audio.mp3 2> /dev/null

  # Play speech
  display(Audio(filename="audio.mp3"))
  return audio

# LLM
llm = LLM("hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4")

# Text to Audio
# Important: The code for musicgen is licensed as MIT but model weights are CC-BY-NC
tta = TextToAudio("facebook/musicgen-stereo-small")

# Audio mixer
mixer = AudioMixer()

# Define prompt template
template = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Write 3-5 emotions, keywords and holidays to describe the following text. ONLY answer with a comma separated list and no preceding statement.

{text}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

# Background music subworkflow
music = Workflow([
    TemplateTask(
        template=template,
        action=llm
    ),
    Task(action=tta),
])

"The Raven" by Edgar Allan Poe

The first workflow will generate speech and corresponding background music for the first verse of "The Raven" by Edgar Allan Poe.

This poem is a fitting choice, given that Halloween was approaching at the time of publication. 🎃👻🌕

# Text to speech
tts = TextToSpeech("neuml/vctk-vits-onnx", rate=32000)

# Define the workflow
workflow = Workflow(tasks=[
    # Synthesize speech (speaker 3) and run the background music subworkflow,
    # then pair the two audio streams together (hstack merge)
    Task(action=[lambda x: tts(x, speaker=3), music], merge="hstack", unpack=False),
    # Mix each speech/music pair, scaling the background music to half volume
    Task(action=lambda x: mixer(x, scale2=0.5), unpack=False),
    # Save each combined stream and play it back
    Task(action=lambda x: [play(y) for y in x], unpack=False)
])

list(workflow(["""
Once upon a midnight dreary, while I pondered, weak and weary,

Over many a quaint and curious volume of forgotten lore—

While I nodded, nearly napping, suddenly there came a tapping,

As of some one gently rapping, rapping at my chamber door.

’Tis some visitor, I muttered, "tapping at my chamber door— Only this and nothing more.”
"""]))

Listen: https://www.youtube.com/watch?v=KHNAvoLUhHI

DEBUG:txtai.workflow.base:Running Task #0
DEBUG:txtai.workflow.task.base:Inputs: ['\nOnce upon a midnight dreary, while I pondered, weak and weary,\n\nOver many a quaint and curious volume of forgotten lore—\n\nWhile I nodded, nearly napping, suddenly there came a tapping,\n\nAs of some one gently rapping, rapping at my chamber door.\n\n’Tis some visitor, I muttered, "tapping at my chamber door— Only this and nothing more.”\n']
DEBUG:txtai.workflow.task.base:Outputs: [(array([0.00204471, 0.00245908, 0.00251085, ..., 0.00101355, 0.00124749,
       0.00157734], dtype=float32), 32000)]
DEBUG:txtai.workflow.task.base:Inputs: ['\nOnce upon a midnight dreary, while I pondered, weak and weary,\n\nOver many a quaint and curious volume of forgotten lore—\n\nWhile I nodded, nearly napping, suddenly there came a tapping,\n\nAs of some one gently rapping, rapping at my chamber door.\n\n’Tis some visitor, I muttered, "tapping at my chamber door— Only this and nothing more.”\n']
DEBUG:txtai.workflow.base:Running Task #0
DEBUG:txtai.workflow.task.base:Inputs: ['\n<|begin_of_text|><|start_header_id|>user<|end_header_id|>\nWrite 3-5 emotions, keywords and holidays to describe the following text. ONLY answer with a comma separated list and no preceding statement.\n\n\nOnce upon a midnight dreary, while I pondered, weak and weary,\n\nOver many a quaint and curious volume of forgotten lore—\n\nWhile I nodded, nearly napping, suddenly there came a tapping,\n\nAs of some one gently rapping, rapping at my chamber door.\n\n’Tis some visitor, I muttered, "tapping at my chamber door— Only this and nothing more.”\n\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n']
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
DEBUG:txtai.workflow.task.base:Outputs: ['melancholy, mystery, curiosity, introspection, Halloween']
DEBUG:txtai.workflow.base:Running Task #1
DEBUG:txtai.workflow.task.base:Inputs: ['melancholy, mystery, curiosity, introspection, Halloween']
`torch.nn.functional.scaled_dot_product_attention` does not support having an empty attention mask. Falling back to the manual attention implementation. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.Note that this probably happens because `guidance_scale>1` or because you used `get_unconditional_inputs`. See https://github.com/huggingface/transformers/issues/31189 for more information.
DEBUG:txtai.workflow.task.base:Outputs: [(array([[-0.01685709, -0.0192524 , -0.01729976, ...,  0.02864039,
         0.02873872,  0.02577066],
       [-0.02714959, -0.0311739 , -0.02744334, ...,  0.2672284 ,
         0.266621  ,  0.26353633]], dtype=float32), 32000)]
DEBUG:txtai.workflow.task.base:Outputs: [(array([[-0.01685709, -0.0192524 , -0.01729976, ...,  0.02864039,
         0.02873872,  0.02577066],
       [-0.02714959, -0.0311739 , -0.02744334, ...,  0.2672284 ,
         0.266621  ,  0.26353633]], dtype=float32), 32000)]
DEBUG:txtai.workflow.base:Running Task #1
DEBUG:txtai.workflow.task.base:Inputs: [((array([0.00204471, 0.00245908, 0.00251085, ..., 0.00101355, 0.00124749,
       0.00157734], dtype=float32), 32000), (array([[-0.01685709, -0.0192524 , -0.01729976, ...,  0.02864039,
         0.02873872,  0.02577066],
       [-0.02714959, -0.0311739 , -0.02744334, ...,  0.2672284 ,
         0.266621  ,  0.26353633]], dtype=float32), 32000))]
DEBUG:txtai.workflow.task.base:Outputs: [(array([[-0.00638384, -0.00716712, -0.00613903, ..., -0.05081543,
        -0.05321765, -0.0549943 ],
       [-0.01153009, -0.01312787, -0.01121082, ..., -0.01355931,
        -0.02704029, -0.03997342]], dtype=float32), 32000)]
DEBUG:txtai.workflow.base:Running Task #2
DEBUG:txtai.workflow.task.base:Inputs: [(array([[-0.00638384, -0.00716712, -0.00613903, ..., -0.05081543,
        -0.05321765, -0.0549943 ],
       [-0.01153009, -0.01312787, -0.01121082, ..., -0.01355931,
        -0.02704029, -0.03997342]], dtype=float32), 32000)]

This is quite amazing 🔥

From a single verse of text, we generated not only speech but also spooky background music to go along with it.

The LLM reads the text and writes a series of emotions, keywords and other descriptive words. That output is then passed to a music generation model, which creates the corresponding background music. Finally, an audio mixer pipeline joins the streams together and the audio is saved for playback.
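To make that data flow concrete, here is a rough sketch of the same steps run by hand, reusing the llm, tta, tts and mixer pipelines and the play helper defined above. Treat it as illustrative only; it assumes the pipelines accept single string inputs and that the mixer takes a list of (speech, music) pairs, while the workflow handles that packing for us.

# Illustrative sketch only: the same flow without the Workflow wrapper,
# reusing the pipelines defined earlier in this article
text = "Once upon a midnight dreary, while I pondered, weak and weary..."

speech = tts(text, speaker=3)                  # (audio array, sample rate)
keywords = llm(template.format(text=text))     # e.g. "melancholy, mystery, ..."
background = tta(keywords)                     # generated background music

# Mix speech with the background music scaled to half volume, then play
mixed = mixer([(speech, background)], scale2=0.5)
play(mixed[0])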

This is the power⚡ of txtai workflows. Some may call it "agentic". Whatever we call it, it combines multiple models, small and large, into a single execution flow.

"A Visit from St. Nicholas" by Clement Clarke Moore

Next, we'll create audio for the classic Christmas tale, also known as "The Night Before Christmas" 🎅🎄❄️

We'll use a different voice this time: mine! This is the default voice for the neuml/txtai-speecht5-onnx model.

tts = TextToSpeech("neuml/txtai-speecht5-onnx", rate=32000)

# Define the workflow
workflow = Workflow(tasks=[
    Task(action=[tts, music], merge="hstack", unpack=False),
    # Keep the background music much quieter this time
    Task(action=lambda x: mixer(x, scale2=0.05), unpack=False),
    Task(action=lambda x: [play(y) for y in x], unpack=False)
])

list(workflow(["""
'Twas the night before Christmas, when all through the house, not a creature was stirring, not even a mouse.

The stockings were hung by the chimney with care, in hopes that Saint Nicholas soon would be there.

The children were nestled all snug in their beds, while visions of sugar plums danced in their heads.

And mamma in her kerchief, and I in my cap, had just settled our brains, for a long winter’s nap.

When out on the lawn there arose such a clatter, I sprang from my bed to see what was the matter.
"""]))

Listen: https://www.youtube.com/watch?v=TnHkYMXyHhU

DEBUG:txtai.workflow.base:Running Task #0
DEBUG:txtai.workflow.task.base:Inputs: ["\n'Twas the night before Christmas, when all through the house, not a creature was stirring, not even a mouse.\n\nThe stockings were hung by the chimney with care, in hopes that Saint Nicholas soon would be there.\n\nThe children were nestled all snug in their beds, while visions of sugar plums danced in their heads.\n\nAnd mamma in her kerchief, and I in my cap, had just settled our brains, for a long winter’s nap.\n\nWhen out on the lawn there arose such a clatter, I sprang from my bed to see what was the matter.\n"]
DEBUG:txtai.workflow.task.base:Outputs: [(array([-3.9214676e-05,  1.7410064e-04,  1.9779154e-04, ...,
       -1.0386602e-03, -9.1643957e-04, -4.9463823e-04], dtype=float32), 32000)]
DEBUG:txtai.workflow.task.base:Inputs: ["\n'Twas the night before Christmas, when all through the house, not a creature was stirring, not even a mouse.\n\nThe stockings were hung by the chimney with care, in hopes that Saint Nicholas soon would be there.\n\nThe children were nestled all snug in their beds, while visions of sugar plums danced in their heads.\n\nAnd mamma in her kerchief, and I in my cap, had just settled our brains, for a long winter’s nap.\n\nWhen out on the lawn there arose such a clatter, I sprang from my bed to see what was the matter.\n"]
DEBUG:txtai.workflow.base:Running Task #0
DEBUG:txtai.workflow.task.base:Inputs: ["\n<|begin_of_text|><|start_header_id|>user<|end_header_id|>\nWrite 3-5 emotions, keywords and holidays to describe the following text. ONLY answer with a comma separated list and no preceding statement.\n\n\n'Twas the night before Christmas, when all through the house, not a creature was stirring, not even a mouse.\n\nThe stockings were hung by the chimney with care, in hopes that Saint Nicholas soon would be there.\n\nThe children were nestled all snug in their beds, while visions of sugar plums danced in their heads.\n\nAnd mamma in her kerchief, and I in my cap, had just settled our brains, for a long winter’s nap.\n\nWhen out on the lawn there arose such a clatter, I sprang from my bed to see what was the matter.\n\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"]
DEBUG:txtai.workflow.task.base:Outputs: ['Peaceful, Hope, Joy, Christmas, Calm, Slumber, Wonder, Excitement']
DEBUG:txtai.workflow.base:Running Task #1
DEBUG:txtai.workflow.task.base:Inputs: ['Peaceful, Hope, Joy, Christmas, Calm, Slumber, Wonder, Excitement']
DEBUG:txtai.workflow.task.base:Outputs: [(array([[ 0.01238994,  0.00964349,  0.02490981, ..., -0.02256315,
        -0.02624696, -0.01479813],
       [-0.0111579 , -0.01307007,  0.00245246, ...,  0.02333916,
         0.01998244,  0.02509145]], dtype=float32), 32000)]
DEBUG:txtai.workflow.task.base:Outputs: [(array([[ 0.01238994,  0.00964349,  0.02490981, ..., -0.02256315,
        -0.02624696, -0.01479813],
       [-0.0111579 , -0.01307007,  0.00245246, ...,  0.02333916,
         0.01998244,  0.02509145]], dtype=float32), 32000)]
DEBUG:txtai.workflow.base:Running Task #1
DEBUG:txtai.workflow.task.base:Inputs: [((array([-3.9214676e-05,  1.7410064e-04,  1.9779154e-04, ...,
       -1.0386602e-03, -9.1643957e-04, -4.9463823e-04], dtype=float32), 32000), (array([[ 0.01238994,  0.00964349,  0.02490981, ..., -0.02256315,
        -0.02624696, -0.01479813],
       [-0.0111579 , -0.01307007,  0.00245246, ...,  0.02333916,
         0.01998244,  0.02509145]], dtype=float32), 32000))]
DEBUG:txtai.workflow.task.base:Outputs: [(array([[ 5.8028224e-04,  6.5627520e-04,  1.4432818e-03, ...,
         2.3652171e-04, -6.1154435e-04,  1.3553995e-03],
       [-5.9710984e-04, -4.7940284e-04,  3.2041437e-04, ...,
        -2.5115125e-03, -7.7958062e-04, -9.4321877e-05]], dtype=float32), 32000)]
DEBUG:txtai.workflow.base:Running Task #2
DEBUG:txtai.workflow.task.base:Inputs: [(array([[ 5.8028224e-04,  6.5627520e-04,  1.4432818e-03, ...,
         2.3652171e-04, -6.1154435e-04,  1.3553995e-03],
       [-5.9710984e-04, -4.7940284e-04,  3.2041437e-04, ...,
        -2.5115125e-03, -7.7958062e-04, -9.4321877e-05]], dtype=float32), 32000)]

Wrapping up

This article demonstrated how to build a series of generative audio workflows for poems. This capability has potential applications in creative fields.

Are we at the point where a single pipeline can take a prompt and generate a full multimedia video? Not quite, but we're getting close. Interesting times are certainly ahead!
