Sound designing the metaverse

Jan 19, 2022

Recently, we’ve been creating a lot of sound and music for virtual events that take place in real-time rendered 3D worlds. At these events, participants appear as avatars and can mingle, watch keynotes and chat.

*One of the virtual events we’ve worked on, the Cardano Summit by ActiveTheory.*

One of the immediate challenges when composing for an event that runs for several hours or days is creating music that doesn’t become irritating or repetitive for the users. Traditionally this is handled by creating longer compositions or crossfading between multilayered loops.

This is not ideal, since it requires creating and loading a lot of sound assets. We decided on another approach: a more generative music system.

The recipe

To find the right feel, we started out in a linear way by creating short loops. We wanted the music to span different moods but still have a cohesive sound, giving the listener a sense of discovery as the music transforms from one section to another.

Establishing a structural framework for the music was necessary in order to transform the linear loops into something that could be played back procedurally. This framework consisted of defining a number of parameters for each section: the harmonic progression, the melodic content, an instrumental arrangement (which could be used for each section or across sections), and the associated effects and mixing.

We thought of this framework as a musical recipe. Once we had this in place, we were able to start structuring the arrangements and tweaking parameters until the music sounded the way we wanted.

A good recipe for this project turned out to be a chord progression of three to five chords, played by a multi-layered synth pad, plus two or three generated melodies played by various samplers.

The audio files of each chord (and their respective layers) were exported separately to be played back in any order and each one contained its own predefined melody. We then separated the note values, durations and velocities so that we could affect each property on its own. For example, we could randomly choose between a set of defined note durations but keep the order of the notes intact, resulting in an ever-evolving, but recognizable melody.
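
As a rough illustration of keeping the note order fixed while varying the other properties, here is a minimal TypeScript sketch. It is not Klang's implementation: the `MelodyNote` shape, the `varyMelody` function and the example values are all hypothetical.

```typescript
// Hypothetical data shape: pitch, duration and velocity are stored separately
// so each property can be varied on its own.
interface MelodyNote {
  pitch: number;    // MIDI note number
  duration: number; // length in beats
  velocity: number; // 0..1
}

// A random source returning values in [0, 1), like Math.random.
type Rng = () => number;

function pick<T>(options: readonly T[], rng: Rng): T {
  if (options.length === 0) {
    throw new Error("pick() needs at least one option");
  }
  const index = Math.min(options.length - 1, Math.floor(rng() * options.length));
  return options[index];
}

// Keeps the pitch order intact and draws a fresh duration and velocity
// for every note from the allowed sets.
function varyMelody(
  pitches: readonly number[],
  durationChoices: readonly number[],
  velocityChoices: readonly number[],
  rng: Rng = Math.random
): MelodyNote[] {
  return pitches.map((pitch) => ({
    pitch,
    duration: pick(durationChoices, rng),
    velocity: pick(velocityChoices, rng),
  }));
}

// Example: the same four pitches, with a different rhythm on every call.
const exampleVariation = varyMelody([60, 62, 64, 67], [0.25, 0.5, 1], [0.6, 0.8]);
```

Passing the random source as a parameter is only a convenience in this sketch: it makes the variation reproducible when a fixed or seeded function is supplied.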

Structuring the music at this level of granularity made it simple to have the music react to the virtual world. For example, when the avatar starts running, we change the intensity of an arpeggio, and when the user discovers an artifact, we play a little fanfare made of the next three notes of the current melody.
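
The fanfare idea can be sketched in a few lines of TypeScript. This is an assumption-laden illustration rather than the actual system: `nextNotes` is a hypothetical helper, `position` is assumed to be the index of the next unplayed note, and wrapping around at the end of the melody is our guess.

```typescript
// Returns the next `count` notes of a melody, starting at `position`
// and wrapping around to the beginning when the melody ends.
function nextNotes(
  melody: readonly number[],
  position: number,
  count: number = 3
): number[] {
  if (melody.length === 0) {
    return [];
  }
  const notes: number[] = [];
  for (let i = 0; i < count; i++) {
    notes.push(melody[(position + i) % melody.length]);
  }
  return notes;
}

// Example: the melody is about to play its fourth note (index 3)
// when the user discovers an artifact.
const currentMelody = [60, 62, 64, 67, 69]; // MIDI note numbers
const fanfare = nextNotes(currentMelody, 3);
```

Because the fanfare is taken from the melody that is already playing, it stays in key with the current section.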

In total, we made six large sections following this recipe. One decision we made early on was to let the music advance by itself, so that progressing to a new musical section wouldn’t need to be tied to an event within the experience.

When working with adaptive music you often structure the music to always follow what’s happening on the screen. For example, the music might change when you advance to the next level in a game. In this case, a user might stay in the same world for the entire experience so it made sense to change the music based on time instead.
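
A time-based progression can be as simple as deriving the section index from elapsed time. The sketch below is a guess at the general shape, not the production logic: the function name, the ten-minute section length and the idea of cycling through the sections in order are all assumptions. Only the number of sections (six) comes from the project.

```typescript
// Maps elapsed time to a section index, cycling through all sections in order.
function sectionAt(
  elapsedSeconds: number,
  sectionSeconds: number,
  sectionCount: number
): number {
  if (sectionSeconds <= 0 || sectionCount <= 0) {
    throw new Error("sectionSeconds and sectionCount must be positive");
  }
  const safeElapsed = Math.max(0, elapsedSeconds);
  return Math.floor(safeElapsed / sectionSeconds) % sectionCount;
}

// Example: six sections, each assumed to last ten minutes.
const SECTION_COUNT = 6;
const SECTION_SECONDS = 10 * 60;
const sectionAfterAnHour = sectionAt(60 * 60, SECTION_SECONDS, SECTION_COUNT);
```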

Klang

Klang is our bespoke audio library. It handles everything one needs to play audio in the browser, such as sequencing, samplers and effects. To make the integration as straightforward as possible, we usually create a custom controller class for interfacing with Klang. The communication between the main application and Klang is then handled by triggering project-specific sound events.
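
To give a feel for the controller pattern, here is a minimal TypeScript sketch. Klang's real API is not shown in this article, so `AudioEngine` is a hypothetical stand-in, and the class, event and parameter names are invented for illustration.

```typescript
// Hypothetical stand-in for the audio library (not Klang's actual API).
interface AudioEngine {
  play(soundId: string): void;
  setParameter(name: string, value: number): void;
}

type SoundEventHandler = (engine: AudioEngine, payload?: number) => void;

// The main application only knows event names; the mapping from events
// to audio behaviour lives in one place.
class SoundController {
  private readonly engine: AudioEngine;
  private readonly handlers = new Map<string, SoundEventHandler>();

  constructor(engine: AudioEngine) {
    this.engine = engine;
  }

  on(eventName: string, handler: SoundEventHandler): this {
    this.handlers.set(eventName, handler);
    return this;
  }

  // Returns false when no handler is registered, so unknown events are ignored.
  trigger(eventName: string, payload?: number): boolean {
    const handler = this.handlers.get(eventName);
    if (!handler) {
      return false;
    }
    handler(this.engine, payload);
    return true;
  }
}

// Example wiring with a fake engine that just records what it was asked to do.
const engineLog: string[] = [];
const fakeEngine: AudioEngine = {
  play: (soundId) => { engineLog.push("play:" + soundId); },
  setParameter: (name, value) => { engineLog.push("set:" + name + "=" + value); },
};

const soundController = new SoundController(fakeEngine)
  .on("avatar_run_start", (engine) => engine.setParameter("arpeggioIntensity", 1))
  .on("artifact_found", (engine) => engine.play("fanfare"));

soundController.trigger("avatar_run_start");
```

With this kind of split, the application code never touches audio objects directly; it only calls `trigger` with an event name.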

Spatial audio

Using 3D panners is quite performance-intensive, so they should be used with care. We usually set our Web Audio panner nodes to use the simpler “equalpower” panning model. This often works really well and is not as performance-demanding as the “HRTF” model.
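
In Web Audio terms, this choice is the `panningModel` property of a `PannerNode`, which accepts `"equalpower"` or `"HRTF"`. The sketch below shows one way to set it together with a position. `configurePanner` and the `PannerLike` interface are hypothetical helpers, declared here so the example does not depend on browser types; a real `PannerNode` has the same properties.

```typescript
type PanningModel = "equalpower" | "HRTF";

// Minimal structural view of the PannerNode properties used here.
interface PannerLike {
  panningModel: PanningModel;
  positionX: { value: number };
  positionY: { value: number };
  positionZ: { value: number };
}

// Sets the panning model explicitly and places the source in 3D space.
function configurePanner(
  panner: PannerLike,
  position: readonly [number, number, number],
  model: PanningModel = "equalpower"
): void {
  panner.panningModel = model;
  panner.positionX.value = position[0];
  panner.positionY.value = position[1];
  panner.positionZ.value = position[2];
}

// Example with a plain object standing in for a real node.
const demoPanner: PannerLike = {
  panningModel: "HRTF",
  positionX: { value: 0 },
  positionY: { value: 0 },
  positionZ: { value: 0 },
};
configurePanner(demoPanner, [2, 0, -5]);
```

In the browser, the same function can be called with a node created via `audioContext.createPanner()`.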

By piping a sound through Klang’s audio bus system, which handles signal routing and effects, we can seamlessly transition between spatialized and normal stereo playback. This is effective for video assets that live in the 3D world and can also be viewed as a “theater mode” video.
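
The article does not describe how Klang performs this transition. One common way to blend two signal paths is an equal-power crossfade, sketched below in TypeScript as an assumption. `crossfadeGains`, `applyCrossfade` and the `GainLike` interface are hypothetical; a Web Audio `GainNode` has the same `gain.value` property.

```typescript
// Minimal structural view of a gain stage.
interface GainLike {
  gain: { value: number };
}

// Equal-power crossfade: the squared gains always sum to 1,
// so the perceived loudness stays roughly constant during the blend.
// mix = 0 is fully spatialized, mix = 1 is fully stereo.
function crossfadeGains(mix: number): { spatial: number; stereo: number } {
  const t = Math.min(1, Math.max(0, mix));
  return {
    spatial: Math.cos((t * Math.PI) / 2),
    stereo: Math.sin((t * Math.PI) / 2),
  };
}

// Applies the crossfade to two buses that receive the same source signal.
function applyCrossfade(spatialBus: GainLike, stereoBus: GainLike, mix: number): void {
  const gains = crossfadeGains(mix);
  spatialBus.gain.value = gains.spatial;
  stereoBus.gain.value = gains.stereo;
}

// Example: halfway between the in-world sound and "theater mode".
const spatialBus: GainLike = { gain: { value: 1 } };
const stereoBus: GainLike = { gain: { value: 0 } };
applyCrossfade(spatialBus, stereoBus, 0.5);
```

In a real project the `mix` value would be animated over time, for instance while the camera moves toward the video screen, rather than set in a single step.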

Creative performance limits

When working with the web, performance is always of the utmost importance. The number of instrument instances that we are able to use is a trade-off between performance and how dynamic and varied the music should be.

Designing for the browser meant we also had to keep the download size of audio files as small as possible to avoid waiting time for the user. In the end, we loaded 3.8 MB of compressed audio for the generative music, which allowed for a few days of varying music.

For performance reasons we chose to use sample-based instruments rather than embedding synthesizers. That meant we couldn’t play with as many parameters to make the sounds feel alive. The automations of the oscillator waveforms had to be baked into the audio file. Other automations that happen on a larger time-scale such as increasing/decreasing reverb, delay, EQ and volume could be controlled via Klang when playing back the chord samples.

The Klang audio framework has a sophisticated effects bus system that we used to make the synth pads come alive.

The Future

We are lucky enough to be building on top of a number of open source libraries, such as Tone.js and standardized-audio-context. We are rapidly augmenting this base with custom code for easy drop-in use, a variety of configurable sequencers, effects, sound palettes and spatial audio tools.

Coming from the perspective of a group of composers, sound designers and software engineers, we want to put creative people in the driver’s seat when using these systems. We are continuing to focus on building out a highly modular generative toolset that excels on the performance front while keeping the bundle size small for the end user, and that allows composers to expand what they can do creatively with audio on the web.