
As the game development industry continues to mature, games are becoming more sophisticated and complex. Procedurally generated environments now create experiences with an almost infinite number of possibilities, and developers can no longer predict or account for every possible interaction within the play space. Developers are also constrained by the memory and processing power of their target platforms and must find efficient new ways of creating virtual worlds within those limits. Procedurally generated sound is a new and promising avenue for audio teams, and techniques are in development for each of the three main subfields of sound in games. The first subfield is the dialogue spoken by the player and other characters in the environment. The second is the sound effects produced by interactions and triggers within a scene. Finally, music plays an essential role in establishing the tone and mood of a world. Procedural sound creation is approached a little differently in each of these three fields to meet their specific requirements, and applying these methods has the potential to meet the growing auditory demands of the industry.

The first element of procedurally generating audio within games is dialogue creation. Dialogue creation comprises up to two components, each with a distinct purpose. The first component is generating what is going to be said, which requires analyzing the situational context of the game state at any given time. Each game contains scenes that store variables significant to that scene. A scene may represent a racing scenario, for example, where each car has a position in virtual space, a speed, and a collision detection system. In addition, there is a win state that is triggered by a car crossing the finish line. All of these elements are contained within the game state. When procedurally generating audio in this scenario, there has to be a way for the game to make sense of the individual pieces of information in order to create contextual dialogue that can be relayed through the race’s announcer. Fortunately, there are a number of artificial intelligence techniques used to represent situations like this.

A finite state machine is a useful technique that models behaviour as a set of states and the transitions between them, often using discrete Boolean variables consisting of true or false evaluations to make decisions. In a shooter, enemies are commonly in one of a number of states, such as idle, attack, defend, or patrol, and transitions between these states are triggered by evaluating variables within the scene.

[Figure: Finite State Machine Example]

As the game loop continues, the program constantly evaluates the finite state machine’s associated variables to choose the appropriate state at any given time. A state change is an excellent opportunity for the algorithm to choose dialogue for that moment, taking the different variables into account. For example, an enemy may spot the player, triggering a shout or a call for backup; this in turn alters the states of other enemies nearby, each of which may respond with dialogue of its own, such as a reply over the radio the enemy is holding.
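
As a rough sketch, the Python snippet below (with invented states, scene variables, and dialogue lines rather than any engine’s actual API) shows how a state transition can double as the trigger that selects a contextual line:

```python
import random

class EnemyFSM:
    def __init__(self):
        self.state = "patrol"

    def update(self, can_see_player, under_fire):
        # Evaluate scene variables each tick of the game loop.
        if under_fire:
            new_state = "defend"
        elif can_see_player:
            new_state = "attack"
        else:
            new_state = "patrol"

        # A state change is the hook for choosing contextual dialogue.
        if new_state != self.state:
            self._speak(self.state, new_state)
            self.state = new_state

    def _speak(self, old, new):
        lines = {
            ("patrol", "attack"): ["Contact! I need backup!", "There he is!"],
            ("attack", "defend"): ["Taking fire, get to cover!"],
            ("attack", "patrol"): ["Lost him. Resuming patrol."],
        }
        options = lines.get((old, new))
        if options:
            print(random.choice(options))  # in a game, queue this as a voice line

enemy = EnemyFSM()
enemy.update(can_see_player=True, under_fire=False)  # patrol -> attack
enemy.update(can_see_player=True, under_fire=True)   # attack -> defend
```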

Fuzzy logic is a technique that allows a program to interpret imprecise information in order to make general decisions. Each variable is represented by a numeric value between zero and one, expressing a degree of truth: zero represents a fully false statement and one a fully true statement. This technique is useful for decisions that change over time because the boundaries of each variable can overlap. A good example is describing distances with generic words such as very small, small, big, or bigger.

[Figure: Fuzzy Logic]

If an object moves away from something, an initial “big distance” membership will decrease from 1 to 0 while the “bigger distance” membership increases from 0 to 1 over the same range. In relation to dialogue, the announcer in the racing example can be given the change in the distance between two cars and may comment on how one is falling behind. At that point, the game can play pre-recorded dialogue for a number of different situations or, using natural language processing, generate situational sentences on its own.
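
A minimal sketch of this idea follows, in Python, with arbitrary distance breakpoints of 30 and 60 metres chosen purely for illustration; the overlapping memberships drive a simple announcer decision:

```python
def membership_big(distance):
    """1.0 up to 30 m, fading linearly to 0.0 at 60 m."""
    if distance <= 30:
        return 1.0
    if distance >= 60:
        return 0.0
    return (60 - distance) / 30

def membership_bigger(distance):
    """The complementary ramp: 0.0 up to 30 m, rising to 1.0 at 60 m."""
    return 1.0 - membership_big(distance)

def announcer_comment(gap_now, gap_before):
    # The change in the gap between two cars decides whether the announcer speaks.
    falling_behind = membership_bigger(gap_now) - membership_bigger(gap_before)
    if falling_behind > 0.2:
        return "Car two is starting to fall behind!"
    return None

print(announcer_comment(gap_now=55, gap_before=35))
```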

Natural language processing (NLP) is a technique that can be used to generate dialogue from scratch. By following pre-programmed grammatical rules involving syntax and semantics, NLP can generate different types of sentences. For example, a specific racecar behaves as its own object, so when creating a declarative sentence from that car’s driver, the system assigns “I” as the subject. If fuzzy logic has determined that the car needs to speed up, it assigns “speed up” as the verb phrase. The generated sentence may then become, “I need to speed up,” or, “I will speed up.”
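
The sketch below is a deliberately tiny, template-style illustration of this slot-filling idea rather than a real NLP library; the function name and slots are invented:

```python
def declarative_sentence(subject, verb_phrase, modal="need to"):
    """Assemble a declarative sentence from grammatical slots."""
    return f"{subject} {modal} {verb_phrase}."

# Suppose fuzzy logic decided the driver's car is falling behind.
falling_behind = True
if falling_behind:
    print(declarative_sentence("I", "speed up"))          # "I need to speed up."
    print(declarative_sentence("I", "speed up", "will"))  # "I will speed up."
```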

The second component of procedural dialogue creation is synthesizing audio output from the generated sentences. Traditionally, games ship with pre-recorded dialogue, which can consume a large amount of data when there are hard limits on a game’s size. Voice synthesis is a relatively new technique still being researched, but a few programs in development could reduce this memory issue significantly. The first, Adobe’s VoCo, can replicate any voice after listening to twenty minutes of dialogue. Twenty minutes of dialogue is still a sizeable amount to store for the algorithm to run efficiently, and generating sentences costs rendering time that isn’t available under a game’s framerate requirements, so there is still work to be done. The second program, created by Lyrebird, aims at the same goal but is significantly quicker, replicating a voice from a single minute of dialogue. How closely these programs match regular human speech is debatable, but they have the potential to significantly reduce voice acting costs and, in combination with natural language processing, to let a game address situations developers did not anticipate or do not have the memory capacity to cover with stored dialogue.

In addition to dialogue, sound effects represent another component of procedurally generated audio. Sound effects represent interactions between the multitude of objects and materials in a game environment, in much the same way that foley effects such as footsteps and moves are created for television and cinema. With increasing processing power and memory, games are getting larger and more dynamic, which dramatically increases the number of possible interactions. Traditionally these interactions are programmed individually, but that task becomes less feasible under time constraints. Procedurally generated sound effects offer a viable solution, given enough processing power and memory dedicated to the role.

MIT’s artificial intelligence lab has created a program that behaves like a foley artist: it examines frames of a video along with sound patterns from different sources and replicates the audio waves that produce the analog sound. The program was fed 978 videos containing 46,620 actions and used deep learning to associate different sounds with different actions. In an online study, human subjects were nearly twice as likely to believe that the AI-produced sound, rather than the recorded sound, was the real one. In games, this algorithm can be refined further by taking the game state into account.

Normally the program has to analyze a frame to build context before producing a related sound, but a game already stores metadata for every object on screen and tracks interactions between objects through the physics engine. The context-analysis part of the algorithm is therefore no longer required, because the game can simply tell the program exactly what is interacting. This advantage extends even further in games whose objects are stylized or unrealistic.

[Figure: Object metadata for the camera stored in the Unity game engine]

For example, if a piece of wood falling onto a tile floor were rendered in an art style that the context-analysis part of the algorithm could not recognize, the physics engine would still register the collision, identify the object types of the terrain and the wood, and pass those labels as input to the rest of the algorithm. Another useful improvement games can make to the base algorithm is supplying additional variable data about the interaction. Object attributes include the velocities of the objects at the moment of collision, which the physics engine already processes; those velocities can also be used to predict the loudness of the sound the collision produces. Games may also track room sizes and the materials used for objects, terrain, or walls, allowing reverb to be added for an even more realistic feel. Depending on which characteristics developers associate with their objects, procedurally generated sound effects become an effective tool for the expanding size of games.
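
The snippet below is a hedged sketch of how such a collision callback might look, assuming invented material labels, a toy loudness formula, and an illustrative room-size-to-reverb mapping rather than any engine’s real audio API:

```python
import math

# Sound identifiers keyed by the pair of material labels (sorted so that
# ("wood", "tile") and ("tile", "wood") map to the same entry).
SOUND_BANK = {
    ("tile", "wood"): "wood_on_tile",
    ("metal", "tile"): "metal_on_tile",
}

def on_collision(material_a, material_b, impact_velocity, room_volume_m3):
    key = tuple(sorted((material_a, material_b)))
    sound_id = SOUND_BANK.get(key, "generic_impact")

    # Louder impacts for faster collisions (a toy mapping, not calibrated dB).
    loudness_db = 40 + 20 * math.log10(max(impact_velocity, 0.1))

    # Larger rooms get longer reverb tails (again purely illustrative).
    reverb_seconds = min(3.0, 0.2 + room_volume_m3 / 500)

    return {"sound": sound_id,
            "db": round(loudness_db, 1),
            "reverb_s": round(reverb_seconds, 2)}

print(on_collision("wood", "tile", impact_velocity=4.0, room_volume_m3=250))
```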

Finally, procedurally generated music completes the set of audio types most commonly used in games. Two general techniques are employed in procedural music, both acting on the mix. The first is to generate a piece from nothing; the second is to make existing pieces react to the scene. The techniques can be used together or separately, yielding an enormous number of possibilities that give the game an organic feel.

Procedural music composition carries a certain risk: what is composed may not reflect the motif of the scene it accompanies. Fortunately, an algorithmic composition system such as Sakka lets developers specify attributes of a piece before it is synthesized, reducing this risk while preserving the ability to generate the piece procedurally. Many of these attributes are the same ones found in traditional sheet music. For instance, developers may specify the scale of a piece, the number of notes to play, and a list of durations associated with those notes. The program then chooses which notes to play and matches the piece to the given specification. Other specifiable variables include pitch, velocity, and panning. This creates a flexible system that can generate melodies with controlled randomness.
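
The following sketch illustrates that kind of controlled randomness in Python; it is not Sakka’s actual interface, and the scale, note count, durations, and velocities are all placeholder values:

```python
import random

C_MAJOR = ["C4", "D4", "E4", "F4", "G4", "A4", "B4", "C5"]

def generate_melody(scale, note_count, durations, velocities=(64, 80, 96)):
    """Pick notes at random, but only within the constraints the composer set."""
    melody = []
    for _ in range(note_count):
        melody.append({
            "pitch": random.choice(scale),         # constrained to the given scale
            "duration": random.choice(durations),  # e.g. 0.25 = quarter note
            "velocity": random.choice(velocities), # MIDI-style loudness
        })
    return melody

for note in generate_melody(C_MAJOR, note_count=8, durations=[0.25, 0.5]):
    print(note)
```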

Similar to MIT’s sound effects algorithm, a lab at the Georgia Institute of Technology created a robot named Shimon, which was fed nearly five thousand songs of varying styles as well as over two million motifs, riffs, and licks. Given a seed and four measures as a reference, Shimon composes and performs the rest of a newly created piece. Over time, deep learning has allowed Shimon to evolve from playing monophonically to creating harmonies and chords. In addition, just like Sakka, Shimon can be fed notes of different lengths to produce music that favours those lengths. The algorithm behind Shimon offers useful refinements over the existing Sakka software: where Sakka’s notes are chosen more randomly from the given input, Shimon’s compositions are more directly inspired by existing songs.

The use of metadata and a database system can improve on this further. In a development environment, a program like Shimon’s can be fed very specific styles of music to draw inspiration from, much as temp music is used in cinema to associate a certain feel with a scene. In the same way that objects in a game carry data, a scene can be given a keyword, which can change as the game state changes. The same keywords would be stored in the metadata of songs that match them. Once a number of pieces carrying that keyword, in the style you are aiming for, are fed in, compositions will be procedurally generated in that style. Shimon’s algorithm can also accept specifications about the work, like Sakka’s existing features, such as the timing or rhythm of notes.
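
As a small illustration of the metadata idea, the sketch below tags reference tracks with mood keywords and selects the ones matching the scene’s current keyword; the track names and keywords are placeholders, and a real pipeline would feed the selection into the composition system:

```python
REFERENCE_TRACKS = [
    {"title": "temp_track_01", "keywords": {"tense", "combat"}},
    {"title": "temp_track_02", "keywords": {"calm", "exploration"}},
    {"title": "temp_track_03", "keywords": {"tense", "chase"}},
]

def inspiration_set(scene_keyword):
    """Return every reference track tagged with the scene's current keyword."""
    return [t["title"] for t in REFERENCE_TRACKS if scene_keyword in t["keywords"]]

print(inspiration_set("tense"))  # ['temp_track_01', 'temp_track_03']
```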

Alternatively, a sound manager can choose among existing compositions to react to the game state as it changes. The first method of accomplishing this is vertical resequencing, in which compositions are created so they can be layered on top of one another to produce different levels of intensity for specific moods. For example, the lowest layer of a track may create ambiance with a single held string that persists throughout the scene. As intensity increases, a horn track may be added to the arrangement, with additional synths and percussion on top of that forming a combat track at the highest level of intensity. The second method is horizontal resequencing, in which entire compositions are strung together by a sound manager to meet the needs of the scene at any one time. As the game enters different states, the sound manager chooses a composition made to represent the new state. Since states can change fairly often, these tracks are usually kept short and looped until a new state is reached, at which point the sound manager either crossfades between the two compositions or waits until the loop of the first composition completes before starting the second.
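
The toy sound manager below sketches both approaches, assuming invented layer names, intensity thresholds, and state-to-track mappings, and printing where a real implementation would call into audio middleware:

```python
class SoundManager:
    # Vertical resequencing: stems stacked by intensity threshold.
    LAYERS = [(0.0, "ambient_strings"), (0.4, "horns"), (0.7, "synths_percussion")]

    def __init__(self):
        self.current_track = None

    def update_intensity(self, intensity):
        """Enable every stem whose threshold the current intensity has reached."""
        active = [name for threshold, name in self.LAYERS if intensity >= threshold]
        print(f"active layers: {active}")

    def enter_state(self, state, track_for_state, crossfade=True):
        """Horizontal resequencing: swap whole looped compositions on a state change."""
        new_track = track_for_state[state]
        if new_track != self.current_track:
            if crossfade:
                print(f"crossfading {self.current_track} -> {new_track}")
            else:
                print(f"finishing loop of {self.current_track}, then starting {new_track}")
            self.current_track = new_track

manager = SoundManager()
manager.update_intensity(0.8)  # all three stems active
manager.enter_state("combat", {"combat": "combat_loop", "explore": "explore_loop"})
```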

Collectively, the procedural audio techniques in use and in development have the potential to increase the scale and efficiency of sound created for games without audio teams having to grow in proportion to the games themselves. Music uses sound managers to dynamically change the experience of the game state with existing pieces, and programs such as Sakka to generate compositions from scratch, possibly integrating algorithms like Shimon’s to improve on that method. Sound effects have been replicated from an adequate amount of sample material well enough to fool test subjects, more often than not, about which effect is real; employing such a program in a game engine would let it handle the multitude of potential effects in a more streamlined fashion than the original algorithm. Finally, dialogue can be generated by analyzing the game state with artificial intelligence techniques such as fuzzy logic and finite state machines, combined with natural language processing. Once a sentence string is chosen, experimental techniques from Project VoCo and Lyrebird offer a way to synthesize the audio. Taking advantage of these procedural audio generation techniques allows the game development industry to continue to mature, despite the complexity of games in this age.

Bibliography

AdobeCreativeCloud. “#VoCo. Adobe MAX 2016 (Sneak Peeks) | Adobe Creative
Cloud.” YouTube, YouTube, 4 Nov. 2016, www.youtube.com/watch?v=I3l4XLZ59iw.

Dent, Steve. “Machines can generate sound effects that fool humans.” Engadget, 14
July 2016, www.engadget.com/2016/06/13/machines-can-generate-sound-effects-that-fool-humans/.

Di Prisco, Rom. “Adaptive Music.” INFR4391. Music Composition and Sound Design for Games, 10 Oct. 2016, Oshawa, University of Ontario Institute of Technology.

Frishert, Stijn. “Implementing Algorithmic Composition for Games.” Utrecht School of the Arts, 12 Aug. 2013, stijnfrishert.com/wordpress/wp-content/uploads/implementing_algorithmic_composition_for_games.pdf.

Krotos. “AI: The Future of Sound Design | Krotos News.” Krotos, 22 Sept. 2017, www.krotosaudio.com/2017/09/21/ai-the-future-of-sound-design/.

Lu, Fletcher. “Decision Making - Finite State Machines.” INFR4320. Artificial Intelligence, 24 Sep. 2016, Oshawa, University of Ontario Institute of Technology.

Lu, Fletcher. “Fuzzy Logic in Games.” INFR4320. Artificial Intelligence, 16 Oct. 2016, Oshawa, University of Ontario Institute of Technology.

Lu, Fletcher. “Natural Language Processing.” INFR4320. Artificial Intelligence, 8 Nov. 2016, Oshawa, University of Ontario Institute of Technology.

Maderer, Jason. “Robot Uses Deep Learning and Big Data to Write and Play its Own Music.” Georgia Tech News Front Page, 13 June 2017, www.news.gatech.edu/2017/06/13/robot-uses-deep-learning-and-big-data-write-and-play-its-own-music.

Vincent, James. “Lyrebird claims it can recreate any voice using just one minute of sample audio.” The Verge, The Verge, 24 Apr. 2017, www.theverge.com/2017/4/24/15406882/ai-voice-synthesis-copy-human-speech-lyrebird.
