When the machine speaks, we are compelled to ask - who are you?
The sculptural multi-channel sound installation Mouthpiece explores the unease and unaddressed intimacy of synthesised voices. It challenges paradigmatic, asymmetric power relationships between the silent operator of a vocal synthesiser and the voice it produces. Mouthpiece delves into the fleshy and personal relation of holding another human’s throat in your hands, of assuming their identity, of wearing the skin of their voice.
In the following text, I will reflect on the genesis, presentation, and context of my artwork, positioning it as a performative, critical reflection on the philosophical, social and media-historical dimensions of synthesising the human voice.1 This research is informed by media archaeology in that it seeks to understand new media and technologies through non-linear connections with their outmoded ancestors. To that end, I will map relations between past and contemporary manifestations of a range of voice replication technologies, considering their social, political and cultural dimensions. This includes contemporary systems such as chatbots, digital assistants, text-to-speech systems, and machine listening applications, as well as the remarkably long history of devices involved in the challenge of mechanically replicating human speech. As a sound artist using synthesised voice as sonic material, I wish to weave connections across unnecessary dichotomies in voice studies – between materialist approaches and humanist understandings of the voice. By infusing vocal engineering with poetics and social critique, I position this work at the intersection of materialism and symbolism - of anatomy, technology, and social politics.
Following a short description of the installation Mouthpiece in section one, in section two I will touch on some historical examples of the vocal tract’s central role in vocal synthesis and how replication of the human throat continues to play a role in contemporary speech research. Here, I will ponder the idea that in replicating this unique part of a human’s body, we could also gain access to some part of their person. After some reflections on my personal interaction with the human model of my own speaking machine, in section three, I will focus on the ways vocal identities, similar to the vocal tract, can be fragmented or abstracted, made available for possible use or exploration by a speaking machine’s engineer or user. I will trace connections between contemporary and historical speaking machines which have used the perceived identities of their machines as a mask, as a costume, as a 'skin' to be read by other humans within the complex of our social, cultural, gendered and racial codes. In section four, I will return to a more detailed description of Mouthpiece’s presentation and conception, discussing the ways this piece wrestles with the complexities outlined in the first sections of the paper with reference to theories of voice from Bulut, de Certeau, Cavarero, Barthes and LaBelle.
Fundamental to these theories are conceptualisations of the voice as a marker of subjecthood or self, as an index of the body, and as an entity which operates in the interstice between bodies. I will position vocal synthesisers and simulators as philosophical and political machines: non-human entities which, in the act of voicing, intrude into this arena of subjecthood, corporeality and social relation. In the act of apparent subjecthood being passed between bodies and machines, a number of questions can be raised. What power relationships are at play when I bring another’s voice, body or image to speech? What happens when bodies and voices are also parts of machines? And do I also invoke something of that person when I build or operate such a machine?
As the following text will outline, vocal synthesis methods often involve the dissection and melding of fragments of bodies and voices, allowing unreal fabrics of personhood and identity to be engineered and applied like a skin. With the installation Mouthpiece and the following written statement, I wish to draw attention to the strangeness and the problematics of such a practice, introducing glitch and noise to the deceptively ‘clean’ signal of the cybernetic voices which are increasingly present in our lives.
Mouthpiece was exhibited at the Collegium Hungaricum Berlin from May 26-29, 2022, as part of the Sound Studies and Sonic Arts Master Exhibition of the Universität der Künste Berlin. It consists of five sculptural elements, each approximately twenty centimetres in height, which act as acoustic filters for a spatial composition. These objects are 3D printed models of a woman’s throat. As she spoke words containing the five German vowel sounds - ‘Bahn’ (A), ‘Beet’ (E), ‘mit’ (I), ‘Offen’ (O) and ‘Butter’ (U) - lying flat in an MRI machine, a part of her was transformed into a speaking machine.
The exteriors of these printed copies have been melted or coated in layers of latex and paint to create organic and skin-like textures. Suspended at various ear-heights, their biological forms each have a unique mouth-like orifice, either gaping open or with pursed lips, and an undulating hollow passageway. This passageway bends to meet a clear plastic hose, curtained at the top by two testicle-like sacks, which winds downwards to the floor and connects each object to a minimalist white box. Inside, a talk box propels an audio signal through each of the hoses to resonate in the chamber above. One object has a higher pedestal - the tube protruding from its base is a white-coated metal conduit, which curves in a circular form until it meets a white plinth below.
The sounds emanating from these uncanny objects are at once digital and vocal and seem to be in conversation with each other across the room. Moments of silence are punctured with fragments of human timbre or the cracks of glitchy bites of sound. These interrupt and intersect with drones which may either ring out over the course of a human breath or continue in an impossibly extended phoneme-loop, with the perfect repetitions of this cycle generating a rhythm over time. These sounds were generated using an artificially intelligent neural network for sound synthesis, trained on a data set containing both my voice and the voice of the woman whose vocal tract is replicated in the piece. The chattering and droning characters seem to be in dialogue with one another, and as visitors enter the space, their movements determine the unique mix of sounds they hear in this conversation.
Meanwhile, somewhere in Saxony, the real human owner of this throat may be speaking to a colleague, singing in the choir, or greeting a checkout operator at the supermarket.
Since the earliest verifiable accounts of operational speaking machines in the late eighteenth century, the shape of the vocal tract, especially for producing vowel sounds, has been a key point of study and a starting point for the emulation of the human voice. At this time, there was a consistent approach to designing these machines which focused on understanding and replicating the anatomy of the vocal organs and their configuration. For example, Christian Gottlieb Kratzenstein’s ‘vowel pipes’ from 1779 consisted of a free reed used as an artificial glottis, combined with five pipes shaped according to the position of the vocal tract and its components; these, he expected, could be easily adapted by organ builders into an organ-like instrument (Hankins and Silverman 1995, 190). The accompanying essay to these devices, which set out to determine the nature of vocal vowel sounds, was founded in anatomical descriptions of what we now know as the vocal tract (ibid., 189).
In the 1780s and 1790s, the automaton builder Wolfgang von Kempelen used a similar premise to Kratzenstein’s to create a machine which not only emitted static vowel sounds but could also be physically manipulated by the operator to produce a flow of vowels and consonants, forming words and sentences. In his development of this speaking machine, Kempelen looked for mechanical substitutes to directly mimic the anatomical parts and the movements he deemed necessary in vocal production. For example, the lungs were substituted with a pair of bellows and tubes were used as nostrils (Hankins and Silverman 1995, 93). Over the following century, Kempelen’s accompanying text ‘The Mechanism of Human Speech’ would become a “milestone of early phonetics” inspiring and influencing generations of both linguists and physicists, who were varyingly interested in the origins and functioning of language, or in the acoustics of vocal sound production (Brackhane, Sproat, and Trouvain 2017, 23).2
In the 1820s, the acoustical physicist Félix Savart used a cast of a cadaver’s throat to support his theories of voice, arguing that the shape of the vocal cavity would allow an artificial larynx based on a hunter’s birdcall to assume the range of the human voice (Hankins and Silverman 1995, 199). At this point in the development of vocal synthesis strategies, a new relationship between body and machine began to emerge. In these cases, the human anatomy was no longer emulated in a generalised form; rather, an index, a carbon copy, of an individual’s throat was produced. Later in the nineteenth century, such exact replication would also be at the heart of the work of the French physiologist Georges René Marie Marage, whose speaking machine contained resonant cavities cast from moulds of mouths, including lips and teeth (ibid., 211).
Marage and Savart’s oral moulds cast a direct line to twenty-first century research in the fields of medical physics, archaeology, and speech science. In January 2020, a team of researchers working at the University of London used a Computerised Tomography (CT) scan to capture the vocal tract of a 3000 year old mummified individual, 3D print a replica of his throat, and synthesise a “vowel-like sound based on measurements of the precise dimensions of his extant vocal tract” (Howard et al. 2020, 1). The individual in question was Nesyamun, a priest who lived in Thebes in the eleventh century BC, and this experiment was widely publicised as the successful bringing of his silent voice back to life (BBC News 2020). Nesyamun’s identity was presented as an important reason for the study, not only because of the age of his body and his importance as a historical figure but because, according to the claims of one of the contributing scientists, David Howard, Nesyamun “…wish[ed] that his voice would somehow continue into perpetuity” (Fleur 2020). According to its conductors, the study therefore represents “the fulfilment of his beliefs” (Howard et al. 2020, 4).3
A very similar process of medical imaging, 3D printing, and excitation of the printed form was also used to create the Dresden Vocal Tract Data Set (DVTD), published in the same year. In this study, the 3D-printable scans of two individuals’ vocal tracts were made freely available for download and print from the comfort of one’s own home for education and research purposes.4 Despite the anonymity of the speakers represented in the DVTD, I was drawn to the idea of their vocal individuality. In the same way that audiences are drawn to the idea of being connected to Nesyamun through the possibility of re-sounding his voice, I too speculated that by sounding the 3D replica of the vocal tract which I held in my hand, I was somehow getting closer to the anonymous woman from whom it was moulded. Her identity needs to remain hidden here, so throughout the text I will call her Christine.
The perception of the voice as a unique connection to or marker of the individual subject has been a persistent paradigm of Western thought, despite its existential disruption by the invention of speaking machines. Aristotle’s (1907, II:420 b11-12) claim that voice emanates from beings with a soul continues to resonate in more modern conceptions of the subject, for example in the work of contemporary philosopher Adriana Cavarero (2005, 3), for whom the voice is “capable of attesting to the uniqueness of each human being”. However, the voice is not merely a passive characteristic. It is an active emanation from the body, passing through space before entering another. As Steven Connor (2000, 4) articulates, “my voice is not incidental to me; not merely something about me, it is my way of being me in my going out from myself”. In this way, as Freya Jarman-Ivens (2011, 3) underlines, voice is an entity which comes and goes, joining bodies to one another, opening a “third space” in between, and crucially, taking “part of my body with it - the sound of its own production”. My plastic replica of Christine’s vocal tract seemed to lend a new literalness to this idea of the voice carrying the body with it. I wondered if it also carried something of her personhood.
May 3rd, 2022: a video-chat with Christine
That object, the printed vocal tract, she tells me, it doesn’t have anything to do with her as a person. It’s been very precisely taken from her body, but it’s so abstract, it could be anyone.
Christine has been on both sides of the researcher/participant relationship, as both a modelled subject for the DVTD and as a speech scientist modelling another woman’s voice, yet the issue of subjectivity and personhood has rarely, if ever, entered her consciousness. When she was undertaking her PhD, the human subject of her research was also a fellow scientist, with a similarly uncomplicated attitude towards the vocal tract replication process. Christine says the mediation of the computer helped to keep the distance, much more than when she handled a cast of the subject’s teeth. She also wondered, but only sometimes, how it would feel if she did manage to synthesise a voice that really sounds like this person. What would the implications be? It’s better, she thinks, for humans to always know if they are speaking with a machine.
In the MRI machine, Christine swears she could feel the magnetic waves passing through her. It’s interesting, exciting, she tells me, a rare chance to look inside your own head. She endured the task of lying completely motionless, remaining disciplined, becoming one with the machine. Even a few millimetres of movement would blur the image. You have to retreat from yourself for the sake of science, she says, it’s a patient performance.
On other days, she takes part in a different kind of performance, adding her voice to a collective instrument - the choir. On these days off, she still thinks about the vocal tract when she sings, about her resonant body becoming one with those around her, the waves of sound passing through her, being a part of something immense.
The choice to hide the identity of the modelled speaker - while not without valid and evident reason - has arisen continuously with the increased use and development of many kinds of vocal synthesis, raising questions about subjectivity, agency, and power. When vocal synthesis is used in commercial contexts, the goal of producing the most natural-sounding voice is prioritised, as opposed to articulatory models like the DVTD, in which understanding and replicating the anatomy of speech production leads the process. Many commercial contexts use concatenative systems, in which the body and voice of an individual are nonetheless strongly implicated and represented in the outcome. In this case, a real human speaker is required to record many hours of speech, from which, to put it in the simplest terms, small units of speech are selected, and then re-ordered to reconstruct phonemes and words. Most voices of the digital assistants we use today have been made this way, with hours of invisible (but not silent) labour going into the creation of the voice in each language and accent. When it comes to our digital assistants, for example, the identities of each of these human speakers are not officially revealed by the companies but are instead impossibly homogenised into a singular constructed identity under names like ‘Siri’ and ‘Alexa’.
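The selection-and-reordering principle of concatenative synthesis can be sketched in a toy form. This is only an illustrative sketch, not any company’s actual pipeline: real systems hold many candidate units per phoneme and optimise for smooth joins, and the phoneme inventory and waveforms below are hypothetical stand-ins.

```python
# Toy sketch of unit-selection (concatenative) synthesis:
# small recorded units of speech are looked up and re-ordered
# to build words the speaker never actually said.

# Hypothetical inventory: phoneme label -> recorded waveform samples
unit_inventory = {
    "h": [0.1, 0.2],
    "e": [0.3, 0.4, 0.5],
    "l": [0.2, 0.1],
    "o": [0.5, 0.4, 0.3],
}

def synthesise(phonemes):
    """Splice recorded units together in a new order."""
    waveform = []
    for p in phonemes:
        waveform.extend(unit_inventory[p])  # append the unit's samples
    return waveform

# The speaker's recorded fragments are stitched into "hello",
# a word absent from the original recordings.
audio = synthesise(["h", "e", "l", "l", "o"])
```

Even at this scale, the ethical point of the passage above is visible: the output voice is entirely composed of the original speaker’s body-sounds, yet the utterance itself is authored by whoever arranges the units.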
Although the individual identities of the modelled speakers are masked for such applications, the cultural and social identifiers which we commonly inscribe in the voice, such as age, gender, sex, class and ethnicity, take on a prominent role nonetheless. In 2018, it was estimated that over 92% of the U.S. market share for smart-phone assistants had voices which are commonly perceived as feminine (Robison 2020).6 Apple, Amazon and other major corporations behind these digital assistants have also come under criticism in recent years for reproducing gender stereotypes with their bots by gendering them artificially, using high pitched voices, some traditionally female names and even “an often submissive or even flirtatious style” (Specia 2019). Here, we see a parallel between contemporary service bots and the telephonic voices of twentieth century switchboard operators. These women were selected to embody the supposedly feminine qualities of helpfulness and altruism and were considered more palatable for potentially disgruntled customers. The predominantly male staffing of contemporary computer engineering teams also reveals a kind of ventriloquism, whereby machines which are perceived as feminine are controlled by male programmers, speaking in reconstructed micro-fragments of real women’s voices to control the way they interact with society.7
Although between 2017 and 2020 the companies behind Siri, Alexa, Cortana and Google Assistant responded to such criticism by changing the way their bots react to harassment and flirtation (Robison 2020), recent topics in online forums suggest that contemporary applications providing social chatbots, such as Replika, are being used by men to create ‘virtual girlfriends’ whom they can abuse at whim, posting the results online for the supposed amusement or camaraderie of others (ReplikaSingularity 2020; Taylor 2022; Bardhan 2022). While not an exclusively audio-based tool, Replika features vocal synthesis technologies which allow users to speak to their artificially intelligent (girl)friends on the phone. These bots, like their grandmother - the first chatbot, ELIZA - are designed to be socially engaged listeners and are marketed on their website (Replika 2022) as “the AI companion who cares”, who is “always on your side”.
The very first phonographic recordings of women’s voices were also products of many hours of hidden vocal labour (The Henry Ford Museum of American Innovation, 2019). Thomas Edison’s talking dolls, manufactured and sold from 1888 to 1891, featured the voices of up to eighteen young women who worked in factory cubicles, yelling nursery rhymes for hours to individually record them onto miniature wax phonographs, which would be placed inside the tin chest of humanoid dolls. Through the re-appropriation of their voices, these women too had their identities subsumed into that of a machine-companion. Much like our contemporary digital devices, these dolls were used to entertain children, with the labour of singing nursery rhymes being passed from the mother to the women in Edison’s factory. It was hoped that these machines could help the children learn and sing their prayers, perhaps even take them to bed without human supervision. One doll hauntingly recites: “Now I lay me down to sleep / I pray the Lord my soul to keep / If I should die before I wake / I pray the Lord my soul to take”.
Brian Kane’s (2014) analysis of the machine that speaks or writes in the first person, using the pronoun I, is illuminating in this context. Referencing the linguistic theory of ‘shifters’ – terms whose meaning shifts according to the situation or speaker – Kane (ibid., 185) argues that the use of this pronoun “point[s] equivocally back towards its source”, which could be understood either as “the soul that animates the machines itself” or the soul of the person whose voice is reproduced by the machine. But as the voice used to speak for our digital assistants differs based on the location of the user, when Siri says I, the ‘soul’ referred to, if not that of the machine itself, is not a singular identity, but is rather made up of fragments of mostly anonymous individuals. R. Murray Schafer’s (1997, 90) term ‘schizophonia’, which he used to underline the detrimental effects of separating a natural sound from its source via recording or transmission, may be appropriated here to evoke this fragmentary split of the subject represented by the ‘I’ when Siri speaks.
It is hard to imagine how a speaking machine like Siri could be related to Kempelen’s historical speaking device. Perhaps our digital assistants, with their hidden fragments of human voice inside, have a stronger resonance with Kempelen’s other famous automaton – the Mechanical Chess-playing Turk – an elaborate hoax which fooled many at the time who could not identify the human operator hiding inside. In Siri too, human labour and identities are hidden in the effort to create a more seamless presentation of the machine. The identities of the speakers are dissolved into the brand-identity of the product they speak for.
There is, however, another trend in contemporary voice synthesis, in which the choice has been to reveal aspects of the identity of the modelled speaker, to very different effect. Consider the web-based company Lovo.ai, which advertises itself as a “Next-generation AI Voiceover & Text to Speech Platform with human-like voices” (LOVO 2022a). LOVO offers a text-to-speech system (TTS) marketed as a cheaper and faster alternative to hiring a voice actor, aimed at “film makers, advertisers, game developers and creators” (ibid. 2022b), with which, like many similar products, the user can select a voice to read a script. Unlike other TTS models, however, LOVO has created a self-contained digital marketplace in which users can subscribe not only to buy the use of voices, but can also create their own ‘voice skins’ to be added to the market.8 The website encourages potential customers to “create your own AI voice-double for free… start selling worldwide & make $$$ while you sleep!” (ibid. 2022c). LOVO offers a 50% revenue share per use of your voice, and the marketplace currently features 180 voices, listed with names, photographic headshots, a flag icon to indicate the spoken language or accent, and keyword tags, like #Female #Young Adult #Ad #Cheerful #Engaging #Excited, or #Male, #Middle-Aged, #Audiobooks #E-Learning, #Low-Pitched #Powerful. After hearing the chosen voice speak the provided text, the user can add emphasis to or even manually edit the pronunciation of specific words.
Scrolling through the names, faces and flags in the LOVO digital marketplace, I am reminded of Joseph Faber’s talking machine ‘Euphonia’ of 1845, which came with two possible masks to be attached in front of the moving apparatus of the mechanical mouth: the face of a woman, and an inaccurate stereotype of a ‘Turk’. Characterising humanoid automata as women, children, and people of colour was a practice with a strong tradition by this stage (the Turk was perhaps a nod to Kempelen’s infamous chess-player), one which affirmed their ‘otherness’ in relation to the white male inventor and, in the context of the machine’s operation, placed them under his control (Adnet 2020, 32; Sebes 2020, 11).
Euphonia was exhibited in 1846 in the Egyptian Hall in London, a space to which colonized individuals and communities were repeatedly brought for exploitative and sadistic display. The space hosted human-zoo-style exhibitions of indigenous peoples, including Laplanders in 1822, Inuit in 1824 and Ojibbeways in 1843 (Peters 2020, 142). This space was a central locale at a time when the idea of ethnography was in its infancy and when a form of White Supremacy which was “presumably backed by ‘scientific’ evidence” was on the rise (Altick 1978, 279). In the same year in which the speaking machine Euphonia was displayed at the Egyptian Hall, two orphaned indigenous San children, an eight-year-old and a fifteen-year-old, performed there as ‘curiosities’ alongside a baboon after serving as ‘living illustrations’ to a paper presented before the Ethnographical Society (ibid., 279–80). The Society had been founded in 1843 and was placed under the heading ‘Zoology and Botany’ (ibid.). In such a context, colonialist projects were aided by public “commodification of humans and nature more broadly” (Peters 2020, 143).
In this hall, visitors paid one shilling to witness Joseph Faber’s Euphonia in the act of mechanical song and speech in multiple languages. The voice of Euphonia was produced without the aid of any real human voice, and did not change according to the costume attached. Nonetheless, the attachment of the masks - this superficial image, this costume or ‘skin’ - to the front of its large mouth, whose jaws opened to expose “artificial gums, teeth, and all the organs of speech” (Hollingshead 1895, 1:68), allowed its operator to invoke another’s identity, for commercial purposes, through the use of simulated speech. In the context of the Egyptian Hall, these masks were perhaps designed to increase the entertainment value or relatability of the machine, which, despite Faber’s best attempts, was perceived as more haunting than amusing by most accounts.
Euphonia’s masks and LOVO's voice skins exemplify a transhistorical paradigm of speaking machines being used like a vocal mask, as a tool to invoke the identity of another. This act naturally brings up questions of power, in relation to stereotyping and exploitation. It is important to recognise that the names, faces, and binary gender labels on Lovo.ai, while in some ways uncovering the ‘source’ of the technologically mediated voices, do little to reveal the complex identities of the people represented there. Instead, identities are impossibly compressed into categories to turn qualities of voice and personality into selling points, synonymous with brand identity. Like the masks of Euphonia, in the LOVO voice skin marketplace, the identity of the individual becomes a costume worn by the company that employs it. But, as Nina Sun Eidsheim (2019, 3) argues, neither voice nor vocal identity is “situated at a unified locus that can be unilaterally identified”. Our culturally derived systems of gender, race, age and ‘likeability’ in voice create taxonomies to which our real voices often do not conform, enforcing ideas of essentialism which prescribe, for example, ranges of pitch to gender and timbre to race.
This codification of the voice, of affect and of identity is also taking hold in the burgeoning field of machine listening, with an increasing number of artificially intelligent software products being used to purportedly “evaluate – and to predict – a speaker’s mood, personality, truthfulness, confidence, and mental health” as well as immigration history “based on algorithmic evaluations of the acoustic parameters of the voice” (Feldman 2016, 2. See also Abu Hamdan 2014). Jessica Feldman (2016, 18) applies Marx’s conceptualisation of commodity fetishism to this trend, highlighting that in the “experience of being estranged from, yet materialised in, the commodity…the subject begins to see himself the way he sees his products: as a source of profit. Eventually, he regards others the same way.” But under post-industrial capitalism, the most significant sources of profit have shifted from the production of physical objects to the immaterial production of data about our behaviour, psychology, politics and identity. In this way, the differences between products, individuals, and inner experience have become blurred (Luciani 2022). In this network of social and economic forces, both ‘voice skin’ products and machine listening algorithms commodify and ingurgitate identities and selves. They “assert certain forms of recognition of the self and other” (Feldman 2016, 3) by upholding and enforcing taxonomies of voice which are determined by the capacities of the computer, market-based value systems, and problematic border politics. In this process, “Our very humanity and relational capacities become alienated as a potential source of profit” or perceived risk (ibid., 18).
With these digital voicing and listening technologies becoming increasingly present in fields as diverse as personal assistance, friendship, voice acting and immigration policing, it is urgent to critically examine the way complex identities are being forced into algorithmic taxonomies of voice. As the following final section of this text will outline, while Mouthpiece does not address these topics head-on, it starts one part of this process by de-naturalising our relationships to synthesised voices and highlighting the impossibility of truly speaking in the voice of another. Paradoxically, it does this by returning to the flesh, by reminding us of the body and the organs from which the voice comes, and by plainly exposing the stitches between the skins of the many voices it wears.
From the beginning of the process of making Mouthpiece, I wanted to create my own speaking machine – not a vocal synthesiser per se - but a material body imbued with a sense of subjecthood through its vocalisation. I also knew I wanted to work with the vocal tract models from the DVTD, and therefore, that I would be dealing with something I saw as a trace of a real, living, individual woman. My goal was not to provide answers to all the questions raised by such a practice, but rather to continue asking questions, and to inspire these queries in the minds of my audience.
In contrast to the doctrine of scientific objectivity, which prescribes a depersonalized distance and thereby isolates itself from the fleshy, personal and intimate nature of the voice, Mouthpiece borrows bodily artifacts taken from scientific contexts and highlights them as parts of a person. As my video-call conversation with Christine revealed, this personhood or subjectivity is an aspect that I myself have invoked. While the objects are copies which come from her, the idea of them being part of her is something I project onto them. For this reason, in this piece I acknowledge a bi-fold subjectivity: Christine’s, which is engendered through the presence of a part of her body and a part of her voice, and my own, which shaped the form and content of the work. Unlike the silent, hidden or disassociated user of many other vocal synthesisers described in the previous sections of this text, I wanted my voice to be implicated in the work too, for the risk and vulnerability of the vocalisation to be shared. Mouthpiece was to be a work which makes the speaking machine more transparent, exposing the bodies and the operators behind it.
The sounds I have used to create the composition come from a range of sources, the primary one being a recurrent neural network (RNN) for sound synthesis called ‘PRiSM Sample RNN’.9 Other sonic components include breathing (produced naturally and with mechanical and electric pumps), nonsense phonemes created by testing the limits of commercially available TTS platforms, and my natural voice. Some natural vocal recordings are taken from the training data I provided to SampleRNN, while others are vocal imitations of the resulting output. With this reflected mimicry, a feedback loop is forged between human and machine, where each purports to speak for the other.
SampleRNN is a computer-assisted composition tool which uses artificial intelligence to generate “new audio outputs by ‘learning’ the characteristics of an existing corpus of sound or music” (‘PRiSM SampleRNN’ n.d.). This is a recurrent, predictive process which finds patterns in the given audio (the dataset) and then generates sounds based on the likelihood of these waveforms occurring in the dataset. In my case, I trained SampleRNN on two voices, my own and the voice of Christine, to create a kind of mixture, or a child of our two voices. I also consciously used generations from the model which came from earlier stages of training, so that the likeness to the human voice was not yet well rendered, but remained at an underdeveloped stage of the system’s learning to sound ‘natural’. This produced what might be described as a ‘digital’ or ‘glitchy’ timbre, with fragments of a more natural sounding vocal quality still faintly perceptible. As a result, the traces of three agencies are present in these sounds – mine, Christine’s and the machine’s.
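The recurrent, predictive process described above can be illustrated in drastically simplified form with a toy next-sample model. This is emphatically not the PRiSM SampleRNN implementation, which is a deep recurrent neural network operating on raw audio; the order-1 frequency model below, with its made-up quantised ‘voices’, only sketches the autoregressive idea of generating each sample from the likelihood of what followed it in the training data.

```python
from collections import Counter, defaultdict

def train(dataset):
    """Count which quantised sample value tends to follow which."""
    transitions = defaultdict(Counter)
    for prev, nxt in zip(dataset, dataset[1:]):
        transitions[prev][nxt] += 1
    return transitions

def generate(transitions, seed, length):
    """Emit the most likely next sample at each step (greedy)."""
    out = [seed]
    for _ in range(length - 1):
        counts = transitions.get(out[-1])
        if not counts:  # no continuation learned for this value
            break
        out.append(counts.most_common(1)[0][0])
    return out

# Training on a mixture of two 'voices' blends their statistics,
# much as training on two speakers yields a hybrid timbre.
voice_a = [0, 1, 2, 1, 0, 1, 2, 1]
voice_b = [2, 3, 2, 3, 2, 3]
model = train(voice_a + voice_b)
hybrid = generate(model, seed=0, length=6)
```

Even this crude model shows the behaviours the piece exploits: an under-trained or impoverished model produces rigid, loop-like output rather than natural speech, and a model trained on two voices belongs fully to neither.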
Zeynep Bulut’s (2011) concept of the voice as skin – as both a boundary and a point of connection between inside and outside, and between various bodies – is useful for understanding this connection between agencies. In stark contrast to the ‘voice skins’ in the LOVO marketplace, which are treated as a surface but not a depth, Bulut (ibid., 36) uses the metaphor of the skin to cast the voice as “a mixture—a meeting zone”, “a shared – common – space” (ibid.) that “transmits a certain inside to outside, but becomes the very inside and outside at the same time” (ibid., 41). This contingency is at the core of the interaction between the three agencies in my piece. By creating a machine that fails to speak naturally in our two human voices, I wish to point to the discomfort and strangeness of ‘wearing the skin’ of another’s voice.
Following this theme of skin, I was inspired by eighteenth-century automata-builders who, in attempts to make machines that mimicked living things, often used systems and materials that seemed organic, such as vocal tracts. As Jessica Riskin (2003a, 112) states, in this era there was an “assumption that an artificial model of a living creature should be soft, flexible, sometimes also wet and messy, and in these ways should resemble its organic subject”. Riskin (2003b) has located the genesis of artificial life in a particular eighteenth-century automaton – Vaucanson’s defecating duck. Starkly contrasting the predominantly virtual, screen-based, or sleek-robotic manifestations of artificial intelligence today, this avian ancestor of artificial life was a machine that shat. At two ends of the spectrum of a human’s possible expulsions, the eighteenth-century invention of both defecating and speaking machines represented a test of the limits of the reproducibility of exceptionally organic or human capacities. Riskin also notes that the attempt to replicate such organic processes has made a comeback in recent decades with the emergence of computer-based neural networks, such as SampleRNN, which are structured in a way that reflects the behaviour of the human brain.10 In Mouthpiece, I used organic and skin-like materials, textures and colours in a confronting or even repulsive way to lean into this history of machines which imitated the organic. Layers of latex cover the 3D-printed, micro-polygonal plastic structure, lending it a more organic form and texture. The melted plastic on the outside of one tract undermines the perfect precision of 3D printing technology in favour of an image which might have something to do with consumption, digestion or expulsion, or the very fluidity of the voice.
The abject quality of the sculptures is heightened by the noisy nonsense of the vocal sounds which emanate through them. SampleRNN is very different to TTS systems, in that it does not contain information about the way the fragments of sound that it analyses and produces should be formed into language. Hence, the model can only generate what we might call the sonic patterns of an individual’s speech – the rhythm, timbre, intonation and so on, without any intelligible words. These tiny fragments of sound are no longer the building blocks of speech but rather the sounds of pure vocality.
This means that the vocal tracts speak in a kind of glossolalia, as theorised by Michel de Certeau (1996, 29): “a fiction of discourse [that] orchestrates the act of saying [l’acte de dire] but expresses nothing”, an expression of speech without logos. This division or distribution of voice into the bodily and the semiotic is at the core of many theories of voice, including Roland Barthes’ seminal theory of the grain of the voice. Barthes (1977, 182) considers two qualities here: on the one hand, the pheno-song, qualities which come from “the tissue of cultural values”, such as the “structure of the language… the composer’s idiolect, the style of interpretation”; on the other, the geno-song or grain of the voice, which stems from the very materiality of the body. Adriana Cavarero (2005), too, seeks to elevate the bodily qualities which are in excess of the semiotic function of voice, such as breath, to treat voice first and foremost as sound. But how are these categories affected in the case of the voice generated by SampleRNN? In the babbling machine, despite the lack of both embodiment and logos, there remains a sonic excess. I see this remainder as the remnants of both body and language – the shards left behind once these pillars are removed.
For a large portion of my process, I aspired to combine the vocal tract forms with mechanical breathing devices made of bellows, to connect the contemporary models of the vocal apparatus to their historical predecessors and to imbue the stillness of the objects with a lifeforce engendered from the movement of automated parts. I explored driving two separate bellows with different motor-driven mechanisms: one pulley-based system, and another with a crankshaft attached to a flywheel. The decision to leave these aspects out of the final work was driven by a range of technical factors, but ultimately, I found that the animism I sought did not require moving parts, but was amply provided by the vocalisation. The aspect of breath was further incorporated into the composition through the inclusion of recordings of these bellows in action. Thus, the machine-voice was accompanied by a machine-breath, giving the object a seemingly bodily presence. A sense of movement was also generated in the piece through multi-channel sound spatialisation, which I achieved using the software Spat as a Max/MSP plug-in, connected to the digital audio workstation where the piece was composed (Ableton Live). This software allowed me to have eight virtual sound objects: five fixed to the five vocal tracts, and three which I automated to move around the space throughout the composition, creating a sense of a shared spatial environment and a shared voice between the objects, as well as individual voices and a feeling of a ‘conversation’ between them.
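The spatial scheme described above – five fixed sources and three orbiting ones – can be sketched as a simple trajectory calculation. This is illustrative Python only: in the work itself the positioning was handled by Spat inside Max/MSP, and all numeric values here (radii, period, ear height) are hypothetical.

```python
import math

def circular_trajectory(t, period=30.0, radius=2.0, height=1.6):
    """Position (x, y, z in metres) of a moving virtual source at
    time t (seconds): a slow orbit around the listener at roughly
    ear height. All parameters are illustrative assumptions."""
    angle = 2 * math.pi * (t % period) / period
    return (radius * math.cos(angle), radius * math.sin(angle), height)

# Five static sources, one per vocal tract, arranged in a ring
# (positions hypothetical).
fixed_sources = [
    (1.5 * math.cos(2 * math.pi * i / 5),
     1.5 * math.sin(2 * math.pi * i / 5),
     1.6)
    for i in range(5)
]

def moving_sources(t):
    """Three phase-offset copies of the same orbit, so the roaming
    voices chase one another around the room."""
    return [circular_trajectory(t + offset) for offset in (0.0, 10.0, 20.0)]
```

In practice, coordinates like these would be sent to the spatialiser as automation data rather than computed in a script; the sketch only shows the geometry of fixed versus roaming voices.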
In my choice to suspend the objects from the ceiling at a variety of human ear-heights, I have been inspired by Heather Dewey-Hagborg’s installation Probably Chelsea from 2017. This work features thirty 3D-printed masks of human faces, each modelled on different samples of Chelsea Manning’s DNA, which Manning sent to the artist from prison. These masks and the empty spaces that ‘wear’ them evoke the spectre of Manning’s free presence during her incarceration, although they bear little resemblance to one another. In this way, Dewey-Hagborg’s work bridges the fields of art and science, re-purposing and twisting a problematic police forensic system to highlight the complexity of human identities. I wanted visitors to my installation to have a similar, human-scale relation to these replicated biological objects. The intimacy and individuality of a visitor’s encounter with them is heightened by the experience of having the objects speak directly to them.
The talk boxes which reproduce this intimate chatter are, like the printed vocal tracts, transformed in their intended use. Designed as effects pedals for electric guitarists, these devices propel sound waves through surgical tubing into the operator’s mouth, using the oral chamber as a dynamic acoustic filter. Although the talk box usually sits outside the discourse around vocal synthesis specifically, it arguably belongs in this trajectory of technological development too, as it uses the vocal tract to make a kind of acoustic vocoder. By inserting the tube into the base of the model’s throat rather than the entrance to the mouth, in Mouthpiece the talk box performs its function upside-down, re-aligning it with the directionality of real vocal production (inside to out).
In Mouthpiece, my resolution was to make a speaking machine which performatively engages with many others of its kind, to reflect and distort the issues I saw emerging in the long-term development and use of these devices. While Judith Butler (2009, iii) holds that performativity reproduces power through norms, she also argues that there is an element of risk in this performance, that every act of reproduction “can and does go awry, undo the strategies of animating power, and produce new and even subversive effects.” In his text ‘The Lexicon of the Mouth’, Brandon LaBelle (2014, 70) draws on this aspect of Butler’s theory to highlight the potential of vocal nonsense and clownery of the mouth to challenge norms and power structures.
…clowning is a gestural vocabulary based on becoming other than (properly) human, an other that is most often at odds with productive mechanics—of labor and capitalism, of normative behaviour and gender identity. Clownery, in other words, articulates a vulnerable body. In doing so, it animates and reanimates the body in parts, a body coming undone.
Mouthpiece reflects and distorts vocal synthesis methods with results that are both funny and disturbing, like a warped reflection in an amusement park mirror. The vocal tract models, which had been objectified in their abstraction from Christine’s body, are reanimated through voice, reconnected with vulnerability through strangeness, fleshiness, and humour. In embracing the unruliness of this performance, I aim for Mouthpiece to be a kind of speaking machine which performs vocal synthesis ‘incorrectly’, thus challenging the norms and power relations ingrained in the history of this practice.
Beyond the initial exhibition of Mouthpiece at CHB in September 2022, the piece has been developed and presented as a performance at Sonoscopia, a residency space, venue and platform for experimental music in Porto, Portugal.
In this iteration of the work, my own body takes the place of one of the five models. Using a combination of a talk box and a microphone directed into my mouth, I build a disconcerting asynchronicity between sound and image, as well as between natural and synthesised voice. My voice is ventriloquised live from the mouths of the four remaining models, while they continue to chatter and hum with the AI-generated voices that were present in the installation. These pre-recorded sounds are also directed into my mouth via the talk box, filtered by the changing shape of my vocal cavity, and then routed out to the models again. This complex flow of voices between natural and virtual sources is controlled in real time during the performance using the software Spat. When I route the signal from my microphone back into my mouth, my voice is transformed into feedback, producing a screeching, electronic regurgitation.
In this configuration of the piece, there is an ambiguous relationship between the models and my body. The sculptures could be interpreted as extensions of my flesh, as independent entities or as something in-between. While in the white cube gallery, the uncanny flesh of the models is starkly exposed in cool light, in the black box they loom ominously out of the darkness. I followed and played with this theatricality in my performance, allowing the strangeness of the voices passing through my mouth to feel like a possession, inverting the power relationship of operator and machine.
With this text, I have navigated through the sprawling histories, dynamics, questions, and practices which have informed and shaped Mouthpiece. Touching on the age-old idea that the voice and subject are deeply intertwined, I have speculated that in the case of many contemporary vocal synthesis techniques, the speaking machine may retain the trace of the modelled speaker. In my choice to work with models of Christine’s vocal tract, I thus engaged with the idea that a part of her could be contained in the work. The question of whether this is a projection of my own is left open in the piece, but the installation remains imbued with this possibility through a sense of haunting, a strangeness engendered by the organic textures, the babbling machine-voices, and the uncanny forms from which they emanate.
As speaking machines become inconspicuously integrated into our lives, we greet them with an increasingly numb familiarity. As Steven Connor (2000, 411) articulates, after centuries of the disembodied voice being treated either as super-, sub- or in-human utterance, we seem to have become adjusted or naturalised to the synthetic voices around us. “We have been severed”, he argues, “not from our voices, but from the pain of that severance”. With my installation, I wish to turn our attention toward this place of partial severance of the voice, to introduce feeling where the numbness has taken root.
With this text, I also wish to cast a light on the forces at play behind this apathy. As an artist and an artistic researcher, it is important to me to not only express these ideas with my installation, but to write about the histories, relations and politics which have inspired it. I understand contemporary art practice and media archaeology as parallel endeavours, with the material, sonic, and spatial possibilities of artistic practice acting as a kind of performative writing, a way of making non-linear connections between past and present media apparent by constructing entities which embody and enact these relations. The practice of writing about these connections helps to elucidate them in a more explicit way, building clarity in my own understanding of my studio practice, and creating a feedback loop of exploration and apprehension between written, sonic and material forms of knowledge.11
This text has considered speaking machines as entities performing the role of a subject, and highlighted that in this relation, othered identities are more often invoked to be the voice of these non-subjects. The vocal identities invoked by speaking machines matter, not just because representation matters, but because the voice is a powerful tool to craft the likeness of a human, and humans are far more complex than our current algorithms can convey. This text and the work Mouthpiece are experiments nestled in this thorny complexity. Through this piece and further iterations of it, I wish to play with a kind of ventriloquial identity that is not glossed over, exploited, or capitalised on, but which is a strange and vulnerable speculation, a fragile, pieced-together vocal skin made up of several selves.
Aristotle. 1907. De Anima, Book II. Translated by R. D. Hicks. Cambridge: Cambridge University Press.