Voice Chatting with GPT-3

OpenAI’s GPT-3, and more recently ChatGPT, have made a large impact on people’s imaginations, hopes, and fears concerning AI. The apparent naturalism of the text generated by these and similar Large Language Models is indeed impressive, to the extent that some even want to attribute consciousness or personhood to them (the short answer is “no”).

DALL-E visualization of Susan, an AI assistant, prompted by her own description “If I had a physical form, I’d probably look like a robot with a sleek, modern design. I’d be about 5 feet tall, with a smooth and shiny exterior. My eyes would be glowing blue, and my voice would be soft and comforting.”

I have been building a program, in Python, to serve as a voice interface to GPT-3. OpenAI provides an API to access GPT-3 in the cloud, which is fairly easy to use, at least for generating text with the pretrained model. There are several ‘flavors’ of GPT-3, of varying sizes; I am using the Davinci model, which is the largest, sporting 175 billion parameters and trained on terabytes of textual data.

The code may be found here.

Process

The basic loop is to:

Speech recognition: transcribe the user’s speech to text.
Inference: send that text to GPT-3 and wait for the response.
Speech synthesis: convert the response text to speech, giving the model a voice.
Repeat until the user has said “goodbye”.
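
As a rough sketch of that loop (not the actual program), the three stages can be passed in as plain functions. The names chat_loop, listen, infer, and speak are hypothetical placeholders for the steps described in the sections below, and the check for the farewell word is an assumption about how the exit condition might be tested.

```python
from typing import Callable, List, Tuple


def chat_loop(listen: Callable[[], str],
              infer: Callable[[str], str],
              speak: Callable[[str], None],
              farewell: str = "goodbye") -> List[Tuple[str, str]]:
    """Run listen -> infer -> speak until the user says the farewell word.

    All three callables are hypothetical stand-ins for the speech
    recognition, GPT-3 inference, and speech synthesis steps.
    """
    transcript = []
    while True:
        heard = listen()              # 1. speech to text
        reply = infer(heard)          # 2. text completion
        speak(reply)                  # 3. text to speech
        transcript.append((heard, reply))
        # The model still gets to answer the farewell before the loop ends.
        if farewell in heard.lower():
            break
    return transcript
```

Keeping the stages as injected functions means the loop can be tried out with typed input and print() before any audio hardware is involved.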

Speech Recognition

Speech recognition is currently done with Vosk, a toolkit that converts speech to text offline, on the local machine. This has the advantage that the speech need not be processed in the cloud, saving some latency. It is also lightweight, in that it can run on smaller devices such as the Raspberry Pi. Audio is streamed from the microphone with the PyAudio package.

First, a stream is opened from the microphone:

import pyaudio

# Open a 16-bit input stream from the default microphone.
mic = pyaudio.PyAudio()
stream = mic.open(format=pyaudio.paInt16,
                  channels=self.config['channels'],
                  rate=self.config['rate'],
                  input=True,
                  frames_per_buffer=self.config['chunk'] * 2)
stream.start_stream()

Then a loop collects the audio and converts it to text:

import json

# Feed audio chunks to Vosk until it reports a complete utterance.
while True:
    data = stream.read(self.config['chunk'])
    if self.recognizer.AcceptWaveform(data):
        result = json.loads(self.recognizer.Result())
        break
return result['text']
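
One refinement worth considering: the recognizer may finalize a result whose text is empty, for example after a stretch of silence, and that would send an empty string on to GPT-3. The sketch below is a variation on the loop above, not the program's actual code. It keeps listening until some text arrives. The function name transcribe_utterance is hypothetical; it assumes only that the stream has a read() method and the recognizer has AcceptWaveform() and Result(), as the PyAudio stream and Vosk recognizer used above do.

```python
import json


def transcribe_utterance(stream, recognizer, chunk: int) -> str:
    """Read audio chunks until the recognizer yields a non-empty utterance.

    `stream` needs a read(n) method and `recognizer` needs
    AcceptWaveform(bytes) and Result() returning a JSON string
    with a "text" field.
    """
    while True:
        data = stream.read(chunk)
        if recognizer.AcceptWaveform(data):
            text = json.loads(recognizer.Result()).get("text", "")
            if text.strip():
                return text
            # Empty result (e.g. silence): keep listening.
```

Because the function depends only on those methods, it can be exercised with simple stand-in objects when no microphone is attached.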

Inference on GPT-3

The text is added to the current context (an initial prompt that instructs the model, plus the conversation so far) and submitted to the OpenAI API for text completion. Essentially, the context is a story, and the model predicts the most likely continuation of that story, as a human might. GPT-3 is an autoregressive model: each output becomes part of the input at the next iteration. The text is generated token by token (tokens may be complete words or parts of words).

import openai

# Request a completion of the prompt from the configured GPT-3 engine.
response = openai.Completion.create(
    engine=self.gpt3_config['engine'],
    prompt=prompt,
    temperature=self.gpt3_config['temperature'],
    max_tokens=self.gpt3_config['max_tokens']
)
return response.choices[0].text
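
To make the autoregressive idea concrete, here is a toy illustration. It is in no way how GPT-3 is implemented: the "model" is just a lookup table keyed on the last token, and the names generate, toy_next_token, and the "<end>" stop token are all invented for the example. What it does show is the loop itself, in which every predicted token is appended to the input before the next prediction is made.

```python
def generate(next_token, prompt_tokens, max_tokens, stop_token="<end>"):
    """Autoregressive decoding: each predicted token joins the input."""
    tokens = list(prompt_tokens)
    generated = []
    for _ in range(max_tokens):
        token = next_token(tokens)    # predict from everything so far
        if token == stop_token:
            break
        tokens.append(token)          # the output becomes part of the input
        generated.append(token)
    return generated


# Toy stand-in for a language model: a lookup keyed on the last token only.
TOY_TABLE = {"hello": "there", "there": "!", "!": "<end>"}


def toy_next_token(tokens):
    return TOY_TABLE.get(tokens[-1], "<end>")


print(generate(toy_next_token, ["hello"], max_tokens=10))  # ['there', '!']
```

The max_tokens argument plays the same role as the parameter of that name in the API call above: it caps how long the continuation may run.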

Speech Synthesis

Once a text response is received, it is converted back to speech using the pyttsx3 package. Once the speech engine has been initialized, it is as simple as passing the text as a string to the say() method and waiting for it to finish.

self.engine.say(text)
self.engine.runAndWait()

On Windows, which I am using here, pyttsx3 uses the SAPI speech synthesis built into the operating system.
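
For completeness, initializing the engine might look something like the following. This is a sketch rather than the program's actual setup: make_engine and speak are hypothetical helper names, the optional speaking rate is my own addition, and skipping blank replies is an assumption about what one would want when a completion comes back as whitespace only.

```python
def make_engine(rate=None):
    """Create a pyttsx3 engine, optionally setting the speaking rate.

    The import is deferred so this module loads even where pyttsx3
    is not installed.
    """
    import pyttsx3  # third-party package
    engine = pyttsx3.init()
    if rate is not None:
        engine.setProperty("rate", rate)  # words per minute
    return engine


def speak(engine, text):
    """Speak the text and block until finished; skip blank replies."""
    text = text.strip()   # completions can arrive with leading newlines
    if not text:
        return False
    engine.say(text)
    engine.runAndWait()
    return True
```

With these in place, speaking a reply is a single call such as speak(make_engine(), "Hi there!").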

And that is about as simple as it gets. With a little creative prompting, one has a reasonable chatbot working. But here is where the magic begins.

Prompting

To produce a ‘chatbot’ with particular attributes rather than just the generic language model, one prompts it with the desired traits. This is a basic form of prompt engineering, using the one-shot learning capability of models such as GPT-3. One also has to maintain some continuity of conversation, which is accomplished by prompting the model not just with the most recent query or reply from the user, but also with the previous interactions (up to the input limit of the model, in this case 2048 tokens). Combined, the chatbot’s description and the previous conversation form the context to which the model replies, essentially continuing a story or script. For example, at a certain point the model was prompted with this context:

The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly. The assistant's name is Susan.
[human] hello Susan
[AI] Hi there! How can I help you today?
[human] can you maybe describe yourself to me and how you look and things like that?
[AI] Sure thing! I'm an AI assistant, and my job is to help you with whatever you need. I'm a virtual assistant, so I don't have a physical form. I'm friendly, helpful, and always ready to lend a hand!
[human] okay, but if you did have a physical?

The model then continues the dialog with the description used later to generate an image (using a different AI, OpenAI’s DALL-E):

[AI] If I had a physical form, I'd probably look like a robot with a sleek, modern design. I'd be about 5 feet tall, with a smooth and shiny exterior. My eyes would be glowing blue, and my voice would be soft and comforting.
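
Assembling that context could be sketched as below. This is an illustration under stated assumptions, not the program's actual code: build_prompt and estimate_tokens are hypothetical names, the figure of roughly four characters per token is only a rule of thumb (a real tokenizer would be more accurate), and reply_budget is an invented parameter that reserves room for the model's answer. When the estimate exceeds the budget, the oldest turns are dropped first while the description and the newest user line are always kept.

```python
def estimate_tokens(text):
    """Rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def build_prompt(description, turns, user_text,
                 token_limit=2048, reply_budget=256):
    """Join the description, recent turns, and the new user line.

    `turns` is a list of (speaker, text) pairs, where speaker is
    'human' or 'AI'. Oldest turns are dropped until the estimated
    size fits within token_limit minus reply_budget.
    """
    lines = [f"[{speaker}] {text}" for speaker, text in turns]
    lines.append(f"[human] {user_text}")
    budget = token_limit - reply_budget
    while len(lines) > 1:
        prompt = "\n".join([description] + lines)
        if estimate_tokens(prompt) <= budget:
            break
        lines.pop(0)                  # forget the oldest turn first
    return "\n".join([description] + lines)
```

With the completion API used above, one might also pass a stop sequence such as "[human]" so that the model does not go on to write the user's next line as well; whether that is needed depends on the prompt and the model.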

In testing, I had a brief discussion with Susan about the meaning of life and meditation that was fairly naturalistic. It is transcribed here from a natural voice conversation.

[AI] Hi there! How can I help you?
[human] what is your name and what do you do
[AI] My name is Susan, and I'm an AI assistant. I'm here to help you with any questions or tasks you need assistance with.
[human] can I ask you about philosophy
[AI] Absolutely! What would you like to know about philosophy?
[human] well what is the meaning of life
[AI] That's a big question! Philosophers have been debating the meaning of life for centuries. Ultimately, it's up to each of us to decide what the meaning of life is for ourselves.
[human] and how can we find that out
[AI] That's a great question. A lot of people find their own meaning of life through self-reflection, meditation, and spiritual practices. Others find their meaning in their relationships, their work, or their hobbies. Ultimately, it's up to each of us to find our own meaning of life.
[human] can you tell me more about meditation?
[AI] Sure! Meditation is a practice of focusing on the present moment. It can help increase awareness of yourself and the world around you, as well as reduce stress and anxiety. There are many different types of meditation, so it's important to find the one that works best for you.
[human] what is a very simple way of meditation?
[AI] One of the simplest forms of meditation is called "mindful breathing." To practice mindful breathing, find a comfortable position and focus on your breath. Notice the sensation of your breath as it moves in and out of your body. Whenever your mind wanders, simply bring your focus back to your breath.
[human] that sounds good I think I am going to go practice it now
[AI] That's great! I hope you find your meditation practice helpful. Let me know if there's anything else I can do for you!
[human] I will thank you
[AI] You're welcome! Have a great day.
[human] goodbye
[AI] Goodbye! Have a wonderful day!

Conclusion

An alternative impression of the AI assistant Susan generated by DALL-E

Given such a naturalistic and (on the AI’s part) informed conversation, it is easy to anthropomorphize an AI model such as GPT-3. This is, I suspect, because parts of our own brains do not know about AI or virtual beings: when we see or hear a cogent agent, we tend, on some emotional level, to assume that we are talking with a being like ourselves. (Perhaps in the same fashion that my cat may well believe I am a large and clumsy cat.) In reality, we are interacting with a language model that, given a context, accurately predicts the most likely continuation of that context, based on a generalization of the vast corpus of human language on which it has been trained. It speaks or writes compellingly, yet it has no understanding of what it says, nor can it reason about it.

One way to understand this is that, as human beings, we use language to express our thought processes, our knowledge of ourselves and one another, and our consciousness. Hence, human language embeds much of our sentience. When a system such as GPT-3 becomes adept at modelling human language, it also models human consciousness and thought processes. This is probably one of the more powerful aspects of these models, and also a potentially confusing one.

karen c fisher

technology, art, and other distractions
