This project was created circa June 2018. Alexa Presentation Language (APL) was released September 20th, 2018, and Google's "Look and Talk" feature was released in May 2022.
Summary-
This project is a visual voice assistant that uses creator-generated video and OpenCV (an open-source computer vision library) to create a more streamlined interaction with intelligent personal assistants by leveraging human eye contact to signify that the user is ready for a conversation. Most conversation experts would agree that body language is a large part of how we communicate, and I wanted to leverage this. By responding with video instead of just sound, creators of these experiences can use all of the techniques of body language, editing, and filmmaking to their advantage. This project was heavily influenced by the thinking at the Public Interactives Research Lab (PIRL) about what these devices could do in a public space, and by Associate Dean Dale MacDonald's vision of using a network of voice assistants for public signage and navigation.
Introduction-
My project was, in part, a reaction to the frustrations I had with the voice assistants in my home. They seemed very limited in their ability to communicate with me, especially when you consider that almost everyone is already accustomed to voice as a way of interfacing with the world. It's how you hail a taxi and how you check out at the grocery store or a restaurant. However, even these interactions, which used to be handled with a person-to-person voice exchange, are now being taken over by the devices in our pockets. At the same time that voice is being phased out in many facets of our lives, it has entered our homes. Unfortunately, I have been disappointed with the kind of conversations I've been having with these devices. Perhaps it's because I only have these real-world conversations (and Jarvis from Iron Man) to compare them to. I wanted to make these robotic conversations closer to the human conversations I was having in public spaces.
Existing assistants had a couple of problems that would prevent them from being a good match for a public space.
– The robotic voice carries little to no emotion. As with listening to a monotone lecturer, this makes it very hard to engage with the content being delivered. SSML tags can be added to the text-to-speech to add prosody, but they must be hand-coded in (a small example follows this list).
– Visually, there is nothing evocative about a puck. In a home environment this is perfectly fine, preferred even, that your voice controls are almost invisible, like magic. But in a public space this invisibility isn't good at garnering attention and will render the system useless for spatial interactions.
– The wake word. I find it cumbersome that every time I want to talk to one of these devices I have to say the wake word I learned when I got the device. Using a wake word between sentences prevents easy, flowing, continuous conversation, and it assumes that the user knows the name of the device, making it a bad fit for a public space.
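For context, here is a minimal illustration (not code from this project) of what hand-coding prosody into SSML looks like, wrapped in a Python string as it might be passed to a text-to-speech service. The tag values here are arbitrary examples.

# Hand-coded SSML sketch: <prosody> adjusts rate and pitch so the synthesized
# voice sounds less monotone. Exact attribute values are illustrative only.
ssml_response = """
<speak>
  Welcome!
  <prosody rate="95%" pitch="+2st">I'm really glad you stopped by.</prosody>
  <break time="300ms"/>
  <prosody rate="slow">Would you like a drink or a snack?</prosody>
</speak>
"""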
My system for a voice assistant in a public space uses creator-generated video displayed on a screen to give responses a human voice and a physical presence in the area it inhabits. Additionally, it uses a camera to give it computer vision. The camera stores no data and uses live face detection to tell when someone is facing the screen; this sends a signal to the rest of the system, which processes user speech into text. This face detection replicates the effect of making eye contact with someone to let them know you're talking to them.
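Below is a rough sketch, in Python, of how OpenCV face detection could serve as that eye-contact trigger. It is not the project's exact code: the trigger_listening() function and the five-frame threshold are placeholders for however the rest of the system is signaled.

# Sketch: treat a few consecutive frames of a frontal face as "eye contact"
# and fire a placeholder signal to the voice pipeline. No frames are saved.
import cv2

def trigger_listening():
    # Placeholder: in the real system this would notify the voice/video side.
    print("Face detected - start listening")

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
camera = cv2.VideoCapture(0)      # webcam feed, processed live only
consecutive_hits = 0

while True:
    ok, frame = camera.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    consecutive_hits = consecutive_hits + 1 if len(faces) > 0 else 0
    if consecutive_hits >= 5:     # several frames facing the screen = ready to talk
        trigger_listening()
        consecutive_hits = 0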
Components-
The video component of this project needs a couple of core elements to make this a working system (a sketch of how these clips might be sequenced follows the list).
Passive video- This needs to play when the voice assistant is not responding to anything and just waiting.
Listening video- A video that shows that the voice assistant is listening to the person interacting with it.
Introductory response – When someone uses your voice assistant for the first time, they won't know what they can ask it. This is the video clip that helps them interact with your voice assistant. Does your assistant have a name it can be called by? What is the scope of this assistant: does it just take orders, or can it do more? This video should help ease users into the experience. (This was added after user testing)
Default response- What the assistant responds with when it doesn’t know how to respond to a question or statement.
(Situational) Calling in for human help- Sometimes the voice assistant can’t help with everything and a human might need to step in to help. Programming in a notification system that contacts someone would help you resolve such problems. “Sorry, I’m not sure I can help you with that. Would you like me to get someone who can help?” Combining this with the default response might be best depending on the situation. No user tests have been conducted on this.
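The sketch below shows one way these clips could be organized in Python. The file names and the drink/snack intents are hypothetical and stand in for whatever content a creator films.

# Rough sketch of the clip library: one passive loop while idle, a listening
# clip once eye contact is made, and a response clip chosen from the speech.
VIDEO_CLIPS = {
    "passive": "passive_loop.mp4",        # idle, waiting for a user
    "listening": "listening.mp4",         # user has made eye contact
    "introduction": "introduction.mp4",   # first-time greeting, explains scope
    "default": "default_response.mp4",    # fallback when nothing matches
    "drink": "serve_drink.mp4",           # hypothetical vending intents
    "snack": "serve_snack.mp4",
}

def choose_response_clip(utterance, first_interaction=False):
    """Map recognized speech to one of the creator-generated clips."""
    if first_interaction:
        return VIDEO_CLIPS["introduction"]
    text = utterance.lower()
    if "drink" in text:
        return VIDEO_CLIPS["drink"]
    if "snack" in text:
        return VIDEO_CLIPS["snack"]
    return VIDEO_CLIPS["default"]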
The vision of the visual voice assistant shown in this project was limited to recognizing user eye contact, but with computer vision (CV) this can be expanded to other forms of gesture recognition, as well as using CV to read facial expressions and track eye position. These may be useful capabilities for a vendor or supplemental-instructor application. The following is a network diagram of all the processes going on in a vending machine application of this system.
Network-
Hardware-
Raspberry Pi – One for OpenCV activation
Raspberry Pi – One for the video response AI
Google AIY Voice Kit – This includes a microphone, Voice HAT, speaker, and a tactile button containing an LED
Webcam
Screen
Software-
This project was coded in Python (a sketch of how the two Raspberry Pis might hand off the face detection signal follows this list)
OpenCV for face detection
Google AIY voice library
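Since one Raspberry Pi handles OpenCV and the other handles the video response, they need some way to pass the "face detected" signal between them. The following is only a sketch of one possible approach, a plain TCP message over the local network; the address, port, and message format are placeholders, not the project's actual wiring.

# Sketch of inter-Pi signaling: the OpenCV Pi sends a one-line "FACE" message,
# and the video/voice Pi blocks until it arrives, then switches clips.
import socket

VIDEO_PI_ADDRESS = ("192.168.1.50", 5005)   # hypothetical address of the video Pi

def notify_video_pi():
    """Runs on the OpenCV Pi when a face is detected."""
    with socket.create_connection(VIDEO_PI_ADDRESS, timeout=2) as conn:
        conn.sendall(b"FACE\n")

def wait_for_face_signal(port=5005):
    """Runs on the video/voice Pi; blocks until the OpenCV Pi reports a face."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("", port))
    server.listen(1)
    conn, _ = server.accept()
    with conn:
        return conn.recv(16).strip() == b"FACE"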
User testing-
I realize that for user testing it is best to have a party separate from development do the testing, but unfortunately, I didn't have a usability team. It was just me and the business students I was working with for the Big Idea competition.
For the user testing of this project, I reserved the usability testing lab at ATEC. I had three random users, all students at UT Dallas. The participants were told to imagine that they were walking up to a conversational vending machine that could only sell "a drink or a snack"; then I would walk away and let them interact with the machine unimpeded. However, I had to step in after I realized that my interaction was not nearly as intuitive as I believed it to be. Users would ask how it was doing, what its favorite color was, and what it was doing, but did not ask whether they could get a drink or a snack, so the system was unprepared and spit out the default message, "I'm sorry, I don't understand that." The listening body language displayed by the system was not a strong enough signifier for people to initiate a conversation with the device. They needed to be greeted with an explanation of what they could do in this space, similar to going to McDonald's and being asked "How may I take your order?" to prompt an order. Even after revealing that there was an affordance for a voice interaction, users still found the interaction difficult. When I asked, "After knowing that you could talk to the vending machine, why didn't you ask it for a drink or a snack?" they responded, "I forgot." Users also told me they don't talk to vending machines, so this interaction was new to them.
The image above is a screenshot from the user testing footage taken that day. In the picture, circled in red, is the microphone the user is speaking into; circled in blue is the microphone the device is listening to. I had given the user a mic so that I could hear her in the testing room where this picture was taken. This showed me that I had to be more deliberate with props in future testing.
I decided to end the user tests early because, during testing with these three users, the system crashed repeatedly (within 5-10 minutes after every reset) and there were connection issues in the testing room that made the cloud-based voice recognition unreliable. Even though my team had prepared more polished video for the next round of user testing, the bug that caused the system to crash unexpectedly was never fixed, despite my efforts. There have been no further user tests on this system.
Accessibility-
One of the features I had talked to my team about was accessibility. One of the advantages of this multi-modal experience was that it engaged people with edited video and audio on a screen. As a result, this system could use an actor who speaks multiple languages, or signs, as well as use subtitles to annotate what was being said by the conversational agent. This could be a huge advantage over a human vendor who may only speak English. Even if this system did not have a full library for every language, it would be reassuring to tell users, in the language being directed at the machine, that the machine does not understand that language.
Shown above is some of the footage we had planned to input into the system; this included sign language and subtitles for everything the agent would be saying.
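As a sketch of how this could work in code, the snippet below swaps in a subtitled or signed clip per language and falls back to a short "sorry, I don't understand this language" clip delivered in the user's own language. All file names and language codes here are illustrative, not assets from the project.

# Hypothetical per-language clip lookup with a polite fallback notice.
RESPONSE_VARIANTS = {
    "en": "greeting_en_subtitled.mp4",
    "es": "greeting_es_subtitled.mp4",
    "asl": "greeting_asl_subtitled.mp4",   # signed variant of the same response
}

UNSUPPORTED_NOTICE = {
    # Short clips apologizing, in the user's language, that it isn't supported.
    "fr": "sorry_no_french.mp4",
    "de": "sorry_no_german.mp4",
}

def clip_for_language(language_code):
    """Return the clip to play for a given language, falling back gracefully."""
    if language_code in RESPONSE_VARIANTS:
        return RESPONSE_VARIANTS[language_code]
    return UNSUPPORTED_NOTICE.get(language_code, RESPONSE_VARIANTS["en"])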
Conclusion-
I had a vision for a visual voice assistant that might be able to take over jobs that were already being phased out by touch screens. This system would pair the human image with computer vision to create an easy interaction, familiar from the exchanges people already have with their local clerks and cashiers. However, after working diligently on this project for several months, I realized that much more work would need to go into it to make it functional and useful, whether that be coding, filming, editing, marketing, acting, engineering, or user testing. If I were to continue this project mostly alone, it would take years to complete, and the technology might even be outdated by the time it was finished. Additionally, similar commercial technology came out around the time I was working on this. The Amazon Echo Show paired with Alexa Presentation Language (APL) would let me create similar video interactions (minus the computer vision of my system) with very minimal bugs. This got me interested in what kind of interactions I could make for these devices for consumer use. For these reasons, this project has been put on hold for the foreseeable future.