Talking to Digital Assistants Is Weird. What Can We Do About It?

It would be interesting to see actual usage data on voice-activated systems and devices such as Apple’s Siri, Amazon’s Alexa, Google’s Assistant, and Microsoft’s Cortana. I suspect actual usage per user is very low.

From a user experience perspective, there are some fundamental issues with talking to a bot. First, it’s an awkward thing to do in public. Humans are social creatures. Walking around and seemingly talking to yourself has never been that cool.

Second, how do you talk to a bot, especially in public? Siri is not a real person (promise!) and you don't have to talk to her politely, yet a lot of people do: "Hey Siri, can you please tell me the weather forecast for tomorrow?" On the other hand, if you walked around commanding Siri like a dog, would that be socially acceptable behavior?

Third, it’s not just about appearing weird, it’s also about privacy. Unlike Facebook and Instagram, which are heavily curated and hands-off in the sense that the audience isn’t actually there, in real life people don’t want other people to know what they’re up to, what they're curious about, or what the weather is like in the Twin Cities.

This could also suggest why Alexa is perhaps the most useful, and possibly also the most used, of these assistants. First, although Amazon is desperately trying to find other settings for it, Echo devices are typically used in your home. As you’re not in a public setting, you might be a bit more relaxed about what you say and how you say it. Second, Alexa can meaningfully connect to some of the infrastructure in your home and bring it to life. I suspect that anyone who has an Echo device and any form of smart lights has hooked them up together and is using them as the killer demo for friends and family. How useful, convenient, or fast it really is, though, well. It’s a good demo.

The rise of the speaking assistants over the last 5-10 years is interesting from a lot of different perspectives. One of them is that it is something that has been driven mainly by technological advances, not by user needs. However, as users don't always know what they want to have in the future (and no, I'm not going to quote Henry Ford here), that's fine with me. What I'm a bit more curious about is the way the speaking assistants' designers imagine the actual interaction between the human and the system. 

The intended interaction is dialogue-based, an interaction style that research in Human-Computer Interaction (HCI) has shown to have a number of very clear disadvantages compared with some other interaction styles, including the lack of the visual exploration that direct-manipulation interfaces make possible. (It has a few advantages, too, many of which are unfortunately lost when you’re speaking your commands rather than typing them.)

The most obvious problem with a dialogue-based interface is that the capabilities of the system, i.e. what for instance Siri can actually do for you, are hidden from you as a user. You are the one who has to tell Siri what you want her to do for you. This is a great idea in theory, but how do you know what she actually can do? Like most users, you probably start guessing and probing the system with things you think it might know. Unfortunately, Siri often replies with “I didn’t quite catch that”, or she simply tells you that she can’t do that, or she has no idea what’s going on and just performs a web search for you, hoping she gets away with it.

If you instead have a visual interface, say a menu on a website, the capabilities the system holds — i.e. what the menu can help you do — are visible to the user. You can glance over them until you find what you are looking for and then tap or click on that.

With voice, things are different. One option is for the system to tell the user what choices are available, as typical phone menu systems do (e.g. “Press 1 to…”). This is very inefficient: you typically need to listen to all the choices before you pick, which makes glancing or quickly browsing more or less impossible. The other option, which is the one the digital assistants have chosen, is to give the user nothing and expect them to figure it out themselves. The idea here is that the digital assistant should be clever enough to make sense of the user’s commands, imperfect as they are from a system perspective, and act on them anyway. So far, this isn’t really happening.

It would be fascinating to study the history of the digital assistants and why they have such a strong following on the tech side of user experience, particularly in computer science. Apart from being an interesting computational challenge involving natural language processing and machine learning, I think Star Trek, 2001: A Space Odyssey, and Dick Tracy might be worth investigating.

So, what can be done to make them more usable?

Until it’s been released, it’s of course difficult to say what Samsung’s new digital assistant Bixby really is. Hopefully, though, it will be different from Apple’s Siri, Google Assistant, Amazon’s Alexa, and Microsoft’s Cortana, so that at the very least they are not all trying to do the same thing.

In my view, one way in which voice could actually be useful is for what we can call hand, eye, and voice coordination. The idea here would be that you could operate your smartphone, tablet, or PC by grabbing on to some on-screen object using touch or the mouse pointer and then simultaneously tell the system to perform an operation on the thing or things that you’ve selected. For instance, you could tap and hold a picture in a gallery and tell the system to rotate it. Another thing voice could be useful for is to bring to the screen objects that aren’t there: for instance, in a vector graphics application you could say “add circle”.
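
To make the idea a bit more concrete, here is a minimal sketch in Python of how touch and voice could be combined. Everything in it is made up for illustration: the names `Item`, `Screen`, `touch_hold`, and `hear` are hypothetical, speech is assumed to have already been transcribed into text, and "rotate" is assumed to mean 90 degrees clockwise. It is not how any existing assistant works, just one way the division of labor could look: touch says which thing, voice says what to do with it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Item:
    """An on-screen object, e.g. a picture in a gallery or a shape on a canvas."""
    kind: str
    rotation: int = 0  # degrees clockwise


@dataclass
class Screen:
    """Hypothetical screen model that combines touch and voice input."""
    items: list[Item] = field(default_factory=list)
    held: list[Item] = field(default_factory=list)  # what the finger or pointer holds

    def touch_hold(self, *items: Item) -> None:
        """Touch (or the mouse pointer) supplies the 'which thing'."""
        self.held = list(items)

    def release(self) -> None:
        """Letting go clears the selection."""
        self.held = []

    def hear(self, utterance: str) -> str:
        """Voice supplies the 'do what'. Expects already-transcribed text."""
        words = utterance.lower().split()
        if words[:1] == ["rotate"]:
            # Operates on whatever is currently being held.
            if not self.held:
                return "Nothing selected to rotate."
            for item in self.held:
                item.rotation = (item.rotation + 90) % 360
            return f"Rotated {len(self.held)} item(s)."
        if len(words) == 2 and words[0] == "add":
            # Brings a new object to the screen; no selection needed.
            self.items.append(Item(kind=words[1]))
            return f"Added {words[1]}."
        return "Unknown command."


if __name__ == "__main__":
    screen = Screen()
    photo = Item("picture")
    screen.items.append(photo)

    screen.touch_hold(photo)          # tap and hold a picture in the gallery...
    print(screen.hear("rotate"))      # ...and tell the system to rotate it
    screen.release()
    print(screen.hear("add circle"))  # bring a new object to the screen
```

Note that the spoken vocabulary can stay very small here, because the selection carries most of the meaning. That is quite different from asking the assistant to guess both the object and the operation from speech alone.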

This could be a very powerful way to make more complex interaction possible on smartphones without resorting to interface ideas such as contextual menus that don’t really work well in non-workstation settings. I don’t know if this is the way Samsung’s Bixby will operate, but I hope so. In a research project together with ABB, we have explored a very similar notion in some depth by combining a visual interface with eye tracking and gesture recognition. This was intended for an industry control room, which is typically a very complex environment when it comes to interaction. In our prototype, the user is able to gaze at an object visible on one of several screens and, while keeping their gaze on that object, manipulate it in various ways using gestures. The video is below. Read more about this project here:


Update (March 29): The S8 has now been released and Samsung thinks of Bixby as "a bright sidekick." There's also a dedicated hardware button for it, indicating that Samsung believes in it. It will be very interesting to see what it can do and how it differs from the other voice-driven assistants.