On Personalizing “Hey Siri”

From Apple’s Machine Learning Journal, in a piece about what goes on behind the scenes on your devices when you say “Hey Siri”:

We designed the always-on “Hey Siri” detector to respond whenever anyone in the vicinity says the trigger phrase. To reduce the annoyance of false triggers, we invite the user to go through a short enrollment session. During enrollment, the user says five phrases that each begin with “Hey Siri.” We save these examples on the device.

We compare any possible new “Hey Siri” utterance with the stored examples as follows. The (second-pass) detector produces timing information that is used to convert the acoustic pattern into a fixed-length vector, by taking the average over the frames aligned to each state. A separate, specially trained DNN transforms this vector into a “speaker space” where, by design, patterns from the same speaker tend to be close, whereas patterns from different speakers tend to be further apart. We compare the distances to the reference patterns created during enrollment with another threshold to decide whether the sound that triggered the detector is likely to be “Hey Siri” spoken by the enrolled user.

This process not only reduces the probability that “Hey Siri” spoken by another person will trigger the iPhone, but also reduces the rate at which other, similar-sounding phrases trigger Siri.
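
To make the quoted description a bit more concrete, here is a rough sketch of the three steps it names: averaging the frames aligned to each state into a fixed-length vector, mapping that vector into a "speaker space," and comparing distances to the enrolled examples against a threshold. This is not Apple's implementation. Every function name here is hypothetical, the `to_speaker_space` step is a simple normalization standing in for the specially trained DNN, and taking the mean of the distances is my own assumption, since the article does not say how the distances are combined.

```python
import math


def state_averaged_vector(frames, alignment, num_states):
    """Average the frames aligned to each state, then concatenate the averages.

    frames:    list of equal-length feature lists, one per audio frame
    alignment: state index for each frame (the detector's timing information)
    """
    dim = len(frames[0])
    sums = [[0.0] * dim for _ in range(num_states)]
    counts = [0] * num_states
    for frame, state in zip(frames, alignment):
        counts[state] += 1
        for i, value in enumerate(frame):
            sums[state][i] += value
    vector = []
    for state in range(num_states):
        n = max(counts[state], 1)  # a state with no frames contributes zeros
        vector.extend(total / n for total in sums[state])
    return vector


def to_speaker_space(vector):
    """Placeholder for the trained DNN (assumption): plain L2 normalization."""
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


def is_enrolled_speaker(candidate, references, threshold):
    """Accept when the mean distance to the enrolled references is under the threshold.

    Using the mean is an assumption made for this sketch.
    """
    distances = [math.dist(candidate, ref) for ref in references]
    return sum(distances) / len(distances) < threshold
```

In this sketch, the five enrollment phrases would each be run through `state_averaged_vector` and `to_speaker_space` once and stored as `references`; every later utterance would get the same treatment before being passed to `is_enrolled_speaker`.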

I found this whole thing very interesting, even though I am not experienced in the ways of machine learning. It was particularly interesting because of something that happened last week: my wife and I were sitting on the couch, and I used “Hey Siri” for something. Out of curiosity, I checked to see whether it had triggered hers, and it had not. With my iPhone, iPad, and Apple Watch at the ready, I had her try to trigger my devices multiple times, with no success.

It’s neat to see what goes into helping Siri reduce the chances of these false activations. Granted, my wife is a woman with a very slight Mexican accent, so I imagine the chance of a false activation would be higher with another male speaker. Still, the fact that the device can store the enrollment examples and use them to cut down on false triggers is really cool.
