Oxford University’s lip-reading AI is more accurate than humans, but still has a way to go
A new paper (pdf) from the
University of Oxford (with funding from Alphabet’s DeepMind) details an
artificial intelligence system, called LipNet, that watches video of a person
speaking and matches text to the movement of their mouth with 93.4% accuracy.
“The
technology has such obvious utility, though, that it seems inevitable to be
built,” Clark writes. Teaching AI to read
lips is a base skill that can be applied to countless situations. A similar
system could be used to help the hearing-impaired understand conversations
around them, or to augment other forms of AI that listen to a video's sound and
rapidly generate accurate captions.
We proposed LipNet, the first model to apply deep
learning for end-to-end learning of a model that maps sequences of image frames
of a speaker’s mouth to entire sentences. The end-to-end model eliminates the
need to segment videos into words before predicting a sentence. LipNet requires
neither hand-engineered spatiotemporal visual features nor a separately-trained
sequence model.
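To make the end-to-end idea concrete, here is a minimal sketch of a LipNet-style pipeline in PyTorch: spatiotemporal (3D) convolutions over the mouth-region frames, a bidirectional GRU for temporal aggregation, and a CTC loss so the network can emit whole sentences without any word-level segmentation. This is not the authors' released code, and the paper's exact architecture differs; layer sizes, vocabulary, and input dimensions here are illustrative assumptions.

```python
# Illustrative LipNet-style model (not the authors' code): 3D convolutions
# extract spatiotemporal features, a bidirectional GRU aggregates them over
# time, and CTC training removes the need for word-level segmentation.
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, vocab_size=28):  # assumed: 26 letters + space + CTC blank
        super().__init__()
        # Spatiotemporal feature extraction over (batch, channels, time, H, W)
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.AdaptiveAvgPool3d((None, 4, 8)),  # fix spatial size, keep the time axis
        )
        # Temporal aggregation of per-frame features
        self.gru = nn.GRU(input_size=64 * 4 * 8, hidden_size=256,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 256, vocab_size)

    def forward(self, frames):
        # frames: (batch, 3, time, height, width) mouth-crop video
        x = self.conv(frames)                               # (batch, 64, time, 4, 8)
        x = x.permute(0, 2, 1, 3, 4).flatten(start_dim=2)   # (batch, time, 2048)
        x, _ = self.gru(x)                                  # (batch, time, 512)
        # CTC expects (time, batch, vocab) log-probabilities
        return self.fc(x).permute(1, 0, 2).log_softmax(dim=-1)

# Dummy training step on two 75-frame clips (GRID clips are 3 s at 25 fps)
model, ctc = LipReader(), nn.CTCLoss(blank=0)
log_probs = model(torch.randn(2, 3, 75, 64, 128))
targets = torch.randint(1, 28, (2, 30))   # dummy character-label sequences
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 75),
           target_lengths=torch.full((2,), 30))
loss.backward()
```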
Our empirical evaluation illustrates the importance of
spatiotemporal feature extraction and efficient temporal aggregation,
confirming the intuition of Easton & Basala (1982). Furthermore, LipNet greatly
outperforms a human lipreading baseline, exhibiting 7.2× better performance, and
6.6% WER, 3× lower than the word-level state-of-the-art (Wand et al., 2016) on
the GRID dataset.
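For readers who want to sanity-check how those numbers fit together, the quick arithmetic below treats the quoted 93.4% accuracy as its complementary word error rate (WER) and works out what the 7.2× and 3× ratios imply for the human and prior state-of-the-art baselines. The derived baseline figures are inferences from the quoted ratios, not numbers taken directly from the paper.

```python
# Back-of-the-envelope check of the figures quoted above. The baseline WERs
# below are inferred from the quoted ratios, not copied from the paper.
lipnet_accuracy = 0.934
lipnet_wer = 1.0 - lipnet_accuracy       # ~6.6% word error rate
implied_human_wer = 7.2 * lipnet_wer     # "7.2x better" -> ~47.5% human WER
implied_sota_wer = 3.0 * lipnet_wer      # "3x lower" -> ~19.8% prior state of the art
print(f"LipNet WER: {lipnet_wer:.1%}")
print(f"Implied human baseline WER: {implied_human_wer:.1%}")
print(f"Implied prior state-of-the-art WER: {implied_sota_wer:.1%}")
```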
While LipNet is already an empirical success, the deep
speech recognition literature (Amodei et al., 2015) suggests that performance
will only improve with more data. In future work, we hope to demonstrate this
by applying LipNet to larger datasets, such as a sentence-level variant of that
collected by Chung & Zisserman (2016a).
Some applications, such as silent dictation, demand the
use of video only. However, to extend the range of potential applications of
LipNet, we aim to apply this approach to a jointly trained audio-visual
speech recognition model, where visual input
assists with robustness in noisy environments.
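As a rough illustration of that direction, and only a sketch since the paper does not specify such a model, the snippet below encodes audio and visual feature streams separately, fuses them per time step, and decodes jointly, so the visual stream can compensate when the audio is noisy. The encoders, dimensions, and fusion scheme are placeholder assumptions.

```python
# Hedged sketch of a jointly trained audio-visual recognizer: separate
# encoders per modality, per-timestep late fusion, joint decoding.
# All layer choices and sizes are placeholders, not the paper's design.
import torch
import torch.nn as nn

class AudioVisualRecognizer(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=512, hidden=256, vocab_size=28):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True, bidirectional=True)
        self.visual_enc = nn.GRU(visual_dim, hidden, batch_first=True, bidirectional=True)
        self.decoder = nn.GRU(4 * hidden, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, vocab_size)

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, time, audio_dim), e.g. filterbank frames
        # visual_feats: (batch, time, visual_dim), e.g. mouth-crop CNN features
        # Both streams are assumed to be time-aligned to the same frame rate.
        a, _ = self.audio_enc(audio_feats)
        v, _ = self.visual_enc(visual_feats)
        fused, _ = self.decoder(torch.cat([a, v], dim=-1))   # late fusion
        return self.fc(fused).log_softmax(dim=-1)            # per-frame character logits

# Example with dummy, time-aligned streams (75 frames each)
model = AudioVisualRecognizer()
log_probs = model(torch.randn(2, 75, 40), torch.randn(2, 75, 512))
print(log_probs.shape)  # torch.Size([2, 75, 28])
```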
Regards
Pralhad Jadhav
Senior Manager @ Library
Khaitan & Co