Oxford University’s lip-reading AI is more accurate than humans, but still has a way to go
A new paper (pdf) from the
University of Oxford (with funding from Alphabet’s DeepMind) details an
artificial intelligence system, called LipNet, that watches video of a person
speaking and matches text to the movement of their mouth with 93.4% accuracy.
“The
technology has such obvious utility, though, that it seems inevitable to be
built,” Clark writes. Teaching AI to read
lips is a base skill that can be applied to countless situations. A similar
system could be used to help the hearing-impaired understand conversations
around them, or to augment other forms of AI that listen to a video's sound and
rapidly generate accurate captions.
We proposed LipNet, the first model to apply deep
learning for end-to-end learning of a model that maps sequences of image frames
of a speaker’s mouth to entire sentences. The end-to-end model eliminates the
need to segment videos into words before predicting a sentence. LipNet requires
neither hand-engineered spatiotemporal visual features nor a separately-trained
sequence model.
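To make the end-to-end idea concrete, here is a minimal sketch of a LipNet-style pipeline in PyTorch: spatiotemporal (3D) convolutions over the mouth-region frames, a bidirectional GRU for temporal aggregation, and a CTC loss so the network can emit whole sentences without any word-level segmentation. This is not the authors' released code, and the paper's exact architecture differs; layer sizes, vocabulary, and input dimensions here are illustrative assumptions.

```python
# Illustrative LipNet-style model (not the authors' code): 3D convolutions
# extract spatiotemporal features, a bidirectional GRU aggregates them over
# time, and CTC training removes the need for word-level segmentation.
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, vocab_size=28):  # assumed: 26 letters + space + CTC blank
        super().__init__()
        # Spatiotemporal feature extraction over (batch, channels, time, H, W)
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.AdaptiveAvgPool3d((None, 4, 8)),  # fix spatial size, keep the time axis
        )
        # Temporal aggregation of per-frame features
        self.gru = nn.GRU(input_size=64 * 4 * 8, hidden_size=256,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 256, vocab_size)

    def forward(self, frames):
        # frames: (batch, 3, time, height, width) mouth-crop video
        x = self.conv(frames)                               # (batch, 64, time, 4, 8)
        x = x.permute(0, 2, 1, 3, 4).flatten(start_dim=2)   # (batch, time, 2048)
        x, _ = self.gru(x)                                  # (batch, time, 512)
        # CTC expects (time, batch, vocab) log-probabilities
        return self.fc(x).permute(1, 0, 2).log_softmax(dim=-1)

# Dummy training step on two 75-frame clips (GRID clips are 3 s at 25 fps)
model, ctc = LipReader(), nn.CTCLoss(blank=0)
log_probs = model(torch.randn(2, 3, 75, 64, 128))
targets = torch.randint(1, 28, (2, 30))   # dummy character-label sequences
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 75),
           target_lengths=torch.full((2,), 30))
loss.backward()
```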
Our empirical evaluation illustrates the importance of
spatiotemporal feature extraction and efficient temporal aggregation,
confirming the intuition of Easton & Basala (1982). Furthermore, LipNet greatly
outperforms a human lipreading baseline, exhibiting 7.2× better performance, and
6.6% WER, 3× lower than the word-level state-of-the-art (Wand et al., 2016) on
the GRID dataset.
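For readers who want to sanity-check how those numbers fit together, the quick arithmetic below treats the quoted 93.4% accuracy as its complementary word error rate (WER) and works out what the 7.2× and 3× ratios imply for the human and prior state-of-the-art baselines. The derived baseline figures are inferences from the quoted ratios, not numbers taken directly from the paper.

```python
# Back-of-the-envelope check of the figures quoted above. The baseline WERs
# below are inferred from the quoted ratios, not copied from the paper.
lipnet_accuracy = 0.934
lipnet_wer = 1.0 - lipnet_accuracy       # ~6.6% word error rate
implied_human_wer = 7.2 * lipnet_wer     # "7.2x better" -> ~47.5% human WER
implied_sota_wer = 3.0 * lipnet_wer      # "3x lower" -> ~19.8% prior state of the art
print(f"LipNet WER: {lipnet_wer:.1%}")
print(f"Implied human baseline WER: {implied_human_wer:.1%}")
print(f"Implied prior state-of-the-art WER: {implied_sota_wer:.1%}")
```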
While LipNet is already an empirical success, the deep
speech recognition literature (Amodei et al., 2015) suggests that performance
will only improve with more data. In future work, we hope to demonstrate this
by applying LipNet to larger datasets, such as a sentence-level variant of that
collected by Chung & Zisserman (2016a).
Some applications, such as silent dictation, demand the
use of video only. However, to extend the range of potential applications of
LipNet, we aim to apply this approach to a jointly trained audio-visual
speech recognition model, where visual input
assists with robustness in noisy environments.
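As a rough illustration of that direction, and only a sketch since the paper does not specify such a model, the snippet below encodes audio and visual feature streams separately, fuses them per time step, and decodes jointly, so the visual stream can compensate when the audio is noisy. The encoders, dimensions, and fusion scheme are placeholder assumptions.

```python
# Hedged sketch of a jointly trained audio-visual recognizer: separate
# encoders per modality, per-timestep late fusion, joint decoding.
# All layer choices and sizes are placeholders, not the paper's design.
import torch
import torch.nn as nn

class AudioVisualRecognizer(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=512, hidden=256, vocab_size=28):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True, bidirectional=True)
        self.visual_enc = nn.GRU(visual_dim, hidden, batch_first=True, bidirectional=True)
        self.decoder = nn.GRU(4 * hidden, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, vocab_size)

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, time, audio_dim), e.g. filterbank frames
        # visual_feats: (batch, time, visual_dim), e.g. mouth-crop CNN features
        # Both streams are assumed to be time-aligned to the same frame rate.
        a, _ = self.audio_enc(audio_feats)
        v, _ = self.visual_enc(visual_feats)
        fused, _ = self.decoder(torch.cat([a, v], dim=-1))   # late fusion
        return self.fc(fused).log_softmax(dim=-1)            # per-frame character logits

# Example with dummy, time-aligned streams (75 frames each)
model = AudioVisualRecognizer()
log_probs = model(torch.randn(2, 75, 40), torch.randn(2, 75, 512))
print(log_probs.shape)  # torch.Size([2, 75, 28])
```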
Regards
Pralhad Jadhav
Senior Manager @ Library
Khaitan & Co