Trainable frontend for robust and far-field keyword spotting
Yuxuan Wang, Pascal Getreuer, Thad Hughes, Richard F. Lyon, Rif A. Saurous, “Trainable frontend for robust and far-field keyword spotting.” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5670–5674, 2017.
Article permalink: https://doi.org/10.1109/ICASSP.2017.7953242
@inproceedings{wang2017trainable,
title={Trainable frontend for robust and far-field keyword spotting},
author={Wang, Yuxuan and Getreuer, Pascal and Hughes, Thad and Lyon, Richard F and Saurous, Rif A},
booktitle={2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={5670--5674},
year={2017},
organization={IEEE}
}
Abstract
Robust and far-field speech recognition is critical to enable true hands-free communication. In far-field conditions, signals are attenuated due to distance. To improve robustness to loudness variation, we introduce a novel frontend called per-channel energy normalization (PCEN). The key ingredient of PCEN is the use of an automatic gain control based dynamic compression to replace the widely used static (such as log or root) compression. We evaluate PCEN on the keyword spotting task. On our large rerecorded noisy and far-field eval sets, we show that PCEN significantly improves recognition performance. Furthermore, we model PCEN as neural network layers and optimize high-dimensional PCEN parameters jointly with the keyword spotting acoustic model. The trained PCEN frontend demonstrates significant further improvements without increasing model complexity or inference-time cost.
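To make the contrast between static and AGC-based dynamic compression concrete, here is a minimal sketch of PCEN in plain Python. The abstract does not state the formula; the sketch assumes the commonly cited form, in which each channel's filterbank energy E(t, f) is divided by a first-order smoothed copy of itself, M(t, f) = (1 - s) M(t-1, f) + s E(t, f), and then root-compressed with an offset: (E / (eps + M)^alpha + delta)^r - delta^r. The function name `pcen`, the default parameter values, and the choice to initialise the smoother with the first frame are illustrative assumptions, not values taken from this page.

```python
def pcen(energy, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Apply a PCEN-style transform to filterbank energies.

    energy: list of frames, each a list of non-negative per-channel energies.
    s:      smoothing coefficient of the first-order IIR filter (assumed default).
    alpha:  strength of the gain normalization (1.0 = full normalization).
    delta, r: offset and exponent of the root compression (assumed defaults).
    eps:    small constant that avoids division by zero.
    """
    if not energy:
        return []

    # Assumption: start the smoother from the first frame rather than from zero.
    smoothed = list(energy[0])
    out = []
    for frame in energy:
        # Per-channel first-order IIR smoothing: M = (1 - s) * M_prev + s * E.
        smoothed = [(1.0 - s) * m + s * e for m, e in zip(smoothed, frame)]
        # Automatic gain control (divide by smoothed energy), then root compression.
        out.append([
            (e / (eps + m) ** alpha + delta) ** r - delta ** r
            for e, m in zip(frame, smoothed)
        ])
    return out


if __name__ == "__main__":
    loud = [[1.0, 4.0], [2.0, 3.0], [0.5, 6.0]]
    quiet = [[0.01 * e for e in frame] for frame in loud]  # same signal, attenuated
    print(pcen(loud)[-1])
    print(pcen(quiet)[-1])
```

With alpha close to 1, the two printed frames should be nearly identical even though the second input is 100 times weaker, which is the loudness robustness the abstract describes. A static log compression would instead shift every value by a constant. In the trained variant mentioned in the abstract, parameters such as alpha, delta, r, and s would be learned per channel rather than fixed as they are here.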
©2017 by the Institute of Electrical and Electronics Engineers (IEEE). Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from IEEE.