
Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces (1805.10190v3)

Published 25 May 2018 in cs.CL and cs.NE

Abstract: This paper presents the machine learning architecture of the Snips Voice Platform, a software solution to perform Spoken Language Understanding on microprocessors typical of IoT devices. The embedded inference is fast and accurate while enforcing privacy by design, as no personal user data is ever collected. Focusing on Automatic Speech Recognition and Natural Language Understanding, we detail our approach to training high-performance Machine Learning models that are small enough to run in real-time on small devices. Additionally, we describe a data generation procedure that provides sufficient, high-quality training data without compromising user privacy.

Citations (795)

Summary

  • The paper demonstrates an embedded SLU system that delivers efficient ASR and NLU performance on microprocessors while safeguarding user privacy.
  • It employs a hybrid NN/HMM acoustic model and dynamic language model composition to optimize speed, memory, and accuracy for edge devices.
  • Performance evaluations on SmartLights and Weather models validate its robust generalization and on-device personalization capabilities.

An In-depth Analysis of the Snips Voice Platform for Embedded Spoken Language Understanding

The paper "Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces" presents a comprehensive analysis of the Snips Voice Platform, an advanced software solution crafted to deliver Spoken Language Understanding (SLU) on microprocessors integral to IoT devices. The framework is meticulously designed to emphasize privacy by design, ensuring that no personal user data is collected, a crucial feature differentiating it from typical cloud-dependent voice interfaces. This essay explores the intricate details of the paper, focusing on the technical aspects of the system's architecture, the performance metrics, and the broader implications of the research.

Technical Architecture of the Snips Voice Platform

The Snips Voice Platform aims to deliver high-performance SLU with minimal computational resources, making it suitable for embedded systems. The architecture leverages both Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) engines.

Acoustic Model Training

The acoustic model is fundamental to the ASR component: it maps frames of audio features to probabilities over phonetic units, which the decoder then turns into text. The model employs a hybrid of Neural Networks and Hidden Markov Models (NN/HMM). Training the acoustic model requires several hundred to thousands of hours of audio data with corresponding transcripts. An essential aspect of this training is data augmentation to simulate real-world noisy conditions, an indispensable step given that the system collects no user data for privacy reasons.
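A common form of such augmentation is mixing recorded noise into clean training audio at a controlled signal-to-noise ratio. The sketch below is a minimal, pure-Python illustration of this idea (the paper does not specify its exact augmentation recipe); the 16 kHz sample rate, tone, and Gaussian noise are assumptions for the example.

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Mix a noise signal into a speech signal at a target SNR (in dB).

    `speech` and `noise` are equal-length lists of float samples; the
    noise is rescaled so that the resulting signal-to-noise ratio
    matches `snr_db`, simulating noisy far-field conditions.
    """
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Gain g chosen so that p_speech / (g^2 * p_noise) == 10^(snr_db/10).
    g = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + g * n for s, n in zip(speech, noise)]

# Example: corrupt a synthetic 440 Hz tone with white noise at 10 dB SNR.
random.seed(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [random.gauss(0, 0.1) for _ in range(16000)]
augmented = mix_at_snr(speech, noise, snr_db=10.0)
```

In practice this would be applied to real noise corpora and combined with reverberation, but the gain computation is the core of SNR-controlled mixing.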

Model architectures vary, but the paper predominantly discusses a 7-layer network adapted to run on a Raspberry Pi 3, combining TDNN and LSTMP layers to strike a careful balance between model size, speed, and accuracy. The final model, nnet-256, offers an impressive trade-off, decoding faster than real time on a Raspberry Pi 3 with a substantially reduced memory footprint.
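A key property of stacked TDNN layers is that their temporal context accumulates: each layer splices a few frame offsets, and the total window a frame sees grows layer by layer. The helper below computes that total context; the splicing offsets shown are illustrative, not the paper's actual configuration.

```python
def tdnn_context(layer_offsets):
    """Total left/right temporal context of stacked TDNN layers.

    Each layer is described by the list of frame offsets it splices
    (e.g. [-1, 0, 1]); contexts add up across layers because every
    layer widens the input window its successor effectively sees.
    """
    left = sum(min(offs) for offs in layer_offsets)
    right = sum(max(offs) for offs in layer_offsets)
    return left, right

# Illustrative splicing scheme (assumed, not from the paper):
offsets = [[-2, -1, 0, 1, 2], [-1, 0, 1], [-3, 0, 3], [-3, 0, 3]]
print(tdnn_context(offsets))  # (-9, 9): each output frame sees +/-9 input frames
```

Wider context improves accuracy but increases latency and computation, which is exactly the trade-off an embedded model like nnet-256 must manage.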

Language Modeling and Natural Language Understanding

The SLU pipeline's language modeling component is pivotal, constraining the acoustic model's outputs to word sequences the assistant is likely to hear. This is achieved through a compositional approach using weighted Finite State Transducers (wFSTs). The language model (LM) and NLU must be mutually consistent, which is ensured by training both components on the same dataset, normalized through a common preprocessing pipeline.
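The building block of the wFST approach is composition: chaining one transducer's outputs into another's inputs while adding weights (in the tropical semiring, weights are costs and addition along a path is the semiring product). The toy implementation below sketches this operation on dict-encoded transducers; production systems use a library such as OpenFst, and the labels and weights here are invented for illustration.

```python
from collections import deque

def compose(a, b):
    """Compose two weighted transducers in the tropical semiring.

    Each transducer is a dict: state -> list of (in_label, out_label,
    weight, next_state), with state 0 as the start. A transition of the
    composition exists where an output label of `a` matches an input
    label of `b`; path weights add. Result states are (state_a, state_b)
    pairs, explored breadth-first from (0, 0).
    """
    start = (0, 0)
    result, seen, queue = {}, {start}, deque([start])
    while queue:
        qa, qb = queue.popleft()
        arcs = []
        for (ia, oa, wa, na) in a.get(qa, []):
            for (ib, ob, wb, nb) in b.get(qb, []):
                if oa == ib:  # labels must match to chain a into b
                    nxt = (na, nb)
                    arcs.append((ia, ob, wa + wb, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
        result[(qa, qb)] = arcs
    return result

# Toy example: `a` maps phones to words, `b` maps words to slot labels.
a = {0: [("t", "turn", 5, 1)], 1: [("on", "on", 2, 2)]}
b = {0: [("turn", "ACTION", 1, 1)], 1: [("on", "STATE", 3, 2)]}
c = compose(a, b)
```

Full ASR decoding graphs are built by composing several such transducers (HMM topology, context, lexicon, grammar), which is why how and when composition happens matters so much for memory, as the dynamic LM section below discusses.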

Training Dataset and Normalization

The training dataset consists of written queries covering the supported intents and their entities. Consistent normalization and tokenization between training and embedded inference are vital. The LM is built with a class-based approach: entity values in the queries are replaced by class placeholders so that the model generalizes across interchangeable values.
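The class-based substitution step can be sketched in a few lines. This is a simplified, single-token version (real entity values can span multiple tokens, and the placeholder syntax here is an assumption):

```python
def abstract_entities(query, entities):
    """Replace entity values in a training query with class placeholders.

    `entities` maps a class name to the set of literal values it covers,
    so "turn on the kitchen lights" becomes "turn on the <room> lights",
    letting one n-gram pattern generalize over every interchangeable
    value of the class. Only single-token values are handled here.
    """
    out = []
    for tok in query.lower().split():
        placeholder = next(
            (name for name, values in entities.items() if tok in values),
            None,
        )
        out.append(f"<{placeholder}>" if placeholder else tok)
    return " ".join(out)

entities = {"room": {"kitchen", "bedroom", "garage"}}
print(abstract_entities("Turn on the kitchen lights", entities))
# turn on the <room> lights
```

At decode time each placeholder is expanded back into its value list, which is what makes the on-device entity injection described below cheap.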

Dynamic Language Model and On-Device Personalization

An innovative aspect of the Snips Voice Platform is the use of lazy composition to build the language model dynamically, significantly reducing memory usage and speeding up graph construction. Additionally, on-device personalization through entity injection adapts the vocabulary to user-specific values in real time without any cloud interaction.
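Because the class placeholders are only expanded when needed, injecting a new user-specific value amounts to extending a class's value list rather than rebuilding the whole LM. The sketch below illustrates that idea with a plain Python container; the class names and API are hypothetical, not the platform's actual interface.

```python
class ClassVocabulary:
    """Per-class value lists, expanded lazily at decode time.

    Since class placeholders are composed with their value lists only
    when needed, new user-specific values (contact names, playlist
    titles, ...) can be injected on-device without rebuilding the
    language model, and without the values ever leaving the device.
    """

    def __init__(self, classes):
        self.classes = {name: set(values) for name, values in classes.items()}

    def inject(self, name, values):
        # Runtime personalization: extend a class in place.
        self.classes.setdefault(name, set()).update(values)

    def expand(self, name):
        # Deterministic order for reproducible graph construction.
        return sorted(self.classes.get(name, ()))

vocab = ClassVocabulary({"artist": {"daft punk", "bjork"}})
vocab.inject("artist", {"my garage band"})  # user-specific, stays on-device
print(vocab.expand("artist"))  # ['bjork', 'daft punk', 'my garage band']
```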

Confidence Scoring

To address the prevalent issue of out-of-vocabulary words in specialized SLU systems, the paper introduces a confidence scoring mechanism based on confusion networks. This mechanism efficiently identifies erroneous decodings, thus filtering out uncertain words to optimize NLU performance.
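The filtering step can be illustrated concretely. A confusion network is a sequence of slots, each holding alternative words with posterior probabilities; the sketch below keeps a slot's best word only when its posterior clears a threshold. The threshold value and the example network are invented for illustration.

```python
def filter_by_confidence(confusion_network, threshold=0.5):
    """Drop low-confidence words from a confusion network.

    Each slot is a list of (word, posterior) alternatives. The best
    word per slot is the highest-posterior entry; slots whose best
    posterior falls below `threshold` (often out-of-vocabulary audio
    forced onto in-vocabulary words) are discarded before the text
    reaches the NLU.
    """
    kept = []
    for slot in confusion_network:
        word, posterior = max(slot, key=lambda wp: wp[1])
        if posterior >= threshold:
            kept.append(word)
    return kept

cn = [
    [("turn", 0.92), ("learn", 0.08)],
    [("on", 0.97), ("in", 0.03)],
    [("the", 0.99)],
    [("glights", 0.41), ("lights", 0.38), ("flights", 0.21)],  # uncertain slot
]
print(filter_by_confidence(cn, threshold=0.5))  # ['turn', 'on', 'the']
```

Discarding the uncertain final slot prevents a misrecognized word from corrupting intent classification and slot filling downstream.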

Performance Evaluation and Results

The performance of the Snips Voice Platform is rigorously evaluated on two assistant models: SmartLights and Weather. The platform demonstrates exceptional generalization capabilities, maintaining high precision and recall rates even in unseen domains. The incorporation of confidence scoring bolsters the model's robustness against filler words, maintaining high end-to-end accuracy metrics. Notably, the LM's dynamic composition yields significant memory savings and execution speed improvements on small devices like the Raspberry Pi 3 and NXP imx7D.

Implications and Speculation on Future Developments

The implications of the research presented in this paper are extensive. From a practical standpoint, it underscores the feasibility of deploying high-performance SLU systems on embedded devices while maintaining stringent privacy standards. This aligns seamlessly with the increasing need for secure IoT applications, mitigating privacy concerns associated with cloud-dependent voice assistants.

From a theoretical perspective, the research underlines innovative approaches such as dynamic LM composition and data generation without user data. These methodologies can pave the way for developing scalable, privacy-first AI models suitable for other domains beyond voice recognition.

Conclusion

In conclusion, the Snips Voice Platform represents a significant stride in delivering secure and efficient SLU capabilities on embedded devices. The thorough exposition of the technical details, coupled with robust performance evaluations, provides a substantial foundation for future advancements in privacy-preserving voice interfaces. The platform's emphasis on privacy by design and on-device processing directly addresses contemporary data privacy concerns, marking a pivotal development in the field of AI and machine learning.