Workshop #01: Giving voice to our NLU

At Golem.ai, we offer an analytical AI solution with InboxCare, but we are also a company of AI experts and enthusiasts. Our Tech team therefore runs playful workshops to explore new uses of our NLU, which we share in this article. This workshop focused on Speech-to-Text (STT) and Text-to-Speech (TTS). These technologies, which convert voice into text and text into voice respectively, are already ubiquitous in our daily lives (Google Assistant, Siri, Alexa, etc.), but once connected to an NLU, the use cases multiply. Discover in this article four innovative projects combining STT, TTS, IVR and NLU.

1. ChatGPT and Dungeons & Dragons: An Immersive Experience with STT/TTS

By Emmanuel and Justin

The combined use of STT and TTS enabled an enriched role-playing experience for Dungeons & Dragons fans. In a previous workshop, an application for playing Dungeons & Dragons with several LLMs had been developed. By turning textual interactions with ChatGPT into voice dialogues, the experience becomes more immersive and accessible. Players can speak directly to the AI, no longer need to read large amounts of text (the replies are read aloud with good intonation), and can even subtly influence its decisions, enriching the gameplay.
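
As an illustration, here is a minimal sketch of such a voice loop built on the OpenAI Python SDK, assuming Whisper for transcription, a chat model acting as dungeon master, and the TTS endpoint for the spoken reply. The model names, file paths and system prompt are placeholders, not the exact setup used in the workshop.

```python
# Minimal STT -> LLM -> TTS loop (sketch, not the workshop's exact code).
# Assumes the official OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def dungeon_master_turn(audio_path: str, out_path: str = "dm_reply.mp3") -> str:
    # 1. STT: transcribe the player's spoken request.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f).text

    # 2. LLM: let ChatGPT answer as the dungeon master.
    chat = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "You are a Dungeons & Dragons dungeon master."},
            {"role": "user", "content": transcript},
        ],
    )
    reply = chat.choices[0].message.content

    # 3. TTS: read the answer back aloud with a synthetic voice.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
    speech.write_to_file(out_path)
    return reply
```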

Conclusion: The integration of STT/TTS into role-playing games offers a more fluid and engaging experience (customization of the requests made to the AI, modification of context and characters, etc.), although steering this AI remains a challenge due to the nature of LLMs.

2. Drone Voice Control: STT and NLU in action

By Anne-Sophie and Amandine

This project allows a drone to be controlled via voice commands. The STT translates speech into textual commands, and the Natural Language Understanding (NLU) interprets these commands to fly the drone. Although the system works, the STT could fail to transcribe commands correctly, as these are short sentences with little context available for correction.
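
To give an idea of the pipeline, here is a minimal sketch assuming the djitellopy library for a DJI Tello drone, with a tiny keyword matcher standing in for the NLU; the workshop used our own NLU rather than this simplistic matching, and the commands shown are illustrative.

```python
# Voice-to-drone sketch: STT transcript -> intent -> drone action.
# Assumes djitellopy (DJI Tello); the keyword matcher stands in for the NLU.
from djitellopy import Tello

# Very small "NLU": map keywords found in the transcript to drone actions.
COMMANDS = {
    "take off": lambda d: d.takeoff(),
    "land": lambda d: d.land(),
    "forward": lambda d: d.move_forward(50),          # distance in cm
    "turn right": lambda d: d.rotate_clockwise(90),   # angle in degrees
}

def execute_transcript(drone: Tello, transcript: str) -> bool:
    """Run the first action whose keyword appears in the STT transcript."""
    text = transcript.lower()
    for keyword, action in COMMANDS.items():
        if keyword in text:
            action(drone)
            return True
    return False  # short, noisy transcripts often match nothing

drone = Tello()
drone.connect()
execute_transcript(drone, "please take off")  # transcript produced by the STT
```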

Conclusion: The use of STT and NLU for voice control is promising, but requires specialization, whether on the vocabulary or on the speaker's voice, to compensate for the lack of context.

3. Enhanced Interactive Voice Response (IVR) with STT and NLU

By William and Willem

Traditionally, IVRs are limited to keypad input. By replacing this with STT coupled with an NLU that understands the user's intent, the experience becomes more intuitive. However, configuring the NLU requires more upfront investment, which prevented us from producing a configuration covering everything we wanted.
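
To make the routing step concrete, here is a minimal sketch that assumes the caller's speech has already been transcribed by the STT and uses a small rule-based classifier as a stand-in for a real NLU configuration; the intent names and queues are invented for the example.

```python
# IVR routing sketch: STT transcript -> intent -> destination queue.
# The rule-based classifier stands in for a real NLU configuration;
# intents and queues are invented for the example.

INTENT_RULES = {
    "billing": ["invoice", "bill", "payment", "refund"],
    "tech_support": ["not working", "error", "crash", "bug"],
    "opening_hours": ["open", "hours", "schedule"],
}

ROUTES = {
    "billing": "queue_billing",
    "tech_support": "queue_support",
    "opening_hours": "queue_info",
}

def detect_intent(transcript: str) -> str:
    """Return the first intent whose keywords appear in the transcript."""
    text = transcript.lower()
    for intent, keywords in INTENT_RULES.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return "unknown"  # fall back to a human agent

def route_call(transcript: str) -> str:
    return ROUTES.get(detect_intent(transcript), "queue_agent")

print(route_call("I have a question about my last invoice"))  # -> queue_billing
```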

Conclusion: Modernizing IVRs with STT and NLU is a step towards more natural interaction, but requires careful NLU configuration.

Aside: The two previous projects both use STT to retrieve text from the user, then our NLU to analyze it. But they are quite different: voice control requires specializing the STT (on the voice, on the actions) to handle short commands effectively, yet the command itself is easy to analyze because the field of possible actions is well defined. Conversely, the IVR benefits from the STT as-is, but requires more time to configure the NLU to cover a wide domain. This balance must be taken into account to keep such tools relevant.

4. Virtual Assistant “Y”: A custom ChatGPT interface

By Vincent, Arthur and Kevin

This innovative project combines STT, ChatGPT, and TTS to create a virtual assistant based on the voice of a real person, whom we will call "Y". Using only 5 minutes of voice recordings for training, the generated voice is already recognizable, although perfectible. Moreover, video synthesis was used to animate the mouth of "Dark Y", giving an even more realistic effect.
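
As an illustration of the voice-cloning step, here is a minimal sketch assuming the open-source Coqui TTS library and its XTTS v2 model, which can clone a voice from a short reference recording; this is not necessarily the tool used in the workshop, and the file names are placeholders.

```python
# Voice-cloning TTS sketch (Coqui TTS / XTTS v2), assuming a short reference
# recording of the target speaker; file names are placeholders.
from TTS.api import TTS

# Load a multilingual model that supports zero-shot voice cloning.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Synthesize a reply in the cloned voice of "Y".
tts.tts_to_file(
    text="Welcome back, adventurer. What would you like to do today?",
    speaker_wav="y_reference_sample.wav",  # a few minutes of recordings of Y
    language="en",
    file_path="y_reply.wav",
)
```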

Conclusion: Customizing voice assistants with real voices is not only feasible but also quick to implement, paving the way for more personalized and engaging user interfaces. It also explains the surge in fake TV reports.

At the end of the day

The majority of these projects were completed in an afternoon, simply by connecting the APIs of the different tools together. The LLMs gave interesting and usable results very quickly, and STT and TTS opened up new possibilities in terms of user experience, possibilities already hinted at by voice assistants such as Siri, OK Google, Alexa or Cortana.

But in our case, these results must still be sharpened to be usable at scale; the apparent ease of implementation can be misleading. Added to this is the high cost of each STT or TTS call. Our NLU is an interesting way to amortize this cost: being a frugal AI, it is affordable at scale and produces structured, precise output. Connecting this NLU to the output of an STT therefore provides an accurate and effective understanding of the intent contained in a user's request.