Creating a voice-based GPT assistant using ChatGPT, ElevenLabs, and Azure Speech Recognition

Over the weekend I started working on a simple PoC: can we use ChatGPT together with some speech recognition software to build a virtual assistant similar to Alexa or Siri? First off, we need a way for code to be triggered by certain keywords or patterns. This is where OpenAI functions come in: they allow us to parse content from a ChatGPT response and trigger a function, which can be anything (depending on our code).
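
To make the idea concrete, here is a minimal sketch of how a function can be described to the model so that it can ask for it to be called. It assumes the pre-1.0 `openai` Python package (`openai.ChatCompletion.create` with the `functions` parameter); the model name, schema, and example prompt are placeholders for illustration, not the exact contents of the repository.

```python
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"  # placeholder

# Describe the functions the model is allowed to request.
functions = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather for a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City name, e.g. Oslo",
                },
            },
            "required": ["location"],
        },
    }
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",  # placeholder model name
    messages=[{"role": "user", "content": "What is the weather in Oslo?"}],
    functions=functions,
    function_call="auto",  # let the model decide whether to call a function
)

message = response["choices"][0]["message"]
if message.get("function_call"):
    print("Model wants to call:", message["function_call"]["name"])
```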

I decided to see what was possible, and ended up with this: gpt-ai/ai.gpt at main · msandbu/gpt-ai (github.com)

Here’s a step-by-step summary of the example code.

  1. Initial Setup:
    • Necessary libraries for Azure, OpenAI, and ElevenLabs are imported.
    • API keys for ElevenLabs, OpenAI, and Azure are initialized.
    • The code also requires the MPV media player to be installed, which the ElevenLabs Python library relies on to stream audio playback.
  2. Azure Speech SDK Setup:
    • Azure’s Speech SDK is configured to recognize the user’s spoken commands (see the sketch after this list).
  3. Weather Function:
    • A helper function get_current_weather fetches the current weather for a given location using the OpenWeatherMap API.
  4. Function Handler:
    • A function handle_function_call interprets responses from OpenAI that request specific actions, such as fetching the weather, and maps function names to the corresponding implementations (both are sketched after this list).
  5. Chat Messages and Functions Setup:
    • messages stores the history of the conversation with the assistant.
    • functions defines the available functions that OpenAI’s model can call, including their parameters and descriptions (see the example schema above).
  6. Main Loop:
    • In a continuous loop, the program prompts the user to speak.
    • The spoken message is recognized using Azure’s SDK.
    • This message is sent to OpenAI, which can reply with either a direct response or a request to call a function (like fetching the weather).
    • If a function is requested, the function is executed, and the result is sent back to OpenAI for further response.
    • The final response from OpenAI (either a direct answer or one based on the function’s output) is then converted into speech using ElevenLabs.
    • The generated speech is streamed back to the user (the whole loop is sketched below).
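
To make the steps above more concrete, here are a few minimal sketches. First, the Azure Speech SDK setup for capturing a spoken command from the default microphone. It assumes the azure-cognitiveservices-speech package and uses a placeholder key and region.

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials; replace with your own key and region.
speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_AZURE_SPEECH_KEY",
    region="westeurope",
)
speech_config.speech_recognition_language = "en-US"

# Uses the default microphone as audio input.
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

print("Speak now...")
result = recognizer.recognize_once_async().get()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("You said:", result.text)
```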
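Next, the weather helper and the function handler could look roughly like this. The OpenWeatherMap endpoint, parameters, and the handler signature are assumptions about how the repository does it; only get_current_weather is wired up here.

```python
import json
import requests

OPENWEATHER_API_KEY = "YOUR_OPENWEATHERMAP_API_KEY"  # placeholder


def get_current_weather(location: str) -> str:
    """Fetch the current weather for a location from OpenWeatherMap."""
    resp = requests.get(
        "https://api.openweathermap.org/data/2.5/weather",
        params={"q": location, "appid": OPENWEATHER_API_KEY, "units": "metric"},
        timeout=10,
    )
    data = resp.json()
    description = data["weather"][0]["description"]
    temperature = data["main"]["temp"]
    return f"The weather in {location} is {description} at {temperature} °C."


def handle_function_call(function_name: str, arguments_json: str) -> str:
    """Map a function name requested by the model to a local implementation."""
    available_functions = {"get_current_weather": get_current_weather}
    arguments = json.loads(arguments_json or "{}")
    function = available_functions.get(function_name)
    if function is None:
        return f"Unknown function: {function_name}"
    return function(**arguments)
```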
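Finally, a sketch of the main loop tying the pieces together: recognize speech with Azure, send it to OpenAI, execute a requested function if there is one, and stream the reply through ElevenLabs. It assumes the objects defined in the sketches above and the older elevenlabs Python package (set_api_key, generate, and stream, where stream relies on MPV for playback); the voice name and model are placeholders, and the actual repository may differ in the details.

```python
from elevenlabs import generate, set_api_key, stream

set_api_key("YOUR_ELEVENLABS_API_KEY")  # placeholder

messages = [{"role": "system", "content": "You are a helpful voice assistant."}]

while True:
    print("Speak now...")
    result = recognizer.recognize_once_async().get()
    if result.reason != speechsdk.ResultReason.RecognizedSpeech:
        continue

    messages.append({"role": "user", "content": result.text})

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",  # placeholder model name
        messages=messages,
        functions=functions,
        function_call="auto",
    )
    message = response["choices"][0]["message"]

    # If the model asked for a function, run it and send the result back.
    if message.get("function_call"):
        name = message["function_call"]["name"]
        arguments = message["function_call"]["arguments"]
        function_result = handle_function_call(name, arguments)

        messages.append(message)
        messages.append({"role": "function", "name": name, "content": function_result})

        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-0613",
            messages=messages,
        )
        message = response["choices"][0]["message"]

    messages.append({"role": "assistant", "content": message["content"]})

    # Convert the final text reply to speech and stream it via MPV.
    audio_stream = generate(text=message["content"], voice="Bella", stream=True)
    stream(audio_stream)
```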

Here is a video showing the result. There is some silence, but that is because I muted my own voice during the recording, so the only audio you hear is from ElevenLabs.
