Rudimentary Chat Bot – OpenVoice, Whisper & LLM

Using Ollama and the dolphin-2.5-mixtral-8x7b model, a simple script is being constructed to test TTS and requests to the Ollama server. These are some reference notes.

  1. https://stackoverflow.com/questions/17657103/play-wav-file-in-python -- Playing Audio in Python
  2. https://moez-62905.medium.com/the-ultimate-guide-to-command-line-arguments-in-python-scripts-61c49c90e0b3#:~:text=In%20Python%2C%20command%2Dline%20arguments,arguments%20passed%20to%20the%20script. -- Using command line arguments within python script.
  3. https://ioflood.com/blog/python-run-shell-command/ -- Running shell command in python using subprocess.
  4. https://www.youtube.com/watch?v=1NonzlRr6JA -- Robot vocals

Okay, this was successful and resulted in a script called askbot.py, which is run from a shell script called askbot.sh.
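A minimal sketch of what such a script might look like, assuming the Ollama server is on its default port and aplay is available for playback (names and the model tag are illustrative; the real askbot.py may differ):

import argparse
import subprocess

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def ask_ollama(prompt, model="dolphin-mixtral"):
    """Send a prompt to the local Ollama server and return the full reply."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

def play_wav(path):
    """Play a WAV file through ALSA; aplay ships with most Linux distros."""
    subprocess.run(["aplay", path], check=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Ask the local LLM a question")
    parser.add_argument("prompt", help="question to send to the model")
    args = parser.parse_args()
    print(ask_ollama(args.prompt))
    # The TTS step is omitted here: OpenVoice renders the reply to a WAV,
    # which play_wav() then plays back.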

Recording Audio for an undetermined time

This is the next obvious step: recording audio for an undetermined amount of time, specifically until the user signals that input is finished, or preferably until silence is detected on the microphone. To that end, some research was conducted, resulting in the following reference:

  1. https://stackoverflow.com/questions/18406570/python-record-audio-on-detected-sound -- stackoverflow result for user defined recording interval

Okay, with a few minor changes (noise threshold set to 30 dB and the recording path changed to /home/sparkone/sdb/projects/python/OpenVoice/recordings), this script appears to be functioning quite nicely.
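A condensed sketch of the silence-threshold approach, assuming PyAudio and the standard-library audioop module (stdlib through Python 3.12); the raw RMS threshold here stands in for the dB threshold used in the actual script:

import audioop
import wave

import pyaudio

CHUNK, RATE = 1024, 16000
THRESHOLD = 500          # RMS amplitude treated as silence (tune per mic)
SILENT_CHUNKS = 30       # ~2 s of silence at these settings ends the take

def record_until_silence(path):
    """Wait for speech, record it, and stop after sustained silence."""
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                    input=True, frames_per_buffer=CHUNK)
    frames, silent, started = [], 0, False
    while not (started and silent >= SILENT_CHUNKS):
        data = stream.read(CHUNK)
        if audioop.rms(data, 2) >= THRESHOLD:   # loud enough to be speech
            started, silent = True, 0
        elif started:
            silent += 1
        if started:
            frames.append(data)
    stream.stop_stream()
    stream.close()
    p.terminate()
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)   # paInt16 -> 2 bytes per sample
        wf.setframerate(RATE)
        wf.writeframes(b"".join(frames))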

Passing audio files to whisper for speech-to-text transcription

With the audio recording functioning as desired, the next step is getting the recorded audio transcribed into text. Checking for a stop-, quit-, or goodbye-type keyword should suffice for ending the script.

For this task, we can use whisper, or faster-whisper to transcribe the audio offline.

  1. https://github.com/openai/whisper -- whisper project on github
  2. https://github.com/SYSTRAN/faster-whisper -- faster-whisper project on github
  3. https://github.com/huggingface/distil-whisper -- distil-whisper project on github

To facilitate development, a simple script that takes audio files as input will be used initially, then incorporated into the main project. So far we have askbot.py and record_3.py (the third attempt at a recording script).
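A minimal sketch of the transcription step with faster-whisper, including the stop-keyword check mentioned above (the model size, device, and file path are illustrative):

from faster_whisper import WhisperModel

STOP_WORDS = {"stop", "quit", "goodbye"}

# "base" is a small, fast checkpoint; larger ones trade speed for accuracy.
model = WhisperModel("base", device="cuda", compute_type="float16")

def transcribe(path):
    """Return the transcript of a recorded WAV file as one string."""
    segments, _info = model.transcribe(path)
    return " ".join(segment.text.strip() for segment in segments)

def should_stop(text):
    """True if the user said a stop/quit/goodbye style keyword."""
    return any(word in text.lower() for word in STOP_WORDS)

if __name__ == "__main__":
    text = transcribe("recordings/input.wav")
    print(text)
    print("stop requested:", should_stop(text))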

Update: this worked as a script. The final iteration is askbot_3.py.

NOTES: CUDA memory errors and segmentation faults are resulting from VRAM being allocated outside the reserved scopes. A potential solution is to delegate tasks to separate GPUs. Upgrading the GPU has also become important due to this and to latency issues. Ideally a Quadro P8000 would be desired, as it offers parallel memory sharing options in the Z-series architecture I am using; however, its cost is prohibitive, so a more economical option that does not offer parallel memory sharing may have to suffice.

NOTE: See Neural Network Research Post for specific details regarding PyTorch, Ollama and CUDA

Segmentation faults seem to be occurring less often after adjusting model parameters and introducing garbage collection to the script. Additional measures include deleting local variables after use and clearing the CUDA cache.
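The cleanup pattern is roughly the following, assuming PyTorch is managing the VRAM (the tensor here stands in for whatever model output is no longer needed):

import gc

import torch

# Allocate something large on the GPU, use it, then release the VRAM.
model_output = torch.zeros((4096, 4096), device="cuda")
# ... use model_output ...
del model_output            # drop the last reference to the tensor
gc.collect()                # collect anything kept alive by reference cycles
torch.cuda.empty_cache()    # return PyTorch's cached blocks to the driver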

It has become apparent that while the recorded audio is being processed, the microphone is still picking up voice audio; the recording stream needs to be closed before calling the Ollama query and playback function.
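With PyAudio that ordering looks roughly like this (the helper functions are hypothetical stand-ins for the pieces sketched earlier):

import pyaudio

def capture_then_respond():
    """One turn of the loop: record, close the mic, then answer."""
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                    input=True, frames_per_buffer=1024)
    wav_path = record_one_utterance(stream)    # hypothetical: as sketched above
    # Close the input stream BEFORE querying and playing back, so the
    # bot's own speech is not captured as the next question.
    stream.stop_stream()
    stream.close()
    p.terminate()
    answer = ask_ollama(transcribe(wav_path))  # hypothetical helpers from earlier sketches
    play_wav(synthesize_tts(answer))           # synthesize_tts: the OpenVoice step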

To work around repeated answers and to speed up response time, utilizing the JSON streaming approach may be a practical solution.

  1. https://github.com/ollama/ollama/blob/main/docs/api.md -- Ollama docs
  2. https://www.w3schools.com/python/python_json.asp#:~:text=Parse%20JSON%20%2D%20Convert%20from%20JSON,will%20be%20a%20Python%20dictionary. -- Convert JSON to Python Dictionary

Request:

curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'

Response (Stream of JSON objects):

{
  "model": "llama2",
  "created_at": "2023-08-04T08:52:19.385406455-07:00",
  "response": "The",
  "done": false
}

See the above docs page for more options, such as runtime parameters and showing a model list.
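A sketch of consuming that stream from Python, parsing each JSON line as it arrives so printing (or playback) can begin before the full answer is done:

import json

import requests

def stream_generate(prompt, model="llama2"):
    """Yield response fragments from Ollama's streaming /api/generate."""
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt},  # streaming is the default
        stream=True,
        timeout=300,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)  # one JSON object per line
            yield chunk["response"]
            if chunk.get("done"):
                break

for fragment in stream_generate("Why is the sky blue?"):
    print(fragment, end="", flush=True)
print()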

Using GPTQ models

Using GPTQ models requires a bit more work than simply downloading and creating a model file as with .gguf. See the following reference:

  1. https://github.com/ollama/ollama/blob/main/docs/import.md -- Installing and quantizing gptq models

Apparently this just produces another CPU-bound model from a GPTQ model. The intention here is to fully utilize an NVIDIA Titan RTX and its 24 GB of VRAM. Trying something else...

This looks like it must be run in Python; see the references below for a starting point (a minimal loading sketch follows the list):

  1. https://www.reddit.com/r/LocalLLaMA/comments/13b3s4f/how_to_run_starcodergptq4bit128g/ -- Run .gptq locally?
  2. https://huggingface.co/TheBloke/medalpaca-13B-GPTQ/discussions/3#:~:text=To%20load%20GPTQ%20models%20from,have%20the%20CUDA%20toolkit%20installed. -- Run GPTQ models locally (theBloke).
  3. https://github.com/oobabooga/text-generation-webui -- Oobabooga runs GPTQ
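Following references 1 and 2, a minimal loading sketch using the AutoGPTQ package (the repo name and generation parameters are illustrative):

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Any GPTQ repo from TheBloke should follow this pattern; this model
# name is illustrative.
repo = "TheBloke/medalpaca-13B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    repo,
    device="cuda:0",        # keep the whole quantized model in VRAM
    use_safetensors=True,
)

inputs = tokenizer("Why is the sky blue?", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))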

Okay, so oobabooga/text-generation-webui works for running models fast on the GPU. Getting extensions working is still a WIP.

Oobabooga/text-generation-webui Extensions List

  1. https://github.com/oobabooga/text-generation-webui-extensions -- List of extensions available

Superbooga (SuperboogaV2)

Getting the following errors when attempting to load the superbooga extension:

01:45:50-406664 ERROR    Failed to load the extension "superbooga".                                                                                                                                                                          
Traceback (most recent call last):
  File "/home/sparkone/sdb/projects/python/text-generation-webui/modules/extensions.py", line 36, in load_extensions
    exec(f"import extensions.{name}.script")
  File "<string>", line 1, in <module>
  File "/home/sparkone/sdb/projects/python/text-generation-webui/extensions/superbooga/script.py", line 10, in <module>
    from .chromadb import add_chunks_to_collector, make_collector
  File "/home/sparkone/sdb/projects/python/text-generation-webui/extensions/superbooga/chromadb.py", line 1, in <module>
    import chromadb
  File "/home/sparkone/sdb/projects/python/text-generation-webui/installer_files/env/lib/python3.11/site-packages/chromadb/__init__.py", line 1, in <module>
    import chromadb.config
  File "/home/sparkone/sdb/projects/python/text-generation-webui/installer_files/env/lib/python3.11/site-packages/chromadb/config.py", line 1, in <module>
    from pydantic import BaseSettings
  File "/home/sparkone/sdb/projects/python/text-generation-webui/installer_files/env/lib/python3.11/site-packages/pydantic/__init__.py", line 374, in __getattr__
    return _getattr_migration(attr_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sparkone/sdb/projects/python/text-generation-webui/installer_files/env/lib/python3.11/site-packages/pydantic/_migration.py", line 296, in wrapper
    raise PydanticImportError(
pydantic.errors.PydanticImportError: `BaseSettings` has been moved to the `pydantic-settings` package. See https://docs.pydantic.dev/2.6/migration/#basesettings-has-moved-to-pydantic-settings for more details.

For further information visit https://errors.pydantic.dev/2.6/u/import-error

This error is apparently common across user installs regardless of the operating system. The following version corrections worked:

./cmd_linux.sh ## Load the conda environment to select the correct pip
pip install chromadb==0.3.18 ## Select the compatible version of chromadb for superbooga
pip install pydantic==1.10.14 ## Select the compatible version of pydantic for superbooga

This worked.

NOTE: To use ingested data as context, notebook mode must be utilized and the following tags employed:

"There is technically documentation for it, but you are right that it's not super obvious. Right under the superbooga header there is a thin pane with the text "Click for more information..." if you click on that it will expand and give you a somewhat detailed explanation of how superbooga works.

But I can summarize the key information. For one, superbooga operates differently depending on whether you are using the chat interface or the notebook/default interface. In the chat interface it does not actually use the information you submit to the database, instead it automatically inserts old messages into the database and automatically retrieves them based on your current chat message.

To query information you have submitted you need to change to the notebook/default interface. In those modes you can insert the following three special tokens:"

<|begin-user-input|>
<|end-user-input|>
<|injection-point|>

  1. https://github.com/oobabooga/text-generation-webui/issues/2824 -- Reference for superbooga data injection.
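In notebook mode, a prompt using those tokens might look roughly like this (the question is illustrative; superbooga replaces the injection point with chunks retrieved from the ingested data):

<|begin-user-input|>
What does the ingested document say about context windows?
<|end-user-input|>

<|injection-point|>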

EmotiVoice (Decent Latency)

See:

  1. https://github.com/yhyu13/Emotivoice_TTS -- Installation instructions, after downloading the model using the UI

It looks like this one uses some sort of remote access; it would be preferable to run OpenVoice locally...

TODO: Find or write extension for OpenVoice, as it has decent latency as well

LucidWebSearch

This is an interesting extension which can use OCR models (e.g., nougat) to retrieve and utilize web resources, and optionally parse/generate markup containing mathematical expressions.

See:

  1. https://github.com/RandomInternetPreson/LucidWebSearch?tab=readme-ov-file -- Web Search and Math Search Extension

OpenVoice Extension

After using OpenVoice in a script testing listening and STT/TTS around an Ollama GGUF model, I wanted to implement a custom extension utilizing it for oobabooga/text-generation-webui as well. However, I encountered the following errors after running pip install -r requirements.txt in extensions/OpenVoice:

INFO: pip is looking at multiple versions of wavmark to determine which version is compatible with other requirements. This could take a while.
ERROR: Ignored the following yanked versions: 2.0.0
ERROR: Ignored the following versions that require a different python version: 1.21.2 Requires-Python >=3.7,<3.11; 1.21.3 Requires-Python >=3.7,<3.11; 1.21.4 Requires-Python >=3.7,<3.11; 1.21.5 Requires-Python >=3.7,<3.11; 1.21.6 Requires-Python >=3.7,<3.11
ERROR: Could not find a version that satisfies the requirement torchaudio<2.0 (from wavmark) (from versions: 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1)
ERROR: No matching distribution found for torchaudio<2.0

Not sure how to proceed from here... Is it possible to use two different virtual environments in the same running instance of the webui?

TODO: Check and determine whether this is possible/practical.

Context Window Research and Methods

The context window size is an important factor in the practical application of models; to that end, various articles are included as references for these research notes.

  1. https://medium.com/@ddxzzx/why-and-how-to-achieve-longer-context-windows-for-llms-5f76f8656ea9#:~:text=Once%20we%20have%20efficiently%20incorporated,fit%20the%20new%20context%20length. -- Article on RoPE and other context enlargement methods
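As a concrete example of one method from that article, linear position interpolation simply compresses position indices by train-length/target-length before applying RoPE. A minimal sketch, with the head dimension and context lengths illustrative:

import torch

def rope_angles(dim, positions, base=10000.0, scale=1.0):
    """Standard RoPE cos/sin tables; scale < 1 compresses positions
    (linear position interpolation) so a longer context maps into the
    position range the model saw during training."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(positions.float() * scale, inv_freq)
    return torch.cos(angles), torch.sin(angles)

# e.g. stretching a model trained at 4096 tokens to an 8192 window:
cos, sin = rope_angles(128, torch.arange(8192), scale=4096 / 8192)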

Research References

  1. https://en.wikipedia.org/wiki/DeepSpeed -- DeepSpeed Wiki
