AudioViz is a Python-based tool that transforms audio content into a sequence of AI-generated images, creating visual narratives from sound. It uses OpenAI models (Whisper for transcription, DALL-E 3 for image generation) and, optionally, Anthropic's Claude for text analysis.
- Smart Audio Processing
  - Automatic audio compression and segmentation
  - Support for various audio formats
  - Intelligent handling of large audio files
- Advanced Content Analysis
  - Speech-to-text transcription using OpenAI's Whisper
  - Context-aware text analysis
  - Automated visual style guide generation
- AI Image Generation (see the pipeline sketch after this list)
  - High-quality image generation using DALL-E 3
  - Consistent visual styling across sequences
  - Customizable image parameters (size, quality, style)
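The features above map onto a simple pipeline: segment the audio, transcribe each segment with Whisper, derive a prompt, and render an image with DALL-E 3. The sketch below illustrates that flow with the OpenAI Python SDK and pydub; the file names and prompt logic are placeholders, and AudioViz's internal implementation may differ.

```python
# Illustrative sketch only; AudioViz's internals may differ.
# Assumes the `openai` and `pydub` packages are installed and
# OPENAI_API_KEY is set in the environment.
from openai import OpenAI
from pydub import AudioSegment

client = OpenAI()

# 1. Segment the audio (pydub uses FFmpeg under the hood).
audio = AudioSegment.from_file("input.wav")
segment = audio[:60_000]  # first 60 seconds
segment.export("segment_01.mp3", format="mp3")

# 2. Transcribe the segment with Whisper.
with open("segment_01.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

# 3. Turn the transcript into an image prompt (a real pipeline would add
#    context-aware analysis and a shared style guide here).
prompt = f"An illustration of the following scene: {transcript.text[:500]}"

# 4. Generate an image with DALL-E 3.
image = client.images.generate(
    model="dall-e-3",
    prompt=prompt,
    size="1024x1024",
    quality="standard",
    style="vivid",
    n=1,
)
print(image.data[0].url)
```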
- YouTube Content Creation
- Podcast Visualization
- Educational Content
- Story Visualization
- Music Video Creation
- Speech Visualization
- Python 3.11+
- FFmpeg (Required for audio processing)
- OpenAI API key
- Anthropic API key (optional, for Claude-based text analysis)
- macOS (using Homebrew):

  ```bash
  brew install ffmpeg
  ```

- Ubuntu/Debian:

  ```bash
  sudo apt-get update
  sudo apt-get install ffmpeg
  ```

- Windows: Download a build from the FFmpeg official website and add it to your PATH
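Whichever route you take, a quick way to confirm FFmpeg ended up on your PATH (a standalone check, unrelated to AudioViz itself):

```python
import shutil

# Prints the resolved FFmpeg path, or a warning if it is not on PATH.
print(shutil.which("ffmpeg") or "ffmpeg not found on PATH")
```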
- Clone the repository:

  ```bash
  git clone https://github.com/aLVINlEE9/AudioViz.git
  cd AudioViz
  ```

- Install dependencies using Poetry:

  ```bash
  poetry install
  ```

- Create a `.env` file:

  ```
  OPENAI_API_KEY=your_openai_api_key
  ANTHROPIC_API_KEY=your_anthropic_api_key  # Optional
  ```
Basic usage example:

```python
from audio_viz import AudioVisualizer

config = {
    "openai_api_key": "your_openai_api_key",
    "text_analysis": {
        "provider": "openai",  # or "anthropic"
        "max_tokens": 1000,
        "max_segments": 5
    },
    "image_generation": {
        "size": "1024x1024",
        "quality": "standard",
        "style": "vivid"
    }
}

visualizer = AudioVisualizer(config, verbose=True)
visualizer.process_audio("input.wav", "output_directory")
```
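The example above hard-codes the API key. If you created a `.env` file during installation, you can load the keys from the environment instead; the sketch below assumes the python-dotenv package is installed and that AudioViz does not read the file automatically:

```python
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads OPENAI_API_KEY / ANTHROPIC_API_KEY from .env

config = {
    "openai_api_key": os.environ["OPENAI_API_KEY"],
    # ... remaining text_analysis / image_generation options as above
}
```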
`text_analysis` options:

- `provider`: Choose between "openai" or "anthropic"
- `max_tokens`: Maximum tokens for the LLM response
- `max_segments`: Target number of visual segments

`image_generation` options:

- `size`: "1024x1024", "1024x1792", or "1792x1024"
- `quality`: "standard" or "hd"
- `style`: "vivid" or "natural"
Planned features:

- Enhanced prompting system for more detailed and accurate scene descriptions
- Video output support (combining audio with generated images)
- Additional image generation models support (Midjourney, Stable Diffusion)
- Custom style presets and templates
- Batch processing support
- Multiple language support
This project is licensed under the MIT License - see the LICENSE file for details.
For support, please open an issue in the repository or contact the maintainers.