Advanced AI assistant system integrating multiple language models (Groq, Cohere, Hugging Face) with voice recognition, natural language processing, task automation, real-time web search, image generation, and intelligent decision-making. Features speech-to-text, text-to-speech, and comprehensive system control.
Inspired by the fictional JARVIS from Iron Man, this project implements a real-world AI assistant capable of understanding voice commands, executing system tasks, generating content, searching the web, creating images, and maintaining natural conversations through advanced language models.
Integrates three distinct AI systems: Groq's Llama3-70B for conversational intelligence and real-time search, Cohere's Command-R-Plus for decision-making and query classification, and Hugging Face's Stable Diffusion XL for high-quality image generation.
Voice recognition with multi-language support, natural language understanding, automated task execution (app control, web browsing, YouTube playback), content generation (essays, code, letters), real-time information retrieval, and AI-powered image creation.
Experience a web-based demonstration of the JARVIS AI interface
AI/ML: groq, cohere, requests
Voice: edge-tts, pygame, mtranslate
Web: selenium, webdriver-manager
Automation: AppOpener, pywhatkit, keyboard
Parsing: beautifulsoup4, googlesearch-python
Async: asyncio (built-in)
Utils: python-dotenv, Pillow, rich
Groq Cloud: Llama3-70B-8192, Mixtral-8x7B-32768
Cohere: Command-R-Plus (decision-making)
Hugging Face: Stable Diffusion XL Base 1.0
Google Search: Real-time information
Microsoft Edge TTS: Natural voice synthesis
Web Speech API: Browser-based STT
Architecture: Modular, async-capable
Storage: JSON chat logs, file-based state
Security: .env for API keys
Error Handling: Auto-recovery mechanisms
Performance: Parallel task execution
Scalability: Streaming responses
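The `.env` entry in the list above can be sketched as follows. This is a minimal sketch, not the project's actual loader: the key names in `REQUIRED_KEYS` and the function `load_api_keys` are hypothetical, and `python-dotenv` is treated as optional so the snippet still runs where the package is not installed.

```python
import os

try:
    # python-dotenv is listed under Utils; load_dotenv reads KEY=VALUE pairs
    # from a .env file into the process environment.
    from dotenv import load_dotenv
except ImportError:
    load_dotenv = None

# Hypothetical variable names; the project may use different ones.
REQUIRED_KEYS = ("GROQ_API_KEY", "COHERE_API_KEY", "HUGGINGFACE_API_KEY")


def load_api_keys(required=REQUIRED_KEYS):
    """Return the required API keys, failing early if any are missing."""
    if load_dotenv is not None:
        # By default, variables already set in the environment are not overridden.
        load_dotenv()
    missing = [key for key in required if not os.environ.get(key)]
    if missing:
        raise RuntimeError("Missing API keys: " + ", ".join(missing))
    return {key: os.environ[key] for key in required}
```

Failing at startup with a list of the missing names is usually easier to debug than an authentication error raised later by one of the API clients.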
Successfully integrated three distinct AI systems (Groq, Cohere, Hugging Face) into a unified architecture. Implemented intelligent routing between models based on query type, leveraging each AI's strengths: Llama3 for conversation, Command-R-Plus for decision-making, and Stable Diffusion for image generation.
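The routing idea described above can be illustrated with a small dispatcher. This is a sketch under stated assumptions: the route labels (`general`, `realtime`, `image`), the `keyword_classifier` stand-in, and the stub handlers are all hypothetical. In the project the classification step is handled by Command-R-Plus, and the handlers would call the respective model APIs.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def keyword_classifier(query: str) -> str:
    """Keyword-based stand-in for the model-based classifier (hypothetical labels)."""
    q = query.lower().strip()
    if q.startswith(("generate image", "draw ", "create an image")):
        return "image"
    if any(word in q for word in ("today", "latest", "news", "current")):
        return "realtime"
    return "general"


def route(query: str, classify: Callable[[str], str], handlers: Dict[str, Handler]) -> str:
    """Classify the query, then dispatch to the matching handler."""
    label = classify(query)
    # Unknown labels fall back to the general conversation handler.
    handler = handlers.get(label, handlers["general"])
    return handler(query)


# Stub handlers; real ones would call the chat, search, and image backends.
handlers: Dict[str, Handler] = {
    "general": lambda q: f"[chat model] {q}",
    "realtime": lambda q: f"[search + chat model] {q}",
    "image": lambda q: f"[image model] {q}",
}
```

Passing the classifier in as a callable keeps the dispatcher independent of any one provider, so the keyword stand-in can be swapped for a model call without touching `route`.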
Developed a comprehensive voice I/O system that combines the Web Speech API for recognition with edge-tts for natural-sounding synthesis. It supports multi-language input with automatic translation, intelligent query formatting, and context-aware voice responses that truncate long-form content before it is spoken.
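One way to truncate long answers for speech, as mentioned above, is to speak only the leading sentences that fit a character budget and then point the user to the screen. The function name `truncate_for_speech`, the 250-character default, and the notice text are assumptions made for this sketch; the project's actual rule may differ.

```python
import re


def truncate_for_speech(text: str, max_chars: int = 250,
                        notice: str = "The rest of the answer is on the screen.") -> str:
    """Return text unchanged if short; otherwise keep whole leading sentences."""
    text = text.strip()
    if len(text) <= max_chars:
        return text

    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    spoken = []
    length = 0
    for sentence in sentences:
        added = len(sentence) + (1 if spoken else 0)  # +1 for the joining space
        if length + added > max_chars:
            break
        spoken.append(sentence)
        length += added

    if not spoken:
        # The first sentence alone exceeds the budget: cut at a word boundary.
        spoken = [text[:max_chars].rsplit(" ", 1)[0]]

    return " ".join(spoken) + " " + notice
```

Cutting on sentence boundaries avoids the synthesizer stopping mid-sentence, which tends to sound more abrupt than a shorter but complete reply.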
Built a robust automation framework capable of controlling system applications, managing web browsers, playing YouTube content, generating professional documents, and executing system commands. It runs tasks in parallel with asyncio for efficient multi-task handling.
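Parallel execution of several automation commands can be sketched with `asyncio.gather` and `asyncio.to_thread` (Python 3.9+). The functions `open_app` and `play_youtube` below are hypothetical placeholders that only sleep; in the project they would wrap blocking calls into the automation libraries.

```python
import asyncio
import time


def open_app(name: str) -> str:
    time.sleep(0.1)  # placeholder for a blocking call that launches an app
    return f"opened {name}"


def play_youtube(query: str) -> str:
    time.sleep(0.1)  # placeholder for a blocking call that starts playback
    return f"playing {query}"


async def execute_commands(commands):
    """Run blocking command functions concurrently in worker threads.

    `commands` is a list of (function, argument) pairs. Results come back in
    the same order; with return_exceptions=True a failing command yields its
    exception object instead of cancelling the others.
    """
    tasks = [asyncio.to_thread(func, arg) for func, arg in commands]
    return await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == "__main__":
    results = asyncio.run(
        execute_commands([(open_app, "notepad"), (play_youtube, "lofi music")])
    )
    print(results)
```

Because the placeholder functions block, they are moved to threads; coroutines that are already asynchronous could be passed to `asyncio.gather` directly.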
Natural language conversations with context retention. Can discuss topics, answer questions, provide recommendations, and maintain multi-turn dialogues with persistent chat history.
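Persistent chat history in a JSON file, as described above, might look like the following. The message shape (`role` and `content` fields), the helper names, and the `max_messages` cap are assumptions for illustration; the source only states that chat logs are stored as JSON.

```python
import json
from pathlib import Path


def load_history(path):
    """Load the chat log, returning an empty list if it is missing or corrupt."""
    log_file = Path(path)
    if not log_file.exists():
        return []
    try:
        return json.loads(log_file.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return []


def append_turn(path, role, content, max_messages=50):
    """Append one message and keep only the most recent max_messages entries."""
    history = load_history(path)
    history.append({"role": role, "content": content})
    history = history[-max_messages:]
    Path(path).write_text(
        json.dumps(history, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return history
```

Capping the log keeps the context sent to the language model bounded, at the cost of forgetting the oldest turns.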
Real-time web search with AI-powered synthesis. Fetches current information from Google, processes multiple sources, and generates comprehensive answers with proper attribution.
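The synthesis step described above can be approximated by packing search results into a numbered context block so the model can cite its sources. `SearchResult` and `build_search_prompt` are hypothetical names, and the prompt wording is an assumption; fetching the results and calling the language model are left out of this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SearchResult:
    title: str
    url: str
    snippet: str


def build_search_prompt(query: str, results: List[SearchResult],
                        max_results: int = 5) -> str:
    """Format the top results as numbered sources followed by the question."""
    sources = []
    for index, result in enumerate(results[:max_results], start=1):
        sources.append(f"[{index}] {result.title} ({result.url})\n{result.snippet}")
    context = "\n\n".join(sources) if sources else "No search results were found."
    return (
        "Answer the question using only the sources below. "
        "Cite sources by their bracketed number.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}"
    )
```

Numbering the sources gives the model a compact way to attribute each claim, and the numbers can be mapped back to URLs when the answer is displayed.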
Generate text documents (essays, applications, letters, code) and AI images. Text generation uses Mixtral-8x7B for high-quality content, saved automatically to Notepad. Image generation produces 4 high-quality variations using Stable Diffusion XL.
Voice-controlled system automation. Open/close applications, control YouTube playback, manage volume, execute Google searches, and navigate the web, all through natural voice commands.