ARTIFICIAL INTELLIGENCE ASSISTANT

JARVIS AI
Multi-Model Assistant

Advanced AI assistant system integrating multiple language models (Groq, Cohere, Hugging Face) with voice recognition, natural language processing, task automation, real-time web search, image generation, and intelligent decision-making. Features speech-to-text, text-to-speech, and comprehensive system control.

3 AI APIs
Integrated
9 Modules
Components
Voice I/O
STT + TTS
Project Type Full-Stack AI System
Primary Language Python 3.x
AI Models Llama3-70B, Command-R-Plus, SD-XL
APIs Used Groq, Cohere, Hugging Face
Status βœ“ Functional

01 System Overview

🎯

Project Vision

Inspired by the fictional JARVIS from Iron Man, this project implements a real-world AI assistant capable of understanding voice commands, executing system tasks, generating content, searching the web, creating images, and maintaining natural conversations through advanced language models.

🧠

Multi-AI Architecture

Integrates three distinct AI systems: Groq's Llama3-70B for conversational intelligence and real-time search, Cohere's Command-R-Plus for decision-making and query classification, and Hugging Face's Stable Diffusion XL for high-quality image generation.

βš™οΈ

Core Capabilities

Voice recognition with multi-language support, natural language understanding, automated task execution (app control, web browsing, YouTube playback), content generation (essays, code, letters), real-time information retrieval, and AI-powered image creation.

02 System Architecture

Input Layer - Voice/Text
Speech-to-Text (Selenium + Web Speech API) Multi-Language Translation (mtranslate) Query Normalization
↓ Raw Query ↓
Decision Layer - Intent Classification
Cohere Command-R-Plus Query Type Detection Task Routing
↓ Classified Tasks ↓
Execution Layer - Task Processing
General Chatbot (Groq Llama3) Real-time Search (Google + Groq) Automation (Apps, YouTube, System) Content Generation (Groq Mixtral) Image Generation (Hugging Face SD-XL)
↓ Response ↓
Output Layer - Response Delivery
Text-to-Speech (edge-tts + pygame) GUI Display Task Execution Feedback

03 Core Modules & Components

Module 1
βœ“ Decision Engine

Model.py - Intent Classification Layer

AI Model Cohere Command-R-Plus
Purpose Query Type Detection
Function FirstLayerDMM()
Output Task List with Categories
Temperature 0.7 (Balanced)
βœ“ Classifies queries into 12+ task types
βœ“ Handles multi-task requests
βœ“ Routes to appropriate modules
β†’ Recognized intents: general, realtime, open, close, play, image generation, system control, content creation
β†’ Streaming response for real-time classification
β†’ Context-aware with chat history
Module 2
βœ“ Conversational AI

Chatbot.py - General Conversation Handler

AI Model Groq Llama3-70B-8192
Context 8,192 tokens
Function ChatBot()
Max Tokens 1,024 per response
Storage ChatLog.json (persistent)
βœ“ Maintains conversation history
βœ“ Real-time date/time awareness
βœ“ Context-aware responses
β†’ Streaming response generation
β†’ Automatic error recovery (chat log reset)
β†’ Professional English responses
Module 3
βœ“ Web Search

RealtimeSearchEngine.py - Live Information Retrieval

Search API Google Search (googlesearch-python)
AI Model Groq Llama3-70B
Results Top 5 search results
Function RealtimeSearchEngine()
Max Tokens 2,048 per response
βœ“ Fetches live web data
βœ“ AI-powered answer synthesis
βœ“ Source attribution
β†’ Combines Google search with AI summarization
β†’ Real-time date/time injection
β†’ Professional formatting
Module 4
βœ“ Task Control

Automation.py - System & App Control

App Control AppOpener (open/close apps)
Web Control webbrowser, pywhatkit
System keyboard (volume control)
Content Groq Mixtral-8x7B
Async Parallel task execution
βœ“ Open/close applications
βœ“ YouTube playback & search
βœ“ Google search automation
βœ“ Content generation (essays, code, letters)
βœ“ System volume control
β†’ AsyncIO for concurrent task execution
β†’ Fuzzy app name matching
β†’ Auto-saves generated content to Notepad
Module 5
βœ“ Image AI

ImageGeneration.py - AI Image Creation

Model Stable Diffusion XL Base 1.0
API Hugging Face Inference API
Batch Size 4 images per prompt
Quality 4K, Ultra High Details
Async Parallel generation
βœ“ Generate 4 variations per prompt
βœ“ Random seed diversity
βœ“ Auto-display generated images
β†’ AsyncIO concurrent image generation
β†’ High-quality 4K output
β†’ Automatic file management
Module 6
βœ“ Voice Input

SpeechToText.py - Voice Recognition

Engine Web Speech API (WebKit)
Driver Selenium WebDriver (Chrome)
Translation mtranslate (multi-language)
Mode Headless browser
Languages Configurable (English default)
βœ“ Continuous voice recognition
βœ“ Multi-language support
βœ“ Auto-translation to English
β†’ Query formatting & punctuation
β†’ Question detection & modification
β†’ Real-time status updates
Module 7
βœ“ Voice Output

TextToSpeech.py - Natural Voice Synthesis

Engine edge-tts (Microsoft Edge TTS)
Playback pygame mixer
Voice Configurable (multiple accents)
Pitch +5Hz (natural tone)
Speed +13% (optimized)
βœ“ Natural-sounding speech
βœ“ Long text handling
βœ“ Async audio generation
β†’ Smart truncation for long responses
β†’ Fallback messages for extended text
β†’ Audio cleanup & file management

04 Interactive System Demo

Experience a web-based demonstration of the JARVIS AI interface

Launch Interactive Demo β†’

05 Technology Stack & Dependencies

🐍

Python Libraries

AI/ML: groq, cohere, requests
Voice: edge-tts, pygame, mtranslate
Web: selenium, webdriver-manager
Automation: AppOpener, pywhatkit, keyboard
Parsing: beautifulsoup4, googlesearch-python
Async: asyncio (built-in)
Utils: python-dotenv, Pillow, rich

πŸ”‘

API Integration

Groq Cloud: Llama3-70B-8192, Mixtral-8x7B-32768
Cohere: Command-R-Plus (decision-making)
Hugging Face: Stable Diffusion XL Base 1.0
Google Search: Real-time information
Microsoft Edge TTS: Natural voice synthesis
Web Speech API: Browser-based STT

βš™οΈ

System Features

Architecture: Modular, async-capable
Storage: JSON chat logs, file-based state
Security: .env for API keys
Error Handling: Auto-recovery mechanisms
Performance: Parallel task execution
Scalability: Streaming responses

06 Technical Achievements

01

Multi-AI Integration

Successfully integrated three distinct AI systems (Groq, Cohere, Hugging Face) into a unified architecture. Implemented intelligent routing between models based on query type, leveraging each AI's strengths: Llama3 for conversation, Command-R-Plus for decision-making, and Stable Diffusion for image generation.

02

Voice Intelligence

Developed comprehensive voice I/O system combining Web Speech API for recognition with edge-tts for natural synthesis. Supports multi-language input with automatic translation, intelligent query formatting, and context-aware voice responses with smart truncation for long-form content.

03

Task Automation Engine

Built robust automation framework capable of controlling system applications, managing web browsers, playing YouTube content, generating professional documents, and executing system commands. Implements parallel task execution using AsyncIO for efficient multi-task handling.

07 Real-World Applications

πŸ’¬ Conversational AI Assistant

Natural language conversations with context retention. Can discuss topics, answer questions, provide recommendations, and maintain multi-turn dialogues with persistent chat history.

Example: "What's the weather like today?" β†’ Real-time search + AI summary
Example: "Tell me about quantum computing" β†’ Conversational response

πŸ” Information Retrieval

Real-time web search with AI-powered synthesis. Fetches current information from Google, processes multiple sources, and generates comprehensive answers with proper attribution.

Example: "Who won the latest Formula 1 race?" β†’ Live search + summary
Example: "What are the top tech news today?" β†’ Multi-source aggregation

πŸ–ΌοΈ Content Creation

Generate text documents (essays, applications, letters, code) and AI images. Text generation uses Mixtral-8x7B for high-quality content, saved automatically to Notepad. Image generation produces 4 high-quality variations using Stable Diffusion XL.

Example: "Write an application for leave" β†’ Generated document in Notepad
Example: "Generate image of futuristic city" β†’ 4 unique 4K images

βš™οΈ System Control

Voice-controlled system automation. Open/close applications, control YouTube playback, manage volume, execute Google searches, and navigate the webβ€”all through natural voice commands.

Example: "Open Chrome and play music on YouTube" β†’ Multi-task execution
Example: "Mute volume" β†’ System command execution