Tag

Multimodal

All articles tagged with #multimodal

Google’s AI-Driven Search Overhaul Introduces Agents and Multimodal Capabilities
technology · 1 day ago

Google unveiled a sweeping AI-powered redesign of Search, adding a longer, more conversational interface, multimodal inputs (photos, videos, documents), and autonomous “agents” built on Gemini 3.5 Flash to monitor topics automatically. A new Gemini Spark integrates AI into Gmail and Docs, while a revamped shopping experience surfaces discounts. The changes aim to boost usage and ad-targeting but raise concerns about transparency, user choice, and potential reductions in traffic to third-party sites, fueling debate about the future of the open web.

Google unveils Gemini Omni: a multi-modal AI world model for video creation and editing
technology · 7 days ago

Google has introduced Gemini Omni, a new multimodal AI world-model family led by Gemini Omni Flash. The models accept text, audio, images, and video as input and can generate and edit realistic videos with improved physics. They support avatars and interactive edits via conversation, and outputs are watermarked with SynthID for AI verification. Gemini Omni will roll out to paid Google AI plans before expanding to YouTube Shorts and YouTube Create.

Gemma 4: Google's 2.3B AI Rivals 70B Giants on Minimal RAM
technology · 1 month ago

Google’s Gemma 4 proves you don’t need a giant parameter count to punch above your weight: with 2.3 billion parameters it rivals 70B-scale models in performance while running offline on under 1.5 GB of RAM. The model is designed for edge use, features a 128K context window, and supports text, vision, and audio inputs in a multimodal architecture under an Apache 2.0 open-source license. It also handles 140 languages and emphasizes privacy and offline operation, making it suitable for resource-constrained devices. Notable caveats include coding task limitations and incomplete iOS bindings, with ongoing community-driven improvements anticipated.

Gemma 4 Goes Local: Google's Offline, Multimodal AI Hits Your Device
technology · 1 month ago

Google’s Gemma 4 is an open-source, locally installable multimodal AI that runs on smartphones and laptops without cloud processing, prioritizing privacy and cost efficiency. It comes in a dense 31B-parameter architecture and a sparse 26B-parameter architecture, the latter using a mixture-of-experts approach to balance performance and efficiency. Capable of text, image, and audio processing, it targets coding, creative writing, UI design, healthcare, and education, with deployment support through LM Studio, Ollama, llama.cpp, and Supabase. Four device-optimized versions offer offline functionality and reduced reliance on cloud services, making private AI more accessible in areas with limited connectivity.

Meta launches Muse Spark to power AI across its apps and devices
tech · 1 month ago

Meta unveiled Muse Spark, the first model in its new Muse series, which will run in the Meta AI app and website in the US and roll out to WhatsApp, Instagram, Facebook, Messenger, and Meta’s smart glasses. The multimodal, multi‑agent system will be available to partners via API, supports text and image input, and offers Instant and Thinking modes for faster or deeper responses. Meta says Muse Spark can tackle complex science, math, and health questions and aims to compete with OpenAI’s ChatGPT Health and Anthropic’s Claude. Future versions are planned to be open-sourced as the company expands its AI efforts after its earlier Llama initiatives.

Meta launches Muse Spark, the Alexandr Wang–led AI model
technology · 1 month ago

Meta unveiled Muse Spark, its first AI model developed under Alexandr Wang, aiming to narrow the gap with OpenAI and Anthropic. The model will power queries in the Meta AI app and on Meta.ai, with plans to expand to Facebook, Instagram, and WhatsApp. It accepts voice, text, and image inputs but produces only text output. It includes a fast query mode and a 'shopping mode' tied to user data, and an open-source version is planned. While competitive on several tasks, it lags in areas such as coding. All versions are free but may be rate-limited, and Meta’s privacy policy indicates broad data use for AI features.

Gemini 3.1 Pro powers deep work with 7 actionable prompts
technology · 3 months ago

Google's Gemini 3.1 Pro is pitched as a versatile, multi-modal AI capable of long-context, multi-step reasoning. The article tests its capabilities and presents seven concrete prompts—long-document extraction, logic stress testing, code architecture, getting unstuck fast, contextual video analysis, synthetic data generation, and deep research synthesis—to turn the model into an execution engine for productive, deep work.

Apple Develops Unified AI Model for Vision, Creation, and Editing
technology · 5 months ago

Apple researchers have developed UniGen-1.5, a unified AI model that understands, generates, and edits images within a single system. It extends previous models with new editing capabilities and improved instruction alignment and achieves state-of-the-art performance on various benchmarks, though it still struggles with text generation and identity consistency.

Apple Unveils 2025 Foundation Language Models Report
technology · 10 months ago

Apple has developed two advanced multilingual, multimodal foundation language models: a 3-billion-parameter on-device model optimized for Apple silicon and a scalable server model using a novel PT-MoE transformer, both supporting multiple languages and image understanding. These models power Apple Intelligence features across devices and services, with a focus on responsible AI, privacy, and developer integration through a new Swift-based framework. They outperform comparable open models in benchmarks and human evaluations, enhancing user experiences with efficient, accurate, and responsible AI capabilities.

Google AI Mode Launches in India as First International Expansion
technology · 11 months ago

Google is expanding its AI Mode feature to India, marking its first international launch after debuting in the US. Powered by Gemini 2.5, AI Mode enhances search with advanced reasoning, multimodal inputs including voice and images, and real-time data sources. Users in India can now access AI Mode via the Google app or web, enabling more interactive and detailed search experiences.

Mistral Unveils Devstral and Agent Frameworks for AI Coding and Enterprise Solutions
technology · 1 year ago

Mistral AI has launched an API enabling developers to create customizable AI agents capable of tasks like code execution, image generation, and web search, with features supporting complex workflows and real-time interactions, aimed at enterprise and developer use. The API enhances AI capabilities beyond traditional language models by integrating real-world data sources and managing multiple agents, positioning Mistral as a key player in enterprise AI solutions. However, the proprietary nature of the models and API may influence adoption decisions.

Google's Gemini 2.0: Pioneering the Agentic AI Era
technology · 1 year ago

Google has launched Gemini 2.0, an advanced AI model capable of generating text, images, and speech while processing various input types. The Gemini 2.0 Flash model, part of this new family, offers enhanced performance and speed compared to its predecessor. It is initially available to developers, and its full features will be accessible to early access partners by January 2025. Google is integrating this technology into its products and has implemented SynthID watermarking to prevent misuse of AI-generated content. The company emphasizes the development of "agentic" AI systems that can autonomously perform tasks with user supervision.

Google's Gemini 2.0: Pioneering AI with Text, Image, and Speech Generation
technology · 1 year ago

Google has unveiled Gemini 2.0 Flash, its latest AI model capable of generating text, images, and audio, and interacting with third-party apps. The model, which is twice as fast as its predecessor, will initially be available to early access partners, with a broader release planned for January. It features enhanced capabilities in coding and image analysis, and uses SynthID technology to watermark outputs to prevent misuse. Google is also launching the Multimodal Live API for developers to create real-time apps with audio and video streaming.

OpenAI Hires Leading Engineers from DeepMind
technology · 1 year ago

OpenAI has hired three senior engineers from Google DeepMind to work on multimodal AI at its new Zurich office. The hires, Lucas Beyer, Alexander Kolesnikov, and Xiaohua Zhai, reflect the intense competition among AI companies to secure top talent. OpenAI, known for its advancements in multimodal AI, is expanding globally with new offices planned in several cities. The move comes amid a broader trend of high-profile talent shifts in the AI industry, as companies like Microsoft and Google also engage in aggressive recruitment strategies.