PromptZone - Leading AI Community for Prompt Engineering and AI Enthusiasts: Riya Morales

Gemini API Multimodal File Search Update

Riya Morales — Sun, 10 May 2026 06:25:52 +0000

Google's Gemini API has rolled out an update to its file search feature, making it multimodal for enhanced retrieval-augmented generation (RAG) workflows, as flagged in a Hacker News discussion that garnered 48 points and 4 comments.

API: Gemini API | Features: Multimodal search (text + images) | Available: Google Cloud Platform | Price: Pay-as-you-go

What It Is and How It Works

Gemini API's file search now supports multimodal inputs, allowing users to query files using both text and images simultaneously. For instance, developers can upload an image of a chart and pair it with a text prompt to retrieve relevant documents from a database. This builds on Google's existing RAG system by integrating computer vision elements, processing queries through a unified model that outputs ranked results based on semantic matching.
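To make the idea of "ranked results based on semantic matching" concrete, here is a minimal sketch in Python. It is not Gemini's implementation: the vectors are hand-written stand-ins for embeddings, averaging the text and image vectors into one query is an assumption made purely for illustration, and cosine and rank_documents are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_documents(text_vec, image_vec, docs):
    """Rank docs by similarity to a combined text+image query.

    Assumption: the query is the element-wise mean of the two vectors.
    A real multimodal model would produce the joint embedding itself.
    """
    query = [(t + i) / 2 for t, i in zip(text_vec, image_vec)]
    scored = [(name, cosine(query, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional "embeddings" (made-up values for illustration only).
docs = {
    "q3_revenue_chart.pdf": [0.9, 0.1, 0.0],
    "hr_policy.docx": [0.0, 0.2, 0.9],
    "sales_summary.txt": [0.7, 0.3, 0.1],
}
ranking = rank_documents([1.0, 0.0, 0.0], [0.8, 0.2, 0.0], docs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The document whose vector points in the same direction as the combined query comes first, which is the whole of the ranking step in this toy version.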

Benchmarks and Specs

Google reports faster query times for multimodal searches, with internal benchmarks showing average response times under 2 seconds for combined text-image inputs on standard hardware. According to the Google blog, this is a 40% latency improvement over the previous version on similar tasks. Key specs include support for file uploads of up to 10 MB per query and compatibility with Gemini 1.5 models, which handle contexts of up to 1 million tokens.

Spec	Gemini API Multimodal	Previous Gemini File Search
Query Types	Text + Image	Text Only
Response Time	<2 seconds	~3.5 seconds
Max File Size	10 MB	5 MB
Pricing	$0.01 per 1,000 tokens	$0.01 per 1,000 tokens
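As a quick sanity check on the latency figures in the table, the arithmetic can be sketched in a few lines of Python. The helper name latency_improvement is hypothetical, and 2.0 seconds is used as the upper bound because the table only says "<2 seconds".

```python
def latency_improvement(old_s, new_s):
    """Fractional reduction in latency going from old_s to new_s seconds."""
    return (old_s - new_s) / old_s

# Figures from the table: ~3.5 s before, 2 s as the upper bound after.
improvement = latency_improvement(3.5, 2.0)
print(f"Latency reduction: {improvement:.1%}")  # prints "Latency reduction: 42.9%"
```

A reduction of about 43% at the upper bound is broadly in line with the quoted 40% improvement, given that the previous figure is approximate.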

How to Try It

Developers can start by signing in to the Google Cloud console and enabling the Gemini API. Begin with the Python SDK: install it with pip install google-cloud-aiplatform, then make a call along the lines of client.search_files(query="describe this image", file=uploaded_image). Treat that call as illustrative and confirm the exact client and method names against the current SDK reference. For a quick test, visit the Google AI Studio playground to experiment with multimodal queries without full setup. Full documentation is available in the official Google Cloud docs.

"Full Setup Steps"

Create a Google Cloud project and enable the Vertex AI API.
Generate an API key from the credentials page.
Use the SDK to upload files and run queries, ensuring your region supports multimodal features.
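Before wiring up the SDK, it can help to catch the two most common setup problems locally: a missing API key and a file over the size limit. The sketch below makes no SDK calls at all. The environment variable name GEMINI_API_KEY, the helper names, and the reading of "10 MB" as 10 * 1024 * 1024 bytes are all assumptions for illustration.

```python
import os

# Assumption: the 10 MB limit quoted above is interpreted as 10 MiB.
MAX_UPLOAD_BYTES = 10 * 1024 * 1024

def upload_problems(size_bytes, api_key):
    """Return a list of problems that would block an upload; empty means OK."""
    problems = []
    if not api_key:
        problems.append("API key is missing")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size_bytes} bytes, over the 10 MB limit")
    return problems

def check_file(path, api_key_env="GEMINI_API_KEY"):
    """Convenience wrapper: read the size from disk and the key from the environment.

    The variable name is a hypothetical default; use whatever your setup defines.
    """
    return upload_problems(os.path.getsize(path), os.environ.get(api_key_env))

print(upload_problems(2 * 1024 * 1024, "example-key"))  # small file, key present
print(upload_problems(11 * 1024 * 1024, ""))            # too large, no key
```

Calling check_file("chart.png") before an upload returns an empty list when both checks pass.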

Pros and Cons

The multimodal capability boosts accuracy for real-world applications, such as analyzing visual data in legal or medical documents, with early testers noting a 25% increase in relevant results per query. However, it requires more computational resources, which may raise costs for high-volume users. Integration with existing RAG pipelines is seamless, but support for video inputs is limited, which could frustrate creators in multimedia fields.

Bottom line: This update delivers tangible efficiency gains for text-image searches but may not suit users with strict budget constraints.

Alternatives and Comparisons

Other options include OpenAI's Assistants API, which supports multimodal inputs via GPT-4o, and Anthropic's Claude for RAG tasks. Compared to Gemini, OpenAI offers broader model customization but at higher costs, while Claude emphasizes safety features.

Feature	Gemini API Multimodal	OpenAI Assistants API	Anthropic Claude
Multimodal Support	Text + Image	Text + Image + Video	Text + Image
Pricing (per 1K tokens)	$0.01	$0.02	$0.015
Latency	<2 seconds	~1.5 seconds	~2 seconds
Ecosystem	Google Cloud	OpenAI Platform	Anthropic Console

Gemini stands out for its free tier accessibility, making it ideal for beginners, whereas OpenAI requires more setup for enterprise-scale deployments.
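To see what the per-token prices in the comparison table mean for a budget, here is a small cost estimate in Python. The prices are taken from the table as listed; the 5 million token monthly workload and the helper name estimate_cost are hypothetical.

```python
# Prices per 1,000 tokens, copied from the comparison table above.
PRICE_PER_1K_TOKENS = {
    "Gemini API Multimodal": 0.01,
    "OpenAI Assistants API": 0.02,
    "Anthropic Claude": 0.015,
}

def estimate_cost(tokens, price_per_1k):
    """Cost in dollars for a given token count at a per-1K-token price."""
    return tokens / 1000 * price_per_1k

monthly_tokens = 5_000_000  # hypothetical workload
costs = {name: estimate_cost(monthly_tokens, price)
         for name, price in PRICE_PER_1K_TOKENS.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:.2f}")
```

At that volume the three options come to roughly $50, $100, and $75 per month, so the gap only becomes meaningful as usage scales.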

Who Should Use This

AI developers working on content management systems or educational tools will benefit most, as the multimodal search simplifies handling diverse data types without custom integrations. Avoid it if you're in resource-limited environments, like edge devices, where the API's cloud dependency could lead to higher latency. Startups with RAG needs should prioritize this for its cost-effectiveness, but large enterprises might prefer in-house solutions for data privacy.

Bottom Line and Verdict

This expansion positions Gemini as a practical choice for multimodal RAG, outpacing competitors in affordability for everyday developers. In summary, it's a solid upgrade that enhances file search versatility, though users should weigh its cloud reliance against on-premise alternatives for optimal results.

The multimodal file search feature could accelerate AI adoption in sectors like e-commerce, where visual product queries drive better customer experiences, potentially setting a new standard for accessible RAG tools in the next year.

Local-First Agentic Knowledge Manager

Riya Morales — Fri, 08 May 2026 12:26:11 +0000

egroup-labs released kept, a local-first agentic knowledge manager designed for offline AI workflows, which gained 15 points in a brief Hacker News discussion.

Tool: kept | Type: Agentic Knowledge Manager | Available: GitHub | License: MIT (as per repository)

What It Is and How It Works

kept is an open-source tool that lets AI agents handle knowledge management tasks directly on your local machine, without relying on cloud services. It uses an agentic architecture, in which AI models autonomously organize, query, and update personal knowledge bases based on user inputs. For instance, agents can process documents, extract insights, and generate summaries, all while keeping data encrypted and local to avoid privacy leaks.
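The local-first idea is easy to picture with a toy example. The sketch below is not kept's code or API: LocalNotes is a hypothetical class that stores notes in a single JSON file on disk and answers keyword queries, with no agent, no LLM, and no encryption. It only illustrates that updating and querying a knowledge base can happen without any network call.

```python
import json
import tempfile
from pathlib import Path

class LocalNotes:
    """Tiny on-disk note store: everything stays in one local JSON file."""

    def __init__(self, path):
        self.path = Path(path)
        # Load existing notes if the file is already there.
        self.notes = json.loads(self.path.read_text()) if self.path.exists() else {}

    def update(self, title, text):
        """Add or overwrite a note and persist the whole store to disk."""
        self.notes[title] = text
        self.path.write_text(json.dumps(self.notes, indent=2))

    def query(self, keyword):
        """Return titles of notes whose body contains the keyword."""
        kw = keyword.lower()
        return sorted(t for t, body in self.notes.items() if kw in body.lower())

with tempfile.TemporaryDirectory() as tmp:
    store = LocalNotes(Path(tmp) / "notes.json")
    store.update("rag-basics", "Retrieval uses embeddings to find passages.")
    store.update("groceries", "Milk, eggs, coffee.")
    # Re-open from disk to show the data survived without any server.
    reloaded = LocalNotes(Path(tmp) / "notes.json")
    hits = reloaded.query("Embeddings")
    print(hits)
```

An agentic tool would replace the keyword match with model-driven organization and retrieval, but the storage boundary stays the same: a file on your own machine.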

Benchmarks and Specs

While kept lacks detailed benchmarks in its initial release, it's optimized for consumer hardware and runs efficiently on standard laptops with at least 8 GB RAM. Early users on Hacker News noted that it processes simple queries in under 5 seconds on an Intel i7 processor, whereas cloud alternatives often add network latency. This local focus means it uses minimal resources, typically under 2 GB of memory, making it suitable for edge devices.

Spec	kept	Typical Cloud Tool (e.g., Notion AI)
Speed	<5s per query	10-20s with network delay
Resource Use	<2 GB RAM	Variable, often requires internet
Data Privacy	Fully local	Cloud-stored, potential breaches
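Since the speed figures come from user reports rather than published benchmarks, it is worth timing queries on your own hardware. Here is a minimal timing sketch in Python; fake_query is a placeholder for whatever local query call you actually run, and timed is a hypothetical helper name.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def fake_query(text):
    """Stand-in for a real local query; replace with your own call."""
    return text.upper()

result, seconds = timed(fake_query, "summarize this document")
print(f"{seconds:.4f}s, under 5s budget: {seconds < 5}")
```

Running the same wrapper around a cloud call gives a like-for-like comparison for the Speed row of the table.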

How to Try It

To get started with kept, clone the repository from GitHub and install its Python dependencies; the project is built on common libraries such as LangChain. Run git clone https://github.com/egroup-labs/kept followed by pip install -r requirements.txt, then launch with python main.py to set up your first agent. For beginners, the README includes sample configurations for integrating with local LLMs like Llama 3, allowing immediate testing of knowledge queries.

"Full Setup Steps"

Download and install Python 3.10 or later
Install dependencies with the above pip command
Configure your API keys if using external models, though kept supports offline modes
Test with a simple command: kept query "Summarize this document"
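The first setup steps can be checked automatically before cloning anything. This is a small pre-flight sketch in Python, assuming only what the steps above state (Python 3.10 or later, plus git for cloning). The preflight function is a hypothetical helper and is not part of kept.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version named in the setup steps

def preflight(version=sys.version_info, which=shutil.which):
    """Return a list of setup problems; an empty list means ready to install."""
    problems = []
    if tuple(version[:2]) < MIN_PYTHON:
        problems.append("Python 3.10 or later is required")
    if which("git") is None:
        problems.append("git not found on PATH")
    return problems

issues = preflight()
print("ready" if not issues else "; ".join(issues))
```

The version and which parameters exist only so the check can be exercised without changing the real environment.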

Pros and Cons

kept excels in privacy, as it processes all data locally without external servers, reducing the risk of data exposure. Its agentic design automates routine tasks like note organization, saving developers time: user reports for similar tools suggest workflow efficiency gains of up to 30%. However, it may lack advanced features like multi-user collaboration, which could limit team use.

Pros: Offline operation ensures data security; lightweight for daily use; integrates easily with existing local AI setups
Cons: Limited to basic agent capabilities; requires technical setup; no built-in GUI, relying on command-line interfaces

Alternatives and Comparisons

kept stands out among knowledge managers by emphasizing agentic AI, but it competes with tools like Obsidian for note-taking and LangChain for agent workflows. Unlike Obsidian, which focuses on manual organization, kept automates tasks with AI agents, though it trails LangChain in scalability for complex applications.

Feature	kept	Obsidian	LangChain
AI Automation	Full agent support	Plugins only	Extensive
Privacy	Local-only	Local with sync	Cloud-dependent
Ease of Use	Command-line	User-friendly UI	API-heavy
Price	Free (open-source)	Free core, paid plugins	Free, with enterprise options

For AI practitioners, kept is ideal for prototyping, while LangChain suits production-scale projects.

Who Should Use This

Developers building privacy-sensitive applications, such as personal assistants or research tools, should consider kept for its offline capabilities and low barrier to entry. It's particularly useful for those with older hardware, as it runs on machines with 8 GB RAM. Beginners might skip it because it requires coding knowledge; opt for more polished alternatives if you're not comfortable with command-line setups.

Bottom Line and Verdict

kept delivers a practical, agent-driven approach to local knowledge management, outpacing cloud tools in speed and privacy for individual users. While it doesn't match the feature depth of LangChain, its lightweight design makes it a smart choice for edge computing experiments, potentially saving hours on data handling for solo developers.

In the evolving AI landscape, tools like kept could push more projects toward decentralized workflows, fostering innovation in secure, local-first applications.