Agent2Agent in Ecom

Google's new A2A protocol makes "have your agent call my agent" possible.

I've spent the last few months obsessing over how to make videos work for AI-powered ad creation. After building our Scene Prep Aigent at aigencia that turns raw videos into structured semantic data, I'm seeing how specialized AI agents could talk directly to each other using Google's new Agent2Agent (A2A) protocol.

From Product Image to Dynamic Ads with A2A

A powerful example of A2A's capabilities is transforming simple product images into engaging ads, both static and animated:

  • We start with a brand's basic product image. Our first specialized agent (Agent 1: GPT-Image-1) turns it into a high-quality static ad.

  • Through the Agent2Agent (A2A) protocol, this static ad image is seamlessly passed to another specialized creative agent (Agent 2: Runway Gen4 Turbo).

  • Runway Gen4 Turbo then takes the ad image to create a visually compelling animated video ad.

By automating the end-to-end creative process from product image to finalized ad creatives, A2A dramatically reduces turnaround time from days to mere minutes, enabling brands to rapidly scale high-impact ad campaigns.
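To make that hand-off concrete, here's a minimal Python sketch of the two-agent chain using A2A-style tasks/send calls. The endpoint URLs, payload fields, and response shape are hypothetical illustrations, not aigencia's actual implementation:

import json
import uuid

import requests


def send_task(agent_url: str, payload: dict) -> dict:
    """Delegate work to an agent with an A2A-style tasks/send JSON-RPC call."""
    body = {
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),           # JSON-RPC request id
        "method": "tasks/send",
        "params": {
            "id": f"task-{uuid.uuid4()}",  # task id the agent reports progress against
            "message": {
                "role": "user",
                "parts": [{"type": "text", "text": json.dumps(payload)}],
            },
        },
    }
    return requests.post(agent_url, json=body, timeout=120).json()


# Hypothetical endpoints for the two creative agents.
STATIC_AD_AGENT = "https://agents.example.com/gpt-image-1/a2a"
VIDEO_AD_AGENT = "https://agents.example.com/runway-gen4-turbo/a2a"

# Agent 1: product image -> static ad.
static = send_task(STATIC_AD_AGENT, {"product_image_url": "https://example.com/product.jpg"})
static_ad_url = static["result"]["artifacts"][0]["parts"][0]["text"]  # shape depends on the agent

# Agent 2: static ad -> animated video ad.
video = send_task(VIDEO_AD_AGENT, {"image_url": static_ad_url})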

Connecting Content Creation to Ad Making

Here's another example of what A2A enables in e-commerce: Agentic Digital Asset Management.

The problem is obvious to anyone managing digital assets for e-commerce: we're buried in grunt work.

E-commerce marketers currently spend countless hours copying video URLs into Google Sheets, uploading them to their digital asset management (DAM) systems, manually tagging content, and praying someone will actually find and use those assets later.

This manual shuffling is tedious and completely breaks down at scale. When an influencer posts about your product, you want that content immediately available for ad creation. A2A might finally solve this.

Looking at the S3 bucket for just one of our recent influencer videos, our Scene Prep Aigent has automatically generated the following (see the sketch after this list):

  • 5 individual scene MP4 files

  • 24 extracted keyframes as JPG files

  • A detailed analysis_results.json file with scene-by-scene descriptions

  • A scenes.json file mapping all time codes and relationships

  • A frames.json file documenting all extracted frames

  • An audio transcript of what is being said in the video
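Here's that sketch: a minimal example using boto3 to list and load the generated files. The bucket name and key prefix are hypothetical placeholders, not our actual layout.

import json

import boto3

s3 = boto3.client("s3")
BUCKET = "assets"                        # hypothetical bucket name
PREFIX = "videos/influencer-clip-001/"   # hypothetical prefix for one processed video

# List everything the Scene Prep Aigent wrote for this video.
listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
for obj in listing.get("Contents", []):
    print(obj["Key"])

# Load the scene-by-scene analysis for downstream use.
body = s3.get_object(Bucket=BUCKET, Key=PREFIX + "analysis_results.json")["Body"]
analysis = json.loads(body.read())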

Each scene's analysis is incredibly detailed. For example, Scene 1's description includes:

"A person is prominently featured, likely engaging with the viewer. Her long, blonde hair is noticeable, and her facial expression alternates from neutral to engaged... A jar of cream is shown in the final frames, prominently displayed by the person. It has a blue lid and white body.

This structured data gets vectorized in Pinecone, making it semantically searchable. The question is: how does this powerful asset preparation connect to actual ad creation?
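To make the vectorization step concrete, here's a rough sketch using the OpenAI embeddings API and the Pinecone Python client. The index name, embedding model, IDs, and metadata fields are illustrative assumptions rather than our production configuration.

from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()                        # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")  # hypothetical credentials
index = pc.Index("scene-prep")                  # hypothetical index name


def embed(text: str) -> list[float]:
    """Turn a scene description into an embedding vector."""
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding


# Upsert one scene description with metadata pointing back to the S3 assets.
scene_text = "A person with long blonde hair presents a jar of cream with a blue lid and white body."
index.upsert(vectors=[{
    "id": "video-001-scene-1",
    "values": embed(scene_text),
    "metadata": {"video": "video-001", "scene": 1, "s3_key": "videos/video-001/scene_1.mp4"},
}])

# Later, an AdMaker Agent can search semantically instead of by exact tags.
hits = index.query(vector=embed("close-up of a skincare jar with a blue lid"), top_k=3, include_metadata=True)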

Enter A2A: The Missing Protocol

A2A creates a standardized way for specialized agents to find each other and work together without human handholding. Here's how it works between our Content Monitor Agent and Scene Detection Agent:

Step 1: Agent Discovery

When a new Instagram post appears, the Content Monitor Agent needs to find a specialized video processing service. It discovers our Scene Detection Agent through its Agent Card - essentially a capabilities menu:

{ "name": "Scene Detection Agent", "description": "Detects scenes in video content and provides detailed analysis", "skills": [ { "id": "detect-scenes", "description": "Detects scene changes in video content", "tags": ["video", "scenes", "detection"] } ] }

Step 2: Task Creation

The Content Monitor Agent uploads the video to S3 and delegates processing to the Scene Detection Agent:

{ "jsonrpc": "2.0", "method": "tasks/send", "params": { "id": "task-12345", "message": { "role": "user", "parts": [{ "type": "text", "text": "{"video_url": "https://example.com/video.mp4"}" }] }, "metadata": { "AWS_ACCESS_KEY": "[REDACTED]", "S3_BUCKET_NAME": "assets", "webhook_url": "https://example.com/webhook" } } }

Step 3: Real-Time Processing Updates

As the Scene Detection Agent processes the video, it sends progress notifications:

{ "jsonrpc": "2.0", "method": "task.update", "params": { "id": "task-12345", "status": { "state": "working", "message": { "parts": [{ "text": "Downloading video", "type": "text" }] } } } }

The agent sends updates at each stage - scene detection (finding 5 scenes), frame extraction (creating 24 keyframes), AI vision analysis (interpreting content), and vectorization.
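On the receiving end, the webhook_url supplied in Step 2 only needs a small handler that records each state change. A minimal Flask sketch (route, port, and logging are illustrative):

from flask import Flask, request, jsonify

app = Flask(__name__)


@app.route("/webhook", methods=["POST"])
def task_update():
    """Receive task.update notifications from the Scene Detection Agent."""
    update = request.get_json(force=True)
    params = update.get("params", {})
    status = params.get("status", {})
    parts = status.get("message", {}).get("parts", [])
    note = parts[0]["text"] if parts else ""
    # e.g. "task-12345 working: Downloading video"
    print(f"{params.get('id')} {status.get('state')}: {note}")
    return jsonify({"ok": True})


if __name__ == "__main__":
    app.run(port=8080)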

Step 4: Final Results

When complete, the Scene Detection Agent delivers structured results, including S3 locations of all generated assets:

{ "jsonrpc": "2.0", "method": "task.update", "params": { "id": "task-12345", "status": { "state": "completed", "message": { "parts": [{ "text": "Scene detection complete. Results available at: s3://assets/video/scenes.json", "type": "text" }] } }, "artifacts": [{ "name": "scene-detection-result", "parts": [{ "type": "text", "text": "{"video_url": "https://example.com/video.mp4", "scene_count": 3, "scenes": [...]}" }] }] } }

Now the Content Monitor Agent can immediately notify AdMaker Agents that new, fully analyzed creative assets are available, all without human intervention.
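A sketch of that hand-off: check for the completed state, parse the artifact payload, and delegate a follow-up task to an AdMaker Agent. The AdMaker endpoint and payload fields are assumptions for illustration:

import json

import requests

ADMAKER_AGENT_URL = "https://admaker-agent.example.com/a2a"  # hypothetical endpoint


def on_task_completed(update: dict) -> None:
    """Forward newly analyzed assets to an AdMaker Agent once a task completes."""
    params = update["params"]
    if params["status"]["state"] != "completed":
        return
    # The artifact's text part is itself a JSON string (see the escaped payload above).
    result = json.loads(params["artifacts"][0]["parts"][0]["text"])
    notification = {
        "jsonrpc": "2.0",
        "id": "req-2",
        "method": "tasks/send",
        "params": {
            "id": f"ad-task-for-{params['id']}",
            "message": {
                "role": "user",
                "parts": [{"type": "text", "text": json.dumps({
                    "video_url": result["video_url"],
                    "scenes_manifest": "s3://assets/video/scenes.json",
                })}],
            },
        },
    }
    requests.post(ADMAKER_AGENT_URL, json=notification, timeout=30)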

The Fundamental Shift

This is a fundamental restructuring of how digital assets flow through e-commerce systems. The contrast between traditional and A2A-enabled workflows is stark:

[Workflow diagrams: Traditional Manual Process vs. Multi-Agent Process]

No spreadsheets. No emails. No manual tagging. No human shuffling files between systems.

Better Than Manual Processes

The natural question is: "Why not just have humans do this work? We've always manually copied files and updated spreadsheets."

Having built both systems, I can tell you precisely why A2A beats manual processes:

Manual vs A2A Process Comparison

| Category | Traditional Manual Process | A2A-Enabled Process |
|---|---|---|
| Time Investment | Social media managers spend hours monitoring feeds | Content Monitor Agent continuously scans platforms 24/7 |
| | Team members watch entire videos (may take 2-3x video length) | Scene Detection Agent processes videos in seconds to minutes |
| | Manual tagging takes 10-15 minutes per minute of video | AI vision analysis performed instantly on extracted frames |
| | Email/Slack coordination adds hours to days of delay | A2A communications happen in seconds |
| Quality & Consistency | Limited tags (typically 10-20 per video) | Rich semantic descriptions (hundreds of data points per video) |
| | Subjective tagging varies between team members | Consistent AI analysis using standardized criteria |
| | Limited detail (generic tags like "product demo") | Detailed analysis ("blue lid with Tighten & Lift label") |
| | No emotional or cinematic analysis | Captures mood, lighting, cinematography, emotional qualities |
| Searchability | Keyword-only search | Semantic vector search |
| | Must match exact tags | Can find conceptually similar content |
| | No search for visual elements or qualities | Can search for specific visual attributes |
| | Limited to pre-defined taxonomy | Open-ended natural language queries |
| Asset Management | Single video files | Automatically extracted scenes and keyframes |
| | Manual screenshots if needed | Systematically extracted representative frames |
| | No structured scene data | Complete scene breakdown with timecodes |
| | File-based organization | Semantically structured database |
| Resource Requirements | 500 videos ≈ 150-200 human hours annually | 500 videos ≈ automated processing (minimal human oversight) |
| | Scales linearly with content volume | Scales automatically with content volume |
| | Limited by human availability | Operates 24/7 without fatigue |
| | Knowledge lost when staff changes | Consistent system regardless of staffing |

In our example, the Scene Prep Aigent produced 24 frames, 5 scene files, and detailed semantic analysis - for a single 30-second video. Humans typically extract 2-3 screenshots and write generic tags like "product demo."

The difference in asset quality is staggering. Our analysis captures precise details like "the product has a blue lid and white body with detailed text on the label" - details that would be lost in manual tagging.

Building Your Agentic DAM

Implementing a multi-agent DAM system isn't as complex as it seems. It comes down to four specialized agent roles (a minimal wiring sketch follows the list):

  1. Content Monitoring: Watching creator platforms for brand mentions

  2. Asset Processing: Converting raw content into structured data (like our Scene Prep agent)

  3. Creative Production: Generating ad variations from processed assets

  4. Performance Optimization: Analyzing results and adjusting campaigns
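Here's that wiring sketch: one simple approach is a registry that maps each role to an agent base URL and resolves its Agent Card on demand. The hosts and role names are hypothetical:

import requests

# Hypothetical registry: role -> base URL of the agent that fills it.
AGENT_REGISTRY = {
    "content_monitoring": "https://monitor-agent.example.com",
    "asset_processing": "https://scene-agent.example.com",
    "creative_production": "https://admaker-agent.example.com",
    "performance_optimization": "https://optimizer-agent.example.com",
}


def resolve_agent(role: str) -> dict:
    """Fetch the Agent Card for a given role so the caller can inspect its skills."""
    base_url = AGENT_REGISTRY[role]
    card = requests.get(f"{base_url}/.well-known/agent.json", timeout=10).json()
    return {"base_url": base_url, "card": card}


# Example: the Content Monitor Agent looks up where to send a new video.
processor = resolve_agent("asset_processing")
print(processor["card"]["name"], "advertises", [s["id"] for s in processor["card"]["skills"]])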

Data Ownership Is Critical

As I emphasized in my Scene Prep article, "the brands that own and control their own semantic data will have an insurmountable advantage over those who've surrendered it to third-party platforms."

Our Scene Prep agent stores all processed assets and vector data in the brand's own infrastructure (S3 buckets and Pinecone databases). This gives brands complete control while enabling secure agent collaboration.
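In configuration terms, "brand-owned" simply means every agent in the pipeline is pointed at storage the brand controls. A tiny sketch, with placeholder names:

from dataclasses import dataclass


@dataclass
class BrandInfra:
    """Pointers to infrastructure the brand owns and controls."""
    s3_bucket: str       # brand-owned asset bucket
    pinecone_index: str  # brand-owned vector index
    aws_region: str


# Hypothetical example: every agent writes here, and nowhere else.
acme_infra = BrandInfra(
    s3_bucket="acme-brand-assets",
    pinecone_index="acme-scene-prep",
    aws_region="us-east-1",
)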

What's Next

The emerging Agent2Agent ecosystem will transform how e-commerce brands handle digital assets. Rather than building increasingly complex monolithic systems, we'll see networks of specialized agents that excel at specific tasks.

The most exciting applications combine multiple specialized capabilities:

  • Creator content immediately transformed into product listings

  • Video assets automatically adapted for different ad platforms

  • Product details instantly updated across all marketing assets

Looking at the Scene Prep Aigent output with its perfectly structured scene files, frame extractions, and semantic analysis, we now have a far more effective DAM system.

A2A provides the final piece: the standard protocol that lets these specialized agents talk to each other without human intervention.

Saying "have your agent call my agent" is no longer a futuristic concept. It's the new e-commerce reality.

Want access to aigencia’s multi-agent DAM system? DM Bora on LinkedIn.