Agent2Agent in Ecom
Google's new A2A protocol makes "have your agent call my agent" possible.
I've spent the last few months obsessing over how to make videos work for AI-powered ad creation. After building our Scene Prep Aigent at aigencia, which turns raw videos into structured semantic data, I'm now seeing how specialized AI agents could talk directly to each other using Google's new Agent2Agent (A2A) protocol.
From Product Image to Dynamic Ads with A2A
A powerful example of A2A's capabilities is transforming simple product images into engaging ads, both static and animated:
We start with a brand's basic product image. Our first specialized agent (Agent 1: GPT-Image-1) turns that product image into a high-quality static ad.
Through the Agent2Agent (A2A) protocol, the static ad image is passed seamlessly to a second specialized creative agent (Agent 2: Runway Gen4 Turbo).
Runway Gen4 Turbo then animates the ad image into a visually compelling video ad.
By automating the end-to-end creative process from product image to finalized ad creatives, A2A dramatically reduces turnaround time from days to mere minutes, enabling brands to rapidly scale high-impact ad campaigns.
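Under the hood, that hand-off is just a pair of task requests. Here's a minimal Python sketch of the chain; the agent endpoints, payload fields, and response shape are illustrative assumptions, not the official A2A SDK or either vendor's API:

```python
# Minimal sketch of chaining two hypothetical creative agents with A2A-style
# JSON-RPC calls. Endpoint URLs, payload fields, and response shapes are
# illustrative assumptions, not the official A2A SDK or vendor APIs.
import json
import uuid

import requests

IMAGE_AD_AGENT = "https://image-ad-agent.example.com/a2a"  # hypothetical: wraps GPT-Image-1
VIDEO_AD_AGENT = "https://video-ad-agent.example.com/a2a"  # hypothetical: wraps Runway Gen4 Turbo


def send_task(agent_url: str, payload: dict) -> dict:
    """POST a tasks/send request and return the JSON-RPC result."""
    request_body = {
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),
        "method": "tasks/send",
        "params": {
            "id": f"task-{uuid.uuid4().hex[:8]}",
            "message": {
                "role": "user",
                "parts": [{"type": "text", "text": json.dumps(payload)}],
            },
        },
    }
    response = requests.post(agent_url, json=request_body, timeout=300)
    response.raise_for_status()
    return response.json()["result"]


# Agent 1: product image -> static ad image.
static_ad = send_task(IMAGE_AD_AGENT, {"product_image_url": "https://example.com/product.jpg"})

# Agent 2: static ad image -> animated video ad (assumes Agent 1's result
# carries the generated ad's URL in an "ad_image_url" field).
video_ad = send_task(VIDEO_AD_AGENT, {"ad_image_url": static_ad["ad_image_url"]})
print(video_ad)
```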
Connecting Content Creation to Ad Making
Here's another example of what A2A enables in e-commerce: Agentic Digital Asset Management.
The problem is obvious to anyone managing digital assets for e-commerce: we're buried in grunt work.
E-commerce marketers currently spend countless hours copying video URLs into Google Sheets, uploading them to their digital asset management (DAM) system, manually tagging content, and praying someone will actually find and use those assets later.
This manual shuffling is tedious and completely breaks down at scale. When an influencer posts about your product, you want that content immediately available for ad creation. A2A might finally solve this.
Look at the S3 bucket for just one of our recent influencer videos and you'll see what our Scene Prep Aigent has automatically generated (a quick listing sketch follows this list):
- 5 individual scene MP4 files
- 24 extracted keyframes as JPG files
- A detailed analysis_results.json file with scene-by-scene descriptions
- A scenes.json file mapping all time codes and relationships
- A frames.json file documenting all extracted frames
- An audio transcript of what is being said in the video
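If you want to sanity-check what landed in the bucket, a few lines of boto3 will do it. The bucket name and key prefix below are placeholders for your own layout:

```python
# Quick sanity check of what the Scene Prep Aigent wrote for one video.
# Bucket name and prefix are placeholders for your own layout.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="assets", Prefix="videos/influencer-post-001/")
for obj in resp.get("Contents", []):
    print(f"{obj['Size']:>10}  {obj['Key']}")
# Expect keys like .../scenes/scene_001.mp4, .../frames/frame_0001.jpg,
# .../analysis_results.json, .../scenes.json, .../frames.json, and a transcript file.
```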
Each scene's analysis is incredibly detailed. For example, Scene 1's description includes:
"A person is prominently featured, likely engaging with the viewer. Her long, blonde hair is noticeable, and her facial expression alternates from neutral to engaged... A jar of cream is shown in the final frames, prominently displayed by the person. It has a blue lid and white body.
This structured data gets vectorized in Pinecone, making it semantically searchable. The question is: how does this powerful asset preparation connect to actual ad creation?
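Before answering that, here's roughly what the Pinecone step looks like: a minimal sketch that embeds each scene description and upserts it for semantic search. The index name, embedding model, and the structure of analysis_results.json are assumptions for illustration:

```python
# Minimal sketch: embed each scene description and upsert it into Pinecone so
# it's semantically searchable. The index name, embedding model, and the shape
# of analysis_results.json are illustrative assumptions.
import json

from openai import OpenAI
from pinecone import Pinecone

oai = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("brand-assets")  # hypothetical index name


def embed(text: str) -> list[float]:
    return oai.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding


with open("analysis_results.json") as f:
    analysis = json.load(f)  # assumed shape: {"scenes": [{"scene_number": 1, "description": "..."}]}

index.upsert(vectors=[
    {
        "id": f"video-001-scene-{scene['scene_number']}",
        "values": embed(scene["description"]),
        "metadata": {"video": "video-001", "scene": scene["scene_number"]},
    }
    for scene in analysis["scenes"]
])

# Later, any agent (or human) can search by concept rather than exact tags.
hits = index.query(
    vector=embed("close-up of a cream jar with a blue lid"),
    top_k=3,
    include_metadata=True,
)
```

Once indexed this way, a downstream agent can search for "jar with a blue lid" or "smiling to camera" instead of hoping someone typed the right tag.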
Enter A2A: The Missing Protocol
A2A creates a standardized way for specialized agents to find each other and work together without human handholding. Here's how it works between our Content Monitor Agent and Scene Detection Agent:
Step 1: Agent Discovery
When a new Instagram post appears, the Content Monitor Agent needs to find a specialized video processing service. It discovers our Scene Detection Agent through its Agent Card - essentially a capabilities menu:
{ "name": "Scene Detection Agent", "description": "Detects scenes in video content and provides detailed analysis", "skills": [ { "id": "detect-scenes", "description": "Detects scene changes in video content", "tags": ["video", "scenes", "detection"] } ] }
Step 2: Task Creation
The Content Monitor Agent uploads the video to S3 and delegates processing to the Scene Detection Agent:
{ "jsonrpc": "2.0", "method": "tasks/send", "params": { "id": "task-12345", "message": { "role": "user", "parts": [{ "type": "text", "text": "{"video_url": "https://example.com/video.mp4"}" }] }, "metadata": { "AWS_ACCESS_KEY": "[REDACTED]", "S3_BUCKET_NAME": "assets", "webhook_url": "https://example.com/webhook" } } }
Step 3: Real-Time Processing Updates
As the Scene Detection Agent processes the video, it sends progress notifications:
{ "jsonrpc": "2.0", "method": "task.update", "params": { "id": "task-12345", "status": { "state": "working", "message": { "parts": [{ "text": "Downloading video", "type": "text" }] } } } }
The agent sends updates at each stage - scene detection (finding 5 scenes), frame extraction (creating 24 keyframes), AI vision analysis (interpreting content), and vectorization.
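On the receiving side, the webhook_url from Step 2 just needs a small endpoint that logs these status updates. A minimal Flask sketch that mirrors the JSON-RPC shape shown in this post, not a reference A2A implementation:

```python
# Minimal Flask sketch of the webhook_url receiver from Step 2. It simply logs
# each status update; the payload handling mirrors the JSON-RPC shape shown in
# this post, not a reference A2A implementation.
from flask import Flask, request

app = Flask(__name__)


@app.route("/webhook", methods=["POST"])
def task_update():
    update = request.get_json()
    params = update.get("params", {})
    status = params.get("status", {})
    parts = status.get("message", {}).get("parts", [])
    text = " ".join(p.get("text", "") for p in parts)
    print(f"[{params.get('id')}] {status.get('state')}: {text}")
    return {"ok": True}


if __name__ == "__main__":
    app.run(port=8080)
```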
Step 4: Final Results
When complete, the Scene Detection Agent delivers structured results, including S3 locations of all generated assets:
{ "jsonrpc": "2.0", "method": "task.update", "params": { "id": "task-12345", "status": { "state": "completed", "message": { "parts": [{ "text": "Scene detection complete. Results available at: s3://assets/video/scenes.json", "type": "text" }] } }, "artifacts": [{ "name": "scene-detection-result", "parts": [{ "type": "text", "text": "{"video_url": "https://example.com/video.mp4", "scene_count": 3, "scenes": [...]}" }] }] } }
Now the Content Monitor Agent can immediately notify AdMaker Agents that new, fully analyzed creative assets are available, all without human intervention.
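That fan-out can be a short handler: when a "completed" update arrives, parse the artifact payload and delegate ad creation downstream. The AdMaker endpoint and the artifact field names are hypothetical:

```python
# Sketch of that hand-off: on a "completed" update, parse the artifact payload
# and delegate ad creation to a downstream agent. The AdMaker endpoint and the
# artifact field names are hypothetical.
import json

import requests

ADMAKER_AGENT = "https://admaker-agent.example.com/a2a"  # hypothetical


def on_task_completed(update: dict) -> None:
    params = update["params"]
    if params["status"]["state"] != "completed":
        return

    # The artifact's text part carries a JSON document (see the payload above).
    result = json.loads(params["artifacts"][0]["parts"][0]["text"])

    requests.post(ADMAKER_AGENT, json={
        "jsonrpc": "2.0",
        "method": "tasks/send",
        "params": {
            "id": f"{params['id']}-admaker",
            "message": {
                "role": "user",
                "parts": [{
                    "type": "text",
                    "text": json.dumps({
                        "video_url": result["video_url"],
                        "scene_count": result["scene_count"],
                    }),
                }],
            },
        },
    }, timeout=60)
```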
The Fundamental Shift
This is a fundamental restructuring of how digital assets flow through e-commerce systems. The contrast between traditional and A2A-enabled workflows is stark:
[Diagram: Traditional Manual Process]
[Diagram: Multi-Agent Process]
No spreadsheets. No emails. No manual tagging. No human shuffling files between systems.
Better Than Manual Processes
The natural question is: "Why not just have humans do this work? We've always manually copied files and updated spreadsheets."
Having built both systems, I can tell you precisely why A2A beats manual processes:
Manual vs A2A Process Comparison
| Traditional Manual Process | A2A-Enabled Process |
|---|---|
| **Time Investment** | **Time Investment** |
| Social media managers spend hours monitoring feeds | Content Monitor Agent continuously scans platforms 24/7 |
| Team members watch entire videos (may take 2-3x video length) | Scene Detection Agent processes videos in seconds to minutes |
| Manual tagging takes 10-15 minutes per minute of video | AI vision analysis performed instantly on extracted frames |
| Email/Slack coordination adds hours to days of delay | A2A communications happen in seconds |
| **Quality & Consistency** | **Quality & Consistency** |
| Limited tags (typically 10-20 per video) | Rich semantic descriptions (hundreds of data points per video) |
| Subjective tagging varies between team members | Consistent AI analysis using standardized criteria |
| Limited detail (generic tags like "product demo") | Detailed analysis ("blue lid with Tighten & Lift label") |
| No emotional or cinematic analysis | Captures mood, lighting, cinematography, emotional qualities |
| **Searchability** | **Searchability** |
| Keyword-only search | Semantic vector search |
| Must match exact tags | Can find conceptually similar content |
| No search for visual elements or qualities | Can search for specific visual attributes |
| Limited to pre-defined taxonomy | Open-ended natural language queries |
| **Asset Management** | **Asset Management** |
| Single video files | Automatically extracted scenes and keyframes |
| Manual screenshots if needed | Systematically extracted representative frames |
| No structured scene data | Complete scene breakdown with timecodes |
| File-based organization | Semantically structured database |
| **Resource Requirements** | **Resource Requirements** |
| 500 videos ≈ 150-200 human hours annually | 500 videos ≈ automated processing (minimal human oversight) |
| Scales linearly with content volume | Scales automatically with content volume |
| Limited by human availability | Operates 24/7 without fatigue |
| Knowledge lost when staff changes | Consistent system regardless of staffing |
In our example, the Scene Prep Aigent produced 24 keyframes, 5 scene files, and detailed semantic analysis for a single 30-second video. Humans typically extract 2-3 screenshots and write generic tags like "product demo."
The difference in asset quality is staggering. Our analysis captures precise details like "the product has a blue lid and white body with detailed text on the label" - details that would be lost in manual tagging.
Building Your Agentic DAM
Implementing a multi-agent DAM system isn't as complex as it seems. You need four kinds of specialized agents (a sketch of their Agent Cards follows this list):
- Content Monitoring: Watching creator platforms for brand mentions
- Asset Processing: Converting raw content into structured data (like our Scene Prep agent)
- Creative Production: Generating ad variations from processed assets
- Performance Optimization: Analyzing results and adjusting campaigns
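Here's the sketch promised above: a toy registry of Agent Cards for those four roles, plus a lookup by skill. All names, URLs, and skill IDs are placeholders:

```python
# Toy registry of Agent Cards for the four roles above, plus a lookup by skill.
# All names, URLs, and skill IDs are placeholders.
AGENT_CARDS = [
    {"name": "Content Monitor Agent", "url": "https://monitor.example.com",
     "skills": [{"id": "watch-mentions", "tags": ["social", "monitoring"]}]},
    {"name": "Scene Detection Agent", "url": "https://scene-agent.example.com",
     "skills": [{"id": "detect-scenes", "tags": ["video", "scenes", "detection"]}]},
    {"name": "AdMaker Agent", "url": "https://admaker.example.com",
     "skills": [{"id": "generate-ad-variations", "tags": ["creative", "ads"]}]},
    {"name": "Performance Agent", "url": "https://performance.example.com",
     "skills": [{"id": "optimize-campaigns", "tags": ["analytics", "optimization"]}]},
]


def find_agent(skill_id: str) -> dict | None:
    """Return the first card that advertises the requested skill."""
    for card in AGENT_CARDS:
        if any(skill["id"] == skill_id for skill in card["skills"]):
            return card
    return None


print(find_agent("detect-scenes")["name"])  # -> "Scene Detection Agent"
```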
Data Ownership Is Critical
As I emphasized in my Scene Prep article, "the brands that own and control their own semantic data will have an insurmountable advantage over those who've surrendered it to third-party platforms."
Our Scene Prep agent stores all processed assets and vector data in the brand's own infrastructure (S3 buckets and Pinecone databases). This gives brands complete control while enabling secure agent collaboration.
What's Next
The emerging Agent2Agent ecosystem will transform how e-commerce brands handle digital assets. Rather than building increasingly complex monolithic systems, we'll see networks of specialized agents that excel at specific tasks.
The most exciting applications combine multiple specialized capabilities:
- Creator content immediately transformed into product listings
- Video assets automatically adapted for different ad platforms
- Product details instantly updated across all marketing assets
Looking at the Scene Prep Aigent's output, with its perfectly structured scene files, frame extractions, and semantic analysis, we now have a far more effective DAM system.
A2A provides the final piece: the standard protocol that lets these specialized agents talk to each other without human intervention.
Saying "have your agent call my agent" is no longer a futuristic concept. It's the new e-commerce reality.