Agent2Agent in Ecom

Google's new A2A protocol makes "have your agent call my agent" possible.

I've spent the last few months obsessing over how to make videos work for AI-powered ad creation. After building our Scene Prep Aigent at aigencia that turns raw videos into structured semantic data, I'm seeing how specialized AI agents could talk directly to each other using Google's new Agent2Agent (A2A) protocol.

From Product Image to Dynamic Ads with A2A

A powerful example of A2A's capabilities is transforming simple product images into engaging ads, both static and animated:

  • We start with a brand's basic product image. Our first specialized agent (Agent 1: GPT-Image-1) turns it into a high-quality static ad.

  • Through the Agent2Agent (A2A) protocol, this static ad image is seamlessly passed to another specialized creative agent (Agent 2: Runway Gen4 Turbo).

  • Runway Gen4 Turbo then takes the ad image to create a visually compelling animated video ad.

By automating the end-to-end creative process from product image to finalized ad creatives, A2A dramatically reduces turnaround time from days to mere minutes, enabling brands to rapidly scale high-impact ad campaigns.
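To make that hand-off concrete, here's a minimal Python sketch of the two-agent chain using A2A-style tasks/send calls. The endpoint URLs, payload fields, and response shape are hypothetical illustrations, not aigencia's actual implementation:

import json
import uuid

import requests


def send_task(agent_url: str, payload: dict) -> dict:
    """Delegate work to an agent with an A2A-style tasks/send JSON-RPC call."""
    body = {
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),           # JSON-RPC request id
        "method": "tasks/send",
        "params": {
            "id": f"task-{uuid.uuid4()}",  # task id the agent reports progress against
            "message": {
                "role": "user",
                "parts": [{"type": "text", "text": json.dumps(payload)}],
            },
        },
    }
    return requests.post(agent_url, json=body, timeout=120).json()


# Hypothetical endpoints for the two creative agents.
STATIC_AD_AGENT = "https://agents.example.com/gpt-image-1/a2a"
VIDEO_AD_AGENT = "https://agents.example.com/runway-gen4-turbo/a2a"

# Agent 1: product image -> static ad.
static = send_task(STATIC_AD_AGENT, {"product_image_url": "https://example.com/product.jpg"})
static_ad_url = static["result"]["artifacts"][0]["parts"][0]["text"]  # shape depends on the agent

# Agent 2: static ad -> animated video ad.
video = send_task(VIDEO_AD_AGENT, {"image_url": static_ad_url})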

Connecting Content Creation to Ad Making

Here's another example of what A2A enables in e-commerce: Agentic Digital Asset Management.

The problem is obvious to anyone managing digital assets for e-commerce: we're buried in grunt work.

E-commerce marketers currently spend countless hours copying video URLs into Google Sheets, uploading them to their digital asset management (DAM) systems, manually tagging content, and praying someone will actually find and use those assets later.

This manual shuffling is tedious and completely breaks down at scale. When an influencer posts about your product, you want that content immediately available for ad creation. A2A might finally solve this.

Looking at the S3 bucket for just one of our recent influencer videos, our Scene Prep Aigent has automatically generated the following (see the sketch after this list):

  • 5 individual scene MP4 files

  • 24 extracted keyframes as JPG files

  • A detailed analysis_results.json file with scene-by-scene descriptions

  • A scenes.json file mapping all time codes and relationships

  • A frames.json file documenting all extracted frames

  • An audio transcript of what is being said in the video
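Here's that sketch: a minimal example using boto3 to list and load the generated files. The bucket name and key prefix are hypothetical placeholders, not our actual layout.

import json

import boto3

s3 = boto3.client("s3")
BUCKET = "assets"                        # hypothetical bucket name
PREFIX = "videos/influencer-clip-001/"   # hypothetical prefix for one processed video

# List everything the Scene Prep Aigent wrote for this video.
listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
for obj in listing.get("Contents", []):
    print(obj["Key"])

# Load the scene-by-scene analysis for downstream use.
body = s3.get_object(Bucket=BUCKET, Key=PREFIX + "analysis_results.json")["Body"]
analysis = json.loads(body.read())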

Each scene's analysis is incredibly detailed. For example, Scene 1's description includes:

"A person is prominently featured, likely engaging with the viewer. Her long, blonde hair is noticeable, and her facial expression alternates from neutral to engaged... A jar of cream is shown in the final frames, prominently displayed by the person. It has a blue lid and white body.

This structured data gets vectorized in Pinecone, making it semantically searchable. The question is: how does this powerful asset preparation connect to actual ad creation?
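To make the vectorization step concrete, here's a rough sketch using the OpenAI embeddings API and the Pinecone Python client. The index name, embedding model, IDs, and metadata fields are illustrative assumptions rather than our production configuration.

from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()                        # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")  # hypothetical credentials
index = pc.Index("scene-prep")                  # hypothetical index name


def embed(text: str) -> list[float]:
    """Turn a scene description into an embedding vector."""
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding


# Upsert one scene description with metadata pointing back to the S3 assets.
scene_text = "A person with long blonde hair presents a jar of cream with a blue lid and white body."
index.upsert(vectors=[{
    "id": "video-001-scene-1",
    "values": embed(scene_text),
    "metadata": {"video": "video-001", "scene": 1, "s3_key": "videos/video-001/scene_1.mp4"},
}])

# Later, an AdMaker Agent can search semantically instead of by exact tags.
hits = index.query(vector=embed("close-up of a skincare jar with a blue lid"), top_k=3, include_metadata=True)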

Enter A2A: The Missing Protocol

A2A creates a standardized way for specialized agents to find each other and work together without human handholding. Here's how it works between our Content Monitor Agent and Scene Detection Agent:

Step 1: Agent Discovery

When a new Instagram post appears, the Content Monitor Agent needs to find a specialized video processing service. It discovers our Scene Detection Agent through its Agent Card - essentially a capabilities menu:

{ "name": "Scene Detection Agent", "description": "Detects scenes in video content and provides detailed analysis", "skills": [ { "id": "detect-scenes", "description": "Detects scene changes in video content", "tags": ["video", "scenes", "detection"] } ] }

Step 2: Task Creation

The Content Monitor Agent uploads the video to S3 and delegates processing to the Scene Detection Agent:

{ "jsonrpc": "2.0", "method": "tasks/send", "params": { "id": "task-12345", "message": { "role": "user", "parts": [{ "type": "text", "text": "{"video_url": "https://example.com/video.mp4"}" }] }, "metadata": { "AWS_ACCESS_KEY": "[REDACTED]", "S3_BUCKET_NAME": "assets", "webhook_url": "https://example.com/webhook" } } }

Step 3: Real-Time Processing Updates

As the Scene Detection Agent processes the video, it sends progress notifications:

{ "jsonrpc": "2.0", "method": "task.update", "params": { "id": "task-12345", "status": { "state": "working", "message": { "parts": [{ "text": "Downloading video", "type": "text" }] } } } }

The agent sends updates at each stage - scene detection (finding 5 scenes), frame extraction (creating 24 keyframes), AI vision analysis (interpreting content), and vectorization.
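On the receiving end, the webhook_url supplied in Step 2 only needs a small handler that records each state change. A minimal Flask sketch (route, port, and logging are illustrative):

from flask import Flask, request, jsonify

app = Flask(__name__)


@app.route("/webhook", methods=["POST"])
def task_update():
    """Receive task.update notifications from the Scene Detection Agent."""
    update = request.get_json(force=True)
    params = update.get("params", {})
    status = params.get("status", {})
    parts = status.get("message", {}).get("parts", [])
    note = parts[0]["text"] if parts else ""
    # e.g. "task-12345 working: Downloading video"
    print(f"{params.get('id')} {status.get('state')}: {note}")
    return jsonify({"ok": True})


if __name__ == "__main__":
    app.run(port=8080)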

Step 4: Final Results

When complete, the Scene Detection Agent delivers structured results, including S3 locations of all generated assets:

{ "jsonrpc": "2.0", "method": "task.update", "params": { "id": "task-12345", "status": { "state": "completed", "message": { "parts": [{ "text": "Scene detection complete. Results available at: s3://assets/video/scenes.json", "type": "text" }] } }, "artifacts": [{ "name": "scene-detection-result", "parts": [{ "type": "text", "text": "{"video_url": "https://example.com/video.mp4", "scene_count": 3, "scenes": [...]}" }] }] } }

Now the Content Monitor Agent can immediately notify AdMaker Agents that new, fully analyzed creative assets are available, all without human intervention.
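A sketch of that hand-off: check for the completed state, parse the artifact payload, and delegate a follow-up task to an AdMaker Agent. The AdMaker endpoint and payload fields are assumptions for illustration:

import json

import requests

ADMAKER_AGENT_URL = "https://admaker-agent.example.com/a2a"  # hypothetical endpoint


def on_task_completed(update: dict) -> None:
    """Forward newly analyzed assets to an AdMaker Agent once a task completes."""
    params = update["params"]
    if params["status"]["state"] != "completed":
        return
    # The artifact's text part is itself a JSON string (see the escaped payload above).
    result = json.loads(params["artifacts"][0]["parts"][0]["text"])
    notification = {
        "jsonrpc": "2.0",
        "id": "req-2",
        "method": "tasks/send",
        "params": {
            "id": f"ad-task-for-{params['id']}",
            "message": {
                "role": "user",
                "parts": [{"type": "text", "text": json.dumps({
                    "video_url": result["video_url"],
                    "scenes_manifest": "s3://assets/video/scenes.json",
                })}],
            },
        },
    }
    requests.post(ADMAKER_AGENT_URL, json=notification, timeout=30)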

The Fundamental Shift

This is a fundamental restructuring of how digital assets flow through e-commerce systems. The contrast between traditional and A2A-enabled workflows is stark:

[Workflow diagrams: Traditional Manual Process vs. Multi-Agent Process]

No spreadsheets. No emails. No manual tagging. No human shuffling files between systems.

Better Than Manual Processes

The natural question is: "Why not just have humans do this work? We've always manually copied files and updated spreadsheets."

Having built both systems, I can tell you precisely why A2A beats manual processes:

Manual vs A2A Process Comparison

| Category | Traditional Manual Process | A2A-Enabled Process |
|---|---|---|
| Time Investment | Social media managers spend hours monitoring feeds | Content Monitor Agent continuously scans platforms 24/7 |
| | Team members watch entire videos (may take 2-3x video length) | Scene Detection Agent processes videos in seconds to minutes |
| | Manual tagging takes 10-15 minutes per minute of video | AI vision analysis performed instantly on extracted frames |
| | Email/Slack coordination adds hours to days of delay | A2A communications happen in seconds |
| Quality & Consistency | Limited tags (typically 10-20 per video) | Rich semantic descriptions (hundreds of data points per video) |
| | Subjective tagging varies between team members | Consistent AI analysis using standardized criteria |
| | Limited detail (generic tags like "product demo") | Detailed analysis ("blue lid with Tighten & Lift label") |
| | No emotional or cinematic analysis | Captures mood, lighting, cinematography, emotional qualities |
| Searchability | Keyword-only search | Semantic vector search |
| | Must match exact tags | Can find conceptually similar content |
| | No search for visual elements or qualities | Can search for specific visual attributes |
| | Limited to pre-defined taxonomy | Open-ended natural language queries |
| Asset Management | Single video files | Automatically extracted scenes and keyframes |
| | Manual screenshots if needed | Systematically extracted representative frames |
| | No structured scene data | Complete scene breakdown with timecodes |
| | File-based organization | Semantically structured database |
| Resource Requirements | 500 videos ≈ 150-200 human hours annually | 500 videos ≈ automated processing (minimal human oversight) |
| | Scales linearly with content volume | Scales automatically with content volume |
| | Limited by human availability | Operates 24/7 without fatigue |
| | Knowledge lost when staff changes | Consistent system regardless of staffing |

In our example, the Scene Prep Aigent produced 24 frames, 5 scene files, and detailed semantic analysis - for a single 30-second video. Humans typically extract 2-3 screenshots and write generic tags like "product demo."

The difference in asset quality is staggering. Our analysis captures precise details like "the product has a blue lid and white body with detailed text on the label" - details that would be lost in manual tagging.

Building Your Agentic DAM

Implementing a multi-agent DAM system isn't as complex as it seems. It comes down to four specialized agent roles (a minimal wiring sketch follows the list):

  1. Content Monitoring: Watching creator platforms for brand mentions

  2. Asset Processing: Converting raw content into structured data (like our Scene Prep agent)

  3. Creative Production: Generating ad variations from processed assets

  4. Performance Optimization: Analyzing results and adjusting campaigns
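Here's that wiring sketch: one simple approach is a registry that maps each role to an agent base URL and resolves its Agent Card on demand. The hosts and role names are hypothetical:

import requests

# Hypothetical registry: role -> base URL of the agent that fills it.
AGENT_REGISTRY = {
    "content_monitoring": "https://monitor-agent.example.com",
    "asset_processing": "https://scene-agent.example.com",
    "creative_production": "https://admaker-agent.example.com",
    "performance_optimization": "https://optimizer-agent.example.com",
}


def resolve_agent(role: str) -> dict:
    """Fetch the Agent Card for a given role so the caller can inspect its skills."""
    base_url = AGENT_REGISTRY[role]
    card = requests.get(f"{base_url}/.well-known/agent.json", timeout=10).json()
    return {"base_url": base_url, "card": card}


# Example: the Content Monitor Agent looks up where to send a new video.
processor = resolve_agent("asset_processing")
print(processor["card"]["name"], "advertises", [s["id"] for s in processor["card"]["skills"]])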

Data Ownership Is Critical

As I emphasized in my Scene Prep article, "the brands that own and control their own semantic data will have an insurmountable advantage over those who've surrendered it to third-party platforms."

Our Scene Prep agent stores all processed assets and vector data in the brand's own infrastructure (S3 buckets and Pinecone databases). This gives brands complete control while enabling secure agent collaboration.
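In configuration terms, "brand-owned" simply means every agent in the pipeline is pointed at storage the brand controls. A tiny sketch, with placeholder names:

from dataclasses import dataclass


@dataclass
class BrandInfra:
    """Pointers to infrastructure the brand owns and controls."""
    s3_bucket: str       # brand-owned asset bucket
    pinecone_index: str  # brand-owned vector index
    aws_region: str


# Hypothetical example: every agent writes here, and nowhere else.
acme_infra = BrandInfra(
    s3_bucket="acme-brand-assets",
    pinecone_index="acme-scene-prep",
    aws_region="us-east-1",
)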

What's Next

The emerging Agent2Agent ecosystem will transform how e-commerce brands handle digital assets. Rather than building increasingly complex monolithic systems, we'll see networks of specialized agents that excel at specific tasks.

The most exciting applications combine multiple specialized capabilities:

  • Creator content immediately transformed into product listings

  • Video assets automatically adapted for different ad platforms

  • Product details instantly updated across all marketing assets

Looking at the Scene Prep Aigent output with its perfectly structured scene files, frame extractions, and semantic analysis, we now have a far more effective DAM system.

A2A provides the final piece: the standard protocol that lets these specialized agents talk to each other without human intervention.

Saying "have your agent call my agent" is no longer a futuristic concept. It's the new e-commerce reality.

Want access to aigencia’s multi-agent DAM system? DM Bora on LinkedIn.