The architecture of AI information systems is evolving rapidly, and we’re witnessing a critical shift that will reshape how websites operate. After a few weeks of analysis and experimentation, I’ve identified three distinct architectural paradigms that solve progressively more complex problems, and the implications for publishers, e-commerce managers and website owners are significant.
The Three-Phase Evolution: From Static Knowledge to Dynamic Reasoning
Phase 1: Foundational RAG (Retrieval-Augmented Generation)
The first phase tackled what I call the LLM’s “static knowledge problem.” By linking models to external vector databases—effectively extending their memory—RAG reduced hallucinations and kept answers current. A Web Index from providers like Bing or Google became essential, allowing models to draw from broader internet snapshots. Yet limitations persisted: RAG couldn’t query live systems, handle temporal questions effectively, or deliver precise results for complex, multi-constraint requests (e.g., “All horror movies filmed in Italy in 2023” or “The best Montepulciano d’Abruzzo wines from 2021 under €25”).
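To make the Phase 1 pattern concrete, here is a minimal sketch of a RAG loop: embed the query, retrieve the most similar passages from a small in-memory store, and prompt the model with them. The corpus, the toy embedding, and the `generate()` stub are illustrative stand-ins, not any specific vector database or model API.

```python
# Minimal Phase-1 RAG loop: retrieve top-k passages, then prompt with them.
# The corpus, embedder, and generate() stub are illustrative stand-ins,
# not a specific vector database or model API.
from math import sqrt

CORPUS = {
    "doc1": "Montepulciano d'Abruzzo is a red wine from the Abruzzo region.",
    "doc2": "Tiramisu is an Italian dessert made with mascarpone and espresso.",
}

def embed(text: str) -> list[float]:
    # Toy bag-of-characters embedding; a real system would call an embedding model.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank stored passages by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(CORPUS.items(), key=lambda kv: cosine(q, embed(kv[1])), reverse=True)
    return [text for _, text in ranked[:k]]

def generate(prompt: str) -> str:
    # Stand-in for an LLM call.
    return f"[LLM would answer grounded in]\n{prompt}"

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)

print(answer("What is tiramisu made of?"))
```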
Phase 2: Agentic Retrieval
The second phase solved the “dynamic knowledge problem” through a sophisticated two-step process revealed by my analysis of frontier models like GPT-5:
- A search action returns snippets rich in pre-digested metadata: authors and dates (arXiv), release versions (GitHub), event details, recipe yields.
- A metadata-based decision determines which URLs to open for deeper reading.
This represents a shift from “prompting with data” to “prompting with a reference to data.”
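Here is a minimal sketch of that two-step loop. The `search()` and `open_url()` functions are illustrative stand-ins for a model's search and browse tools, and the snippet fields simply mirror the kind of metadata observed in testing.

```python
# Sketch of the two-step agentic retrieval loop described above.
# search() and open_url() are stand-ins for a model's search/browse tools;
# the snippet fields mirror the metadata observed in testing.
from dataclasses import dataclass

@dataclass
class Snippet:
    url: str
    title: str
    metadata: dict  # pre-digested fields: author, date_published, recipe_yield, ...

def search(query: str) -> list[Snippet]:
    # Step 1: the search action returns metadata-rich snippets, not full pages.
    return [
        Snippet("https://example.com/tiramisu", "Classic Tiramisu",
                {"author": "Giada De Laurentiis", "date_published": "2006-03-31",
                 "recipe_yield": "8 servings"}),
        Snippet("https://example.com/quick-tiramisu", "Quick Tiramisu",
                {"author": None, "date_published": None, "recipe_yield": None}),
    ]

def choose_urls(snippets: list[Snippet], limit: int = 1) -> list[str]:
    # Step 2: decide which URLs to open based on metadata alone
    # (here: prefer results with more populated metadata fields).
    scored = sorted(snippets,
                    key=lambda s: sum(v is not None for v in s.metadata.values()),
                    reverse=True)
    return [s.url for s in scored[:limit]]

def open_url(url: str) -> str:
    # Stand-in for the "open page" tool that returns a synthesized representation.
    return f"(synthesized content of {url})"

for url in choose_urls(search("tiramisu recipe")):
    print(open_url(url))
```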
Phase 3: Multi-Agent Systems
The current frontier tackles the “complexity problem”—queries requiring multi-hop reasoning across heterogeneous sources. Architectures like Baidu’s TURA framework use a “Planner” agent to decompose tasks into a DAG (Directed Acyclic Graph), executed by specialized agent teams. This enables parallel, collaborative problem-solving that mirrors human research methodologies.
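As an illustration of the pattern (not TURA's actual code), the sketch below shows a planner decomposing a multi-constraint query into a DAG of subtasks; independent nodes could run in parallel, while dependent nodes wait for their predecessors.

```python
# Illustrative planner/DAG decomposition: a complex query becomes a graph of
# subtasks handled by specialized agents (stubbed here). Not TURA's actual code.
from graphlib import TopologicalSorter

def planner(query: str) -> dict[str, set[str]]:
    # Hand-written decomposition for illustration; a real planner would be an LLM call.
    return {
        "find_horror_movies": set(),
        "filter_filmed_in_italy": {"find_horror_movies"},
        "filter_year_2023": {"find_horror_movies"},
        "merge_and_rank": {"filter_filmed_in_italy", "filter_year_2023"},
    }

AGENTS = {  # each node is handled by a specialized agent (stubbed here)
    "find_horror_movies": lambda: "candidate list",
    "filter_filmed_in_italy": lambda: "filtered by location",
    "filter_year_2023": lambda: "filtered by year",
    "merge_and_rank": lambda: "final answer",
}

dag = planner("All horror movies filmed in Italy in 2023")
for task in TopologicalSorter(dag).static_order():
    print(task, "->", AGENTS[task]())
```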

Behind the Curtain: How Modern AI Retrieves Information
My testing of GPT-5’s web search capabilities (as well as Dan Petrovic’s testing of Gemini’s search tools) reveals sophisticated metadata extraction that goes far beyond text scraping.
Testing Recipe Content: When I queried for “tiramisu recipe,” GPT-5’s search tool returned rich metadata directly in snippets:
- Author names and publication dates
- Recipe yields and preparation times
- Ingredient lists and instruction previews
- Source credibility indicators
Cross-Content Analysis: Testing across different content types revealed systematic metadata extraction:
Content Type | Metadata Surfaced | Example |
---|---|---|
Scientific Papers | Authors, dates, abstracts, citation counts | arXiv papers with full author lists and submission dates
GitHub Repositories | Release versions, feature highlights, install commands | “v1.5.0 features” and “pip install” snippets
Apps | Ratings, download counts, developer info | “3.9 stars, 50M+ downloads, Niantic Inc.”
Government Data | Publishers, file formats, update dates, licenses | “Updated: Aug 2025, Format: JSON/Excel, Publisher: Bureau of Labor Statistics”
The Key Insight: In a separate test on TripAdvisor, using OpenAI’s GPT-OSS-120B, the model identified a schema:Restaurant entity with nested properties, ratings, and reviews—clear evidence that retrieval systems surface structured metadata for AI use.
But let’s be precise: the LLM doesn’t access structured data or raw HTML directly; it receives a sanitized snippet from the retrieval layer and, if it “opens” a page, a synthesized representation rather than the full source.
Here is the metadata observed by GPT-5 when the web.search tool is invoked on a recipe website (a hypothetical payload sketch follows the table).
Metadata Field | Example in Snippet |
---|---|
Author | Giada De Laurentiis, Rick Rodgers |
Date Published/Updated | March 31 2006, December 6 2023 |
Recipe Yield | “Makes 8 servings”, “4 Servings” |
Ingredients Mention | Yes — partial lists or key items |
Descriptive Summary | Quick ingredient notes or style variations |
Tags/Keywords | Often footnotes of recipe categories |
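For concreteness, this is one way such a snippet payload could be represented; the field names are hypothetical, inferred from the table above rather than taken from any documented API.

```python
# Hypothetical representation of a metadata-rich recipe snippet as a search tool
# might return it; field names are inferred from the table above, not a documented API.
snippet = {
    "url": "https://example.com/classic-tiramisu",
    "title": "Classic Tiramisu",
    "author": "Giada De Laurentiis",
    "date_published": "2006-03-31",
    "date_modified": "2023-12-06",
    "recipe_yield": "Makes 8 servings",
    "ingredients_preview": ["mascarpone", "espresso", "ladyfingers"],
    "summary": "Quick ingredient notes or style variations",
    "tags": ["dessert", "italian", "no-bake"],
}
```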
Search Engine Routing: The testing revealed that different queries trigger different underlying search engines:
- Google-style indicators: “People also ask” phrasing, arXiv citation counts, detailed research metadata, dataset licensing information
- Bing-style indicators: Aggressive date formatting, rich inline author names, GitHub release tags, “Top 10” listicle formats
This aligns with Aleyda Solis’s research showing ChatGPT’s reliance on Google SERP snippets, though the routing appears more nuanced than single-provider dependency.
Why Structured Data Is Now Critical
My experiments with GPT-OSS-120B and GPT-5 confirm a fundamental shift: AI models are moving from processing text to interpreting structured data. When I queried for “Gluten-Free Pizza in Trastevere,” the model synthesized a comprehensive knowledge panel with structured tables and verifiable source provenance rather than returning simple links.
The model processes a page’s explicit knowledge graph, not just its unstructured text.
This leads to two strategic imperatives:
- Entities over Keywords: AI retrieves “things” (entities with attributes), not “strings” (keywords). Success depends on providing machine-readable data that clearly describes these entities.
- Structured Data as a Grounding Protocol: Schema.org in JSON-LD is no longer just for Google’s rich snippets—it’s the primary protocol for providing factual, verifiable grounding to LLMs and AI agents.
Practical takeaway for publishers:
The metadata visible in search snippets—author names, publication dates, ratings, prices—comes directly from your structured data. Sites with comprehensive schema markup appear accurately in AI responses; those without risk being misunderstood or ignored entirely.
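Those snippet fields map directly onto Schema.org Recipe properties. Below is a minimal sketch of the JSON-LD a publisher might emit for the recipe above; the property names follow Schema.org, while the values are illustrative.

```python
# Minimal sketch of Schema.org Recipe markup in JSON-LD, built in Python and
# printed as the <script> block a publisher would place in the page <head>.
# Property names follow Schema.org; the values are illustrative.
import json

recipe = {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Classic Tiramisu",
    "author": {"@type": "Person", "name": "Giada De Laurentiis"},
    "datePublished": "2006-03-31",
    "dateModified": "2023-12-06",
    "recipeYield": "8 servings",
    "recipeIngredient": ["mascarpone", "espresso", "ladyfingers", "cocoa powder"],
    "keywords": "dessert, italian, no-bake",
}

print('<script type="application/ld+json">')
print(json.dumps(recipe, indent=2))
print("</script>")
```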
Building Agent-Ready Websites
The economic data tells the story: In Q1 2025, AI bot traffic across the TollBit network (a monetization provider for AI traffic) nearly doubled (+87%), with RAG bot scrapes rising 49%. Yet AI apps accounted for just 0.04% of external referral traffic versus Google’s 85%.
An agent-ready website transitions from a passive document repository to an active, queryable knowledge source, offering specific tools for AI agents (a rough endpoint sketch follows this list):
- Entity Search Endpoints: Allow agents to perform disambiguated lookups using unique entity IDs
- Semantic Content Search: Enable faceted searches based on underlying entities and topics
- Relationship Extraction: Permit agents to query connections between entities
- GS1 Digital Link Resolvers: Essential for e-commerce, providing real-time product data
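As a rough sketch of the first two capabilities, the endpoint below uses FastAPI (an assumption for illustration, not WordLift's implementation); the routes, parameters, and in-memory entities are hypothetical.

```python
# Minimal sketch of entity search endpoints using FastAPI (an assumption, not
# WordLift's implementation); routes, parameters, and data are illustrative.
from fastapi import FastAPI, HTTPException

app = FastAPI()

ENTITIES = {  # tiny in-memory stand-in for a knowledge graph
    "wl:pizzeria-123": {
        "@type": "Restaurant",
        "name": "Gluten-Free Pizzeria Trastevere",
        "servesCuisine": "Pizza",
    },
}

@app.get("/entities/{entity_id}")
def get_entity(entity_id: str):
    """Disambiguated lookup by unique entity ID."""
    entity = ENTITIES.get(entity_id)
    if entity is None:
        raise HTTPException(status_code=404, detail="Unknown entity")
    return entity

@app.get("/search")
def semantic_search(q: str, entity_type: str | None = None):
    """Faceted search over entities (string match stands in for semantic ranking)."""
    hits = [(eid, e) for eid, e in ENTITIES.items()
            if q.lower() in e["name"].lower()
            and (entity_type is None or e["@type"] == entity_type)]
    return [{"id": eid, **e} for eid, e in hits]

# Run with: uvicorn entity_api:app --reload
```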
To assess your site’s current readiness for AI agents, use our AI SEO Audit Tool (still in beta testing) to evaluate your structured data implementation and identify optimization opportunities.
The Economic Reality: From Threat to Revenue Stream
The rise of centralized AI “answer engines” challenges publishers, as Google’s AI Overviews synthesize content without sending traffic back to the source. However, by implementing structured data protocols and agent-ready infrastructure, publishers can shift from being passively scraped to actively providing licensed data via reliable APIs.
Platforms like TollBit and emerging Cloudflare solutions enable publishers to charge AI agents per query while keeping human access free. This transforms AI scraping from threat to direct revenue stream.
WordLift’s Role in the Agentic Web
At WordLift, we recognized this shift early. While others focused on building better AI models, we’ve been building the infrastructure layer that makes the web truly queryable:
- Comprehensive entity recognition and knowledge graph construction
- Schema.org markup automation at scale
- API endpoints for semantic search and entity relationship queries
- Integration with emerging protocols like Model Context Protocol (MCP)
- Agentic SEO solutions for automated marketing tasks
Through our MCP configuration, we’re enabling websites to serve as live data endpoints powering AI workflows. What was once purely a threat is now a dual opportunity: a data-centric web driving marketing efficiency and the foundation for agent-driven commerce and content monetization.
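As a hedged illustration of that idea, the sketch below exposes a single knowledge-graph lookup as an MCP tool using the Python `mcp` package's FastMCP helper; the tool name, entity data, and lookup logic are hypothetical, not WordLift's actual configuration.

```python
# Minimal sketch of exposing a site's knowledge graph as an MCP tool, assuming the
# Python `mcp` package (FastMCP helper); tool name and lookup logic are illustrative.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("site-knowledge-graph")

ENTITIES = {
    "wl:pizzeria-123": {"@type": "Restaurant", "name": "Gluten-Free Pizzeria Trastevere"},
}

@mcp.tool()
def lookup_entity(entity_id: str) -> dict:
    """Return the structured description of an entity by its unique ID."""
    return ENTITIES.get(entity_id, {"error": "unknown entity"})

if __name__ == "__main__":
    mcp.run()  # serves the tool to MCP-compatible AI clients
```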
Underpinning this evolution is structured data—the rich metadata enabling intelligent agent behavior. As reasoning demands become more relational, the future belongs to GraphRAG: retrieving directly from knowledge graphs that provide cognitive scaffolding for reliable, complex reasoning.
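A tiny illustration of the GraphRAG idea: instead of retrieving flat text chunks, expand a multi-hop neighborhood around a seed entity in an explicit graph and hand those triples to the model as grounding. The graph and seed below are illustrative.

```python
# Tiny illustration of GraphRAG: retrieve a multi-hop neighborhood from an explicit
# knowledge graph and pass it to the model as grounding, instead of flat text chunks.
GRAPH = {  # subject -> list of (predicate, object)
    "Tiramisu": [("category", "Italian dessert"), ("ingredient", "Mascarpone")],
    "Mascarpone": [("madeFrom", "Cream"), ("origin", "Lombardy")],
    "Italian dessert": [("cuisine", "Italian cuisine")],
}

def neighborhood(seed: str, hops: int = 2) -> list[str]:
    """Collect triples up to `hops` away from the seed entity."""
    triples, frontier = [], [seed]
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for predicate, obj in GRAPH.get(node, []):
                triples.append(f"{node} --{predicate}--> {obj}")
                next_frontier.append(obj)
        frontier = next_frontier
    return triples

context = "\n".join(neighborhood("Tiramisu"))
print("Grounding context for the LLM:\n" + context)
```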
What This Means for Your Business
The question for every digital business is: when an AI agent queries your domain, will it find a flat document to parse, or a rich database to interrogate? Will it even be able to access your website?
The SEO community has the tools, expertise, and responsibility to shape this agentic web. By leading on structured data standards, building API-first content systems, and negotiating fair access for AI agents, we can ensure this shift benefits publishers, brands, and users—human or machine.
The publishers who succeed will be those who act now to:
- Establish agent-accessible APIs
- Implement comprehensive structured data markup
- Create machine-readable knowledge layers
The agentic web is already here. It’s on us to build it.