
There are no new ideas in AI, only new datasets

Written by Lucas Hendrich | Jul 2, 2025

Why data engineering and modern architecture are the cornerstones of enterprise AI success

In the current phase of AI adoption, many technology leaders remain focused on algorithmic novelty, often searching for a competitive edge through model optimization. However, recent industry trends and empirical evidence suggest that the true differentiator in large-scale AI systems is not the model itself, but the underlying data.


This principle is increasingly echoed across leading AI research. As Jack Morris argues in a recent post, “There are no new ideas in AI, only new datasets.” At Forte Group, we see this confirmed daily in our client work. The organizations that succeed with AI are not necessarily those that fine-tune the best models, but those that engineer systems capable of ingesting, organizing, and governing diverse and high-volume data at scale.


The upper bound of model performance

One of the most overlooked constraints in enterprise AI is what Morris calls the “upper bound” on performance: when a model has extracted all useful signal from a dataset, further improvements plateau. Research by OpenAI and DeepMind has repeatedly confirmed this: performance gains from new model architectures tend to diminish when data volume and diversity are held constant.


For example, the performance gap between GPT-3 and GPT-4 is marginal in several downstream tasks, despite the latter's architectural complexity. The largest improvements came not from novel training techniques, but from increased dataset size and improved modality coverage.
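
To make the shape of this constraint concrete, the short numeric sketch below assumes the kind of power-law data-scaling curve reported in the scaling-law literature; the constants L_INF, D_C, and ALPHA are illustrative placeholders, not measurements from any particular model.

# Illustrative only: a power-law data-scaling curve of the form
# loss(D) = L_inf + (D_c / D) ** alpha, with made-up constants.
# Each additional order of magnitude of data buys a smaller improvement
# as the curve approaches its floor, the "upper bound" discussed above.

L_INF = 1.7    # irreducible loss (hypothetical floor)
D_C = 5e9      # characteristic dataset size in tokens (hypothetical)
ALPHA = 0.095  # scaling exponent (hypothetical)

def loss(tokens: float) -> float:
    """Predicted loss for a fixed architecture trained on `tokens` tokens of data."""
    return L_INF + (D_C / tokens) ** ALPHA

if __name__ == "__main__":
    previous = None
    for tokens in [1e9, 1e10, 1e11, 1e12, 1e13]:
        current = loss(tokens)
        gain = "" if previous is None else f"  (gain {previous - current:.3f})"
        print(f"{tokens:10.0e} tokens -> loss {current:.3f}{gain}")
        previous = current

In this toy picture, a better architecture mainly shifts the constants; against a fixed dataset, the curve still flattens toward its floor, which is why model iteration alone eventually stalls.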


In practical terms, this means that enterprises investing in model iteration without investing in data pipelines and architecture are likely to experience declining returns.


Emerging frontiers: multimodal and sensor data

Most current LLMs and vision models rely on well-established datasets: Common Crawl, ImageNet, and similar. These are largely saturated.

The next wave of innovation will come from underexploited data modalities:

  • Video: Over 500 hours of video are uploaded to YouTube every minute. Yet training data for video understanding remains limited due to processing complexity and storage costs.

  • Sensor Data and Robotics: Real-time telemetry from industrial sensors, autonomous systems, and IoT devices remains vastly underutilized. According to McKinsey, less than 1% of generated IoT data is currently used for decision-making.

Enterprises with the ability to harness these modalities will unlock disproportionate value—not by building new models, but by structuring pipelines to access and learn from data others cannot.


The role of modern data architecture

To capitalize on these opportunities, organizations must first invest in scalable, secure, and composable data platforms. At Forte Group, we advise clients to prioritize four capabilities:

  1. Data Ingestion at Scale: Systems must support high-throughput, multimodal ingestion—streaming video, audio, sensor data, and structured enterprise sources in near real-time (a minimal sketch follows this list).

  2. Metadata and Governance Layers: With data sprawl comes risk. Strong metadata cataloging, lineage tracking, and access controls are essential to maintaining compliance and trust.

  3. Lakehouse and Delta Architectures: Hybrid models combining the structure of warehouses with the flexibility of lakes (e.g., Databricks’ Delta Lake or Snowflake’s Unistore) enable organizations to manage evolving data types without compromising performance.

  4. Embedded MLOps: Data without deployment is a sunk cost. Reproducible training pipelines, automated evaluation, and scalable serving infrastructure are essential to closing the loop between data collection and decision-making.
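
As a concrete illustration of capabilities 1 and 3, the sketch below uses PySpark Structured Streaming to ingest a hypothetical stream of sensor events from Kafka and land it in a Delta Lake table with additive schema evolution enabled. The broker address, topic name, event schema, and storage paths are placeholders, and the snippet assumes Spark 3.x with the delta-spark and Kafka connector packages available; treat it as a minimal sketch rather than a production pipeline.

# Minimal sketch: stream hypothetical IoT sensor events from Kafka into a
# governed Delta Lake table. Names, paths, and the schema are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("sensor-ingestion").getOrCreate()

# Expected shape of each JSON event (hypothetical schema).
event_schema = StructType([
    StructField("device_id", StringType()),
    StructField("metric", StringType()),
    StructField("value", DoubleType()),
    StructField("observed_at", TimestampType()),
])

# 1. High-throughput ingestion: read the raw event stream from Kafka.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
    .option("subscribe", "factory.sensors")             # hypothetical topic
    .load()
)

# 2. Parse the payload and keep ingestion metadata for lineage.
events = (
    raw.select(
        F.from_json(F.col("value").cast("string"), event_schema).alias("event"),
        F.col("timestamp").alias("ingested_at"),
    )
    .select("event.*", "ingested_at")
)

# 3. Land into a Delta table: ACID appends, schema evolution, time travel for audits.
query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/lake/_checkpoints/sensors")  # hypothetical path
    .option("mergeSchema", "true")   # tolerate additive schema changes from devices
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .start("/lake/bronze/sensor_events")                         # hypothetical path
)

query.awaitTermination()

Because the landing zone is a Delta table, its transaction log supplies much of the lineage and auditability called for in capability 2, and the same table can feed both analytics and the reproducible training pipelines described in capability 4.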

When these elements are in place, organizations gain not just AI capabilities, but adaptive intelligence—the ability to continually learn from the environment, customers, operations, and products.


Why this is a sound investment

From a financial perspective, data engineering and platform modernization offer a superior return on investment compared to one-off model development:

  • According to Gartner, 80% of AI project costs are related to data sourcing, integration, and infrastructure—not model training.

  • A 2023 report by Accenture found that enterprises with modern data foundations are 2.3x more likely to achieve measurable business outcomes from AI initiatives.

  • In our own delivery programs, clients that adopted cloud-native, governed data platforms saw a 30–50% reduction in time-to-insight, with new datasets made accessible to internal teams within days, not months.

A strategic imperative for CTOs and CIOs

The conclusion is clear: if you are still approaching AI as a model-centric endeavor, you are operating at a disadvantage. Data—not code—is the substrate from which intelligence emerges. Therefore, engineering access to new and diverse data types is not an infrastructure task—it is a strategic imperative.


As AI expands into domains like embedded systems, real-time analytics, and agent-based decision-making, the organizations that thrive will be those with the foresight to invest in data architecture now.


If your team is exploring how to modernize your data stack or evaluate readiness for AI adoption, Forte Group can help. Our approach combines pragmatic delivery with enterprise-grade governance—ensuring you do not just pilot AI, but integrate it sustainably into your business.