Why Orchestration and DataOps Will Redefine the Modern Data Stack
Over the past decade, a new data stack has emerged. Traditional systems were disrupted by the relentless growth in data, the rise of cloud computing, and increasingly complex use cases. Disruption was inevitable because legacy systems failed to deliver the scalability, economics, and interoperability modern businesses demanded.
Snowflake and Databricks stand out as leaders in the new data stack. They built multi-billion-dollar ARR businesses by offering elastically scalable, on-demand data compute platforms with exceptional user experiences. A big part of their success stems from complementing the cloud hyperscalers rather than competing with them, enabling the composable and often heterogeneous tech stack that enterprises prefer.
However, data compute platforms only address one layer of a deep, fragmented, and constantly evolving stack. If enterprises are going to actually see ROI from production AI, or truly leverage their data for competitive advantage, the higher layers of the data stack need to undergo their own disruption. We’ve already seen this beginning to play out.
DataOps: Redefining the data stack
The DataOps layer sits above compute. Its purpose is operationalizing data: transforming raw inputs into data products ready for consumption. It’s how you turn your troves of hard-earned data into something that actually helps your business and delights your customers. Most importantly, DataOps involves orchestrating the complex data pipelines and workflows that handle critical tasks like data ingestion, integration, transformation, ML/AI processes, and more. All of this has to be augmented with separate tooling to take care of the essential controls around that data: discovery, integration, observability, quality monitoring, and data governance.
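To make that concrete, here is a minimal sketch of what such a pipeline can look like when expressed as an Apache Airflow DAG using the TaskFlow API. The DAG name, task names, and the toy logic inside them are purely illustrative assumptions, not a reference implementation; a real pipeline would call out to source systems, a warehouse, and ML tooling.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def customer_data_product():
    """Illustrative DataOps pipeline: ingest -> transform -> publish."""

    @task
    def ingest() -> list[dict]:
        # Pull raw records from a source system (stubbed for the example).
        return [{"customer_id": 1, "spend": 120.0}, {"customer_id": 2, "spend": 75.5}]

    @task
    def transform(records: list[dict]) -> list[dict]:
        # Apply business logic to shape raw inputs into an analytics-ready form.
        return [{**r, "tier": "gold" if r["spend"] > 100 else "standard"} for r in records]

    @task
    def publish(rows: list[dict]) -> None:
        # Hand the finished data product to its consumers (dashboards, feature stores, apps).
        print(f"Published {len(rows)} rows")

    publish(transform(ingest()))


customer_data_product()
```

Because the orchestrator owns the dependency graph, it also accumulates the metadata about where data came from and where it went, which is the property the rest of this piece builds on.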
Figure 1: Fragmentation above the compute layer prevents businesses from unlocking value from their data
But right now, this part of the data stack is a bit of a mess. Remember the rise of the Modern Data Stack, and the subsequent think pieces about its collapse? Thanks in part to the startup boom fueled by 0% interest rates, it’s overloaded with vendors and fragmented tools, each handling just a tiny slice of the data lifecycle, and very few of them actually play well together. For enterprises, this is more than just a pain. It means frustration and friction. From simple dashboards to cutting-edge AI, a disjointed DataOps stack hamstrings their ability to turn data into a competitive advantage.
We don’t need to read market analysis to understand the impact. We talk to enough customers to know the majority of AI investments are getting stuck in prototyping, never making it to production or delivering ROI. The fallout is massive: budgets spiraling out of control, missed opportunities to act on transformative ideas, and no clear understanding of the value data products are supposed to bring. The promise of data is there, but for too many enterprises, it’s slipping out of reach.
Enterprises and their data teams should not have to put up with this situation much longer. The good news is that they won’t have to. A cohesive and streamlined data stack is coming in the form of a unified DataOps platform. And the companies that are getting unified DataOps right today are running circles around their competitors, without being beholden to any one tech vendor.
What do you need for DataOps?
The DataOps space is crowded, with contenders from every corner vying for the crown. While each category of tools plays an important role, the question remains: do they really unify the data stack and provide the operating system for DataOps the way enterprises need? Let’s break it down:
- Data Cataloging Tools: Data catalogs (e.g. Atlan / Alation / Collibra) give a passive view of your data—great for “reporting the news,” but limited for actually managing or operationalizing data at scale. To even “report the news,” cataloging tools rely on tight dependencies with systems like warehouses and workflow orchestration, not to mention cross-team buy-in and adoption. This often creates roadblocks. If DataOps unification is the goal, cataloging tools are not the best place to start.
- Data Observability Tools: Like data catalogs, data observability tools (e.g. Monte Carlo / Acceldata / Bigeye) also “report the news,” providing visibility but often without real control of data platform components. They can notice the pan is burning but they can’t move the pan off the stove. Although the insights from data observability tools are invaluable, this lack of control makes them ill-suited as the foundational building block for an operating system for DataOps.
- Data Transformation Tools: Data transformation tools (e.g. dbt / SQLMesh) excel at enabling end users to repeatably prep data where complex logic is required. But they can feel like just one step in the data lifecycle journey. Unless tightly coupled with a system that delivers a comprehensive view of the entire workflow, they have no clue where the data came from or how it’s going to be used downstream, both of which are critical for seamless DataOps. Plus, they’re often SQL-centric, built for analysts focused on reporting, not engineers working on AI or software applications.
- Data Integration Tools: Whether it’s legacy tools from vendors like Informatica or the more modern approach of vendors like Fivetran, data integration vendors do the job of moving data from point A to point B well (it just might hurt your wallet!). But like transformation tools, they only address one part of the data lifecycle, with no context on how the data is consumed. It's a bit like driving on a road with no idea where it's leading you or what you are going to do when you get there. And for what it’s worth, when we talk to customers, this is the area where they are most aggressively exploring the potential of generative AI to break the stranglehold of vendor control and exploitative pricing.
To be clear, we believe that all of these categories have an important part to play in operationalizing data. But if you really want unification, it has to start with orchestration.
Why orchestration-first?
With control, management, and visibility of both data and its associated metadata, orchestration offers a unique architectural advantage over every other category of tooling when it comes to unifying the data stack.
Orchestration connects to all your tools and data sources. It knows where your data comes from, where it’s going, and how it’s being used. With deep integration and unparalleled context, orchestration is the ultimate control and unification layer, letting teams quickly adapt workflows, adjust pipelines, and stay agile as priorities shift.
It’s not just about connecting systems—it’s about doing it faster and better. Orchestration lets you plug and play with the best tools on the market, so you’re never locked into outdated tech. It speeds up time-to-market for data products by giving you control over your entire data ecosystem, enabling data to flow seamlessly from raw sources to polished assets ready for consumption by any use case.
Beyond just handling data pipelines, orchestration sets you up for what’s next. It’s the foundation for data science, machine learning, and AI operations, while also giving businesses the flexibility to avoid vendor lock-in and maintain strategic leverage. In a fast-changing world of data, orchestration isn’t just an advantage—it’s a necessity.
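As a sketch of what that plug-and-play control looks like in practice, the hypothetical DAG below wraps two common lifecycle steps: a dbt transformation triggered as a shell command, and a lightweight Python quality gate that runs before anything downstream consumes the data. The `dbt run` command and the project path are assumptions about your environment, not part of Airflow itself.

```python
from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="orchestrated_lifecycle",
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
) as dag:
    # Transformation step: delegate modeling logic to dbt (path is illustrative).
    run_dbt_models = BashOperator(
        task_id="run_dbt_models",
        bash_command="cd /opt/dbt/analytics && dbt run",
    )

    @task
    def quality_gate() -> None:
        # Placeholder data-quality check; a real check would query the warehouse
        # or call an observability tool before downstream consumers see the data.
        row_count = 42  # pretend result of a validation query
        if row_count == 0:
            raise ValueError("No rows produced; failing the pipeline before publish")

    @task
    def publish() -> None:
        # Hand off the validated data product to its consumers.
        print("Data product published")

    run_dbt_models >> quality_gate() >> publish()
```

Because the orchestrator sits above every step, swapping the transformation engine, the quality check, or the publish target is a change to one task rather than a re-platforming exercise.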
As DataOps continues to consolidate, it’s no surprise to see vendors from other parts of the stack trying to move into the orchestration layer—whether by acquiring an existing solution or building their own. But as we’ve all witnessed, platforms assembled through acquisition often end up feeling disjointed.
Meanwhile, creating a robust orchestration platform is no small task. Look at Apache Airflow®, for instance. It has taken over a decade of development, plus extensive community efforts, to build out the features and more than 1,600 ecosystem integrations that data teams depend on daily. Even with advances in generative AI, there’s no shortcut for cultivating a strong community or replicating the hard-earned experience of supporting mission-critical workflows in the world’s largest enterprises.
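Those ecosystem integrations ship as provider packages. As one hedged example (exact import paths and operator names vary with the provider versions you install), a pipeline can wait for a file to land in Amazon S3 and then run SQL in a warehouse using off-the-shelf operators; the bucket, table names, and the `snowflake_default` connection are illustrative and would be configured separately in Airflow:

```python
# Assumes: pip install apache-airflow-providers-amazon apache-airflow-providers-common-sql
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="provider_integrations_example",
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
) as dag:
    # Wait for the day's raw file to land in S3 (bucket and key are illustrative).
    wait_for_raw_file = S3KeySensor(
        task_id="wait_for_raw_file",
        bucket_key="s3://example-raw-bucket/events/{{ ds }}/events.json",
        aws_conn_id="aws_default",
    )

    # Refresh a table in the warehouse via a generic SQL operator; conn_id points
    # at a Snowflake (or any SQL) connection defined in Airflow.
    refresh_events_table = SQLExecuteQueryOperator(
        task_id="refresh_events_table",
        conn_id="snowflake_default",
        sql="CREATE OR REPLACE TABLE analytics.events_daily AS SELECT * FROM raw.events",
    )

    wait_for_raw_file >> refresh_events_table
```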
Airflow: Leading the orchestration charge
Apache Airflow isn’t just the leader in orchestration—it’s the industry standard. No other solution, open-source or proprietary, comes close to its adoption or impact. With over 3,000 contributors—more than Apache Spark® and Apache Kafka®—Airflow has become a global phenomenon. It’s downloaded over 30 million times every month, and 2024 alone saw more downloads than all previous years combined.
The demand for reliable, secure, and scalable orchestration has never been higher, and Airflow’s user base reflects that. Once primarily used by data engineers, it’s now also a critical tool for AI/ML engineers, software developers, and teams building data-driven apps. Generative AI, MLOps, and real-time analytics rely on Airflow to deliver the high-quality, trustworthy data products these use cases demand.
Nowhere was this expansion of users and use cases better demonstrated than at the 2024 Airflow Summit. There, some of the world’s most advanced and sophisticated companies came to share their Airflow experiences, learnings, and results. A small sample of the headlines is below, with all of the details available in the “Airflow in Action” series posted to the Astronomer blog:
- Uber: Orchestrating 200,000 data pipelines relied on by 1,000+ internal teams.
- Stripe: Processing petabytes of data daily to power payments analytics.
- Apple: Accelerating the deployment of data science and ML experiments to production.
- Bloomberg: Cutting the time to build financial data products by 50%.
- LinkedIn: Managing 1 million infrastructure and service deployments every month.
- Ford: Training autonomous driving models on 1 petabyte of new sensor data weekly.
- Robinhood: Powering 4,000 financial workflows managing $160 billion in assets.
- Cloudflare: Deploying AI GPU clusters to 100 data centers in just three months.
- Instacart: Orchestrating 2,200 data pipelines serving 85,000 stores and millions of customers.
- Panasonic: Driving EV innovation at the Tesla Gigafactory with operational analytics.
These stories are just the beginning. Airflow’s unmatched scalability and flexibility make it the backbone of modern orchestration, trusted by the most innovative companies in the world.
Astronomer: From orchestration to DataOps
At Astronomer, we’re supporting the rise of DataOps, powered by the unstoppable momentum of Apache Airflow and our orchestration-first platform, Astro.
Astronomer leads the Airflow ecosystem, managing 100% of new releases, contributing 55% of the codebase, and employing 18 of the top 25 committers and 8 PMC members.
We’ve built Astro on this foundation, taking Airflow further with a fully managed, cross-cloud orchestration platform. Astro isn’t just Airflow; it’s more. It gives data teams exclusive capabilities to seamlessly BUILD, RUN, and OBSERVE all their data products in one place. That’s why over 700 customers, from startups to Fortune 500 enterprises, trust Astro to power their data operations.
Figure 2: Astro is positioned to be the leader in unified DataOps as the data landscape continues to evolve
But we’re not stopping here. Customers are asking for more, and we’re delivering. Last year, we announced plans to integrate data observability directly into orchestration, and we intend to incorporate other layers of the DataOps stack over time.
We are also building AI-driven enhancements into Astro today that boost reliability, efficiency, and productivity across the entire data lifecycle:
- Build: Simplifies pipeline creation with natural language authoring and enforces best practices through code consolidation and pattern recognition.
- Run: Automates pipeline troubleshooting and resource tuning, enabling self-healing pipelines and zero-config compute.
- Observe: Proactively detects data issues and provides expert recommendations to optimize costs, reliability, and upgrades.
The result? A unified DataOps platform that cuts through today’s chaos above the compute layer, replacing fragmentation with end-to-end visibility, control, and automation. With Astro, data teams will achieve massive gains in reliability, efficiency, productivity, and the business value their data products deliver.
As we close out FY25, we’ve exceeded every target set by our board. Astronomer is a rocket ship helping the most innovative businesses solve their biggest data challenges. If you are top talent looking to work on one of the most important problems in the industry, we’d love to speak with you!