Data Engineering Podcast

495 episodes

Unfreezing The Data Lake: The Future-Proof File Format
2025-12-29 | 59 mins.
Summary
In this episode PhD researcher Xinyu Zeng talks about F3, the “future-proof file format” designed to address today’s hardware realities and evolving workloads. He digs into the limitations of Parquet and ORC - especially CPU-bound decoding, metadata overhead for wide-table projections, and poor random-access behavior for ML training and serving - and how F3 rethinks layout and encodings to be efficient, interoperable, and extensible. Xinyu explains F3’s two major ideas: a decoupled, flexible layout that separates IO units, dictionary scope, and encoding choices; and self-decoding files that embed WebAssembly kernels so new encodings can be adopted without waiting on every engine to upgrade. He discusses how table formats and file formats should increasingly be decoupled, potential synergies between F3 and table layers (including centralizing and verifying WASM kernels), and future directions such as extending WASM beyond encodings to indexing or filtering.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
You’re a developer who wants to innovate; instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build
Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
Your host is Tobias Macey and today I'm interviewing Xinyu Zeng about the future-proof file format.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the F3 project is and the story behind it?
We have several widely adopted file formats (Parquet, ORC, Avro, etc.). Why do we keep creating new ones?
Parquet is the format with perhaps the broadest adoption. What are the challenges that such wide use poses when trying to modify or extend the specification?
The recent focus on vector data is perhaps the most visible change in storage requirements. What are some of the other custom types of data that might need to be supported in the file storage layer?
Can you describe the key design principles of the F3 format?
What are the engineering challenges that you faced while developing your implementation of the F3 proof-of-concept?
The key challenge of introducing a new format is that of adoption. What are the provisions in F3 that might simplify the adoption of the format in the broader ecosystem? (e.g. integration with compute frameworks)
What are some examples of features in data lake use cases that could be enabled by F3?
What are some of the other ideas/hypotheses that you developed and discarded in the process of your research?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on F3?
What do you have planned for the future of F3?

Contact Info
Personal Website

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
F3 Paper
Formats Evaluation Paper
F3 Github
SAL Paper
RisingWave
Tencent Cloud
Parquet
Arrow
Andy Pavlo
Wes McKinney
CMU Public Seminar
VLDB
ORC
Protocol Buffers
Lance
PAX == Partition Attributes Across
WASM == Web Assembly
DataFusion
DuckDB
DuckLake
Velox
Vortex File Format

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

From Context to Semantics: How Metadata Powers Agentic AI
2025-12-21 | 1h 6 mins.
Summary
In this episode Suresh Srinivas and Sriharsha Chintalapani explore how metadata platforms are evolving from human-centric catalogs into the foundational context layer for AI and agentic systems. They discuss the origins and growth of OpenMetadata and Collate, why “context” is necessary but “semantics” is critical for precise AI outcomes, and how a schema-first, API-first, unified platform enables discovery, observability, and governance in one workflow. They share how AI agents can now automate documentation, classification, data quality testing, and enforcement of policies, and why aligning governance with user identity and intent is essential as agentic access scales. They also dig into scalability strategies, MCP-based agent workflows, AI governance (including model/agent tracking), and the emerging convergence of big data with ontologies to deliver machine-understandable meaning.

Announcements
Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey and today I'm interviewing Suresh Srinivas and Sriharsha Chintalapani about how metadata catalogs provide the context clues necessary to give meaning to your data for AI systems.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the roles that metadata catalogs are playing in the current state of the ecosystem?
How has the OpenMetadata platform evolved over the past 4 years?
How has the focus on LLMs/generative AI changed the trajectory of services like OpenMetadata?
The initial set of use cases for data catalogs was to facilitate discovery and documentation of data assets for human consumption. What are the structural elements of that effort that have paid dividends for an AI audience?
How does the AI audience change the requirements around the cataloging and presentation of metadata?
One of the constant challenges in data infrastructure now is the tension of making data accessible to AI systems (agentic or otherwise) and incorporating AI into the inner loop of the service. What are the opportunities for bringing AI inside the boundaries of a system like OpenMetadata vs. as a client or consumer of the platform?
The key phrase of the past ~2 years is "context engineering". What role does the metadata catalog play in that undertaking?
What are the capabilities that the catalog needs to be able to effectively populate and curate that context?
How much awareness does the LLM or agent need to have to be able to use the catalog effectively?
What does a typical workflow/agent loop look like when it is using something like OpenMetadata in pursuit of knowledge that it needs to achieve an objective?
How do agentic use cases strain the existing set of governance frameworks?
What new considerations (procedural or technical) need to be factored into governance practices to balance velocity with security?
What are the most interesting, innovative, or unexpected ways that you have seen OpenMetadata/Collate used in AI/agentic contexts?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on OpenMetadata/Collate?
When is OpenMetadata/Collate the wrong choice?
What do you have planned for the future of OpenMetadata?

Contact Info
Suresh: LinkedIn
Sriharsha: LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
OpenMetadata Podcast Episode
Hadoop
Hortonworks
Context Engineering
MCP == Model Context Protocol
JSON Schema
dbt
LangSmith
OpenMetadata MCP Server
API Gateway

From Data Engineering to AI Engineering: Where the Lines Blur
2025-12-14 | 26 mins.
Summary
In this solo episode of the Data Engineering Podcast, host Tobias Macey reflects on how AI has transformed the practice and pace of data engineering over time. Starting from its origins in the Hadoop and cloud warehouse era, he explores the discipline's evolution through ML engineering and MLOps to today's blended boundaries between data, ML, and AI engineering. The conversation covers how unstructured data is becoming more prominent, vectors and knowledge graphs are emerging as key components, and reliability expectations are changing due to interactive user-facing AI. The host also delves into process changes, including tighter collaboration, faster dataset onboarding, new governance and access controls, and the importance of treating experimentation and evaluation as fundamental testing practices.

Announcements
Your host is Tobias Macey and today I'm reflecting on the increasingly blurry boundaries between data engineering and AI engineering.

Interview
Introduction
I started this podcast in 2017, right when the term "Data Engineer" was becoming widely used for a specific job title with a reasonably well-understood set of responsibilities. This was in response to the massive hype around "data science" and consequent hiring sprees that characterized the mid-2000s to mid-2010s. The introduction of generative AI and AI Engineering to the technical ecosystem is changing the scope of responsibilities for data engineers and other data practitioners. Of note is the fact that:
AI models can be used to process unstructured data sources into structured data assets
AI applications require new types of data assets
The SLAs for data assets related to AI serving are different from BI/warehouse use cases
The technology stacks for AI applications aren't necessarily the same as for analytical data pipelines
Because everything is so new there is not a lot of prior art, and the prior art that does exist isn't necessarily easy to find because of differences in terminology
Experimentation has moved from being just an MLOps capability into being a core need for organizations

Contact Info
Email

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
AI Engineering Podcast

Malloy: Hierarchical Data, Semantic Models, and the Future of Analytics
2025-12-08 | 58 mins.
Summary
In this episode Michael Toy, co-creator of Malloy, talks about rethinking how we work with data beyond SQL. Michael shares the origins of Malloy from his and Lloyd Tabb’s experience at Looker, why SQL’s mental model often fights human problem solving, and how Malloy aims to be a composable, maintainable language that treats SQL as the assembly layer rather than something humans should write. He explores Malloy’s core ideas: semantic modeling tightly coupled with a query language, hierarchical data as the default mental model, and preserving context so analysis stays interactive and open-ended. He also digs into the developer experience and ecosystem: Malloy’s TypeScript implementation, VS Code integration, CLI, emerging notebook support, and how Malloy can sit alongside or replace parts of existing transformation workflows. Michael discusses practical trade-offs in language design, the surprising fit for LLM-generated queries, and near-term roadmap areas like dimensional filtering, better aggregation strategies across levels, and closing gaps that still require escaping to SQL. He closes with an invitation to contribute to the open-source project and help shape its evolution.

Announcements
Your host is Tobias Macey and today I'm interviewing Michael Toy about Malloy, a modern language for building composable and maintainable analytics and data models on relational engines.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Malloy is and the story behind it?
What is the core problem that you are trying to solve with Malloy?
There are countless projects that aim to reimagine/reinvent/replace SQL. What are the factors that make Malloy stand out in your mind?
Who are the target personas for the Malloy language?
One of the key success factors for any language is the ecosystem around it and the integrations available to it. How does Malloy fit in the toolchains and workflows for data engineers and analysts?
Can you describe the key design and syntax elements of Malloy?
How have the scope and focus of the language evolved since you first started working on it?
How do the structure and semantics of Malloy change the ways that teams think about their data models?
SQL-focused tools have gained prominence as the means of building the transformation stage of data pipelines. How would you characterize the capabilities of Malloy as a tool for building translation pipelines?
What are the most interesting, innovative, or unexpected ways that you have seen Malloy used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Malloy?
When is Malloy the wrong choice?
What do you have planned for the future of Malloy?

Contact Info
Website

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Malloy
Lloyd Tabb
SQL
Looker
LookML
dbt
Relational Algebra
TypeScript
Ruby
Truffle
Malloy VSCode Plugin
Malloy CLI
Malloy Pick Statement

Blurring Lines: Data, AI, and the New Playbook for Team Velocity
2025-11-24 | 1h
Summary
In this crossover episode, Max Beauchemin explores how multiplayer, multi‑agent engineering is transforming the way individuals and teams build data and AI systems. He digs into the shifting boundary between data and AI engineering, the rise of “context as code,” and how just‑in‑time retrieval via MCP and CLIs lets agents gather what they need without bloating context windows. Max shares hard‑won practices from going “AI‑first” for most tasks, where humans focus on orchestration and taste, and the new bottlenecks that appear (code review, QA, async coordination) when execution accelerates 2–10x. He also dives deep into Agor, his open‑source agent orchestration platform: a spatial, multiplayer workspace that manages Git worktrees and live dev environments, templatizes prompts by workflow zones, supports session forking and sub‑sessions, and exposes an internal MCP so agents can schedule, monitor, and even coordinate other agents.

Announcements
Your host is Tobias Macey and today I'm interviewing Maxime Beauchemin about the impact of multi-player multi-agent engineering on individual and team velocity for building better data systems.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the types of work that you are relying on AI development agents for?
As you bring agents into the mix for software engineering, what are the bottlenecks that start to show up?
In my own experience there are a finite number of agents that I can manage in parallel. How does Agor help to increase that limit?
How does making multi-agent management a multi-player experience change the dynamics of how you apply agentic engineering workflows?

Contact Info
LinkedIn

Links
Agor
Apache Airflow
Apache Superset
Preset
Claude Code
Codex
Playwright MCP
Tmux
Git Worktrees
Opencode.ai
GitHub Codespaces
Ona

About Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
