
Data Engineering Podcast

Tobias Macey

503 episodes

  • Data Engineering Podcast

    From Data Models to Mind Models: Designing AI Memory at Scale

    2026-02-22 | 57 mins.
    Summary
    In this episode of the Data Engineering Podcast, Vasilije "Vas" Markovic, founder of Cognee, discusses building agentic memory, a crucial aspect of artificial intelligence that enables systems to learn, adapt, and retain knowledge over time. He explains the concept of agentic memory, highlighting the distinction between permanent and session memory, the combination of graph and vector layers, latency trade-offs, and multi-tenant isolation so that knowledge can be shared or protected safely. The conversation covers practical considerations such as storage choices (Redis, Qdrant, LanceDB, Neo4j), metadata design, temporal relevance and decay, and emerging research areas like trace-based scoring and reinforcement learning for improving retrieval. Vas shares real-world examples of agentic memory in action, including applications in pharma hypothesis discovery, logistics control towers, and cybersecurity feeds, as well as scenarios where simpler approaches may suffice. He also offers guidance on when to add memory, pitfalls to avoid (naive summarization, uncontrolled fine-tuning), human-in-the-loop realities, and Cognee's future plans: revamped session/long-term stores, decision-trace research, and richer time and transformation mechanisms. Additionally, Vas touches on policy guardrails for agent actions and the potential for more efficient "pseudo-languages" for multi-agent collaboration.
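
    The temporal decay and session/long-term split described above can be made concrete with a small sketch. This is plain Python, not Cognee's API; the record fields, half-life weighting, and scoring are illustrative assumptions about how decay-weighted retrieval over tenant-scoped memories might look.

    import math
    import time
    from dataclasses import dataclass, field

    @dataclass
    class MemoryRecord:
        text: str
        embedding: list[float]          # produced by any embedding model
        scope: str                      # "session" or "long_term"
        tenant_id: str                  # multi-tenant isolation key
        created_at: float = field(default_factory=time.time)

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def retrieve(store, query_emb, tenant_id, scope, half_life_days=30.0, k=5):
        """Rank memories by similarity, discounted by exponential time decay."""
        now = time.time()
        scored = []
        for rec in store:
            if rec.tenant_id != tenant_id or rec.scope != scope:
                continue  # never leak memories across tenants or scopes
            age_days = (now - rec.created_at) / 86400.0
            decay = 0.5 ** (age_days / half_life_days)  # half-life style forgetting
            scored.append((cosine(query_emb, rec.embedding) * decay, rec))
        return [rec for _, rec in sorted(scored, key=lambda p: p[0], reverse=True)[:k]]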

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks'—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.
    Your host is Tobias Macey and today I'm interviewing Vasilije Markovic about agentic memory architectures and applications

    Interview

    Introduction
    How did you get involved in the area of data management?
    Can you start by giving an overview of the different elements of "memory" in an agentic context?
    storage and retrieval mechanisms
    how to model memories
    how does that change as you go from short-term to long-term?
    managing scope and retrieval triggers
    What are some of the useful triggers in an agent architecture to identify whether/when/what to create a new memory?
    How do things change as you try to build a shared corpus of memory across agents?
    What are the most interesting, innovative, or unexpected ways that you have seen agentic memory used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cognee?
    When is a dedicated memory layer the wrong choice?
    What do you have planned for the future of Cognee?

    Contact Info

    LinkedIn

    Parting Question

    From your perspective, what is the biggest gap in the tooling or technology for data management today?

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Links

    Cognee
    AI Engineering Podcast Episode
    Kimball Memory
    Cognitive Science
    Context Window
    RAG == Retrieval Augmented Generation
    Memory Types
    Redis Vector Store
    Qdrant
    Vector on Edge
    Milvus
    LanceDB
    KuzuDB
    Neo4j
    Mem0
    Zep Graphiti
    A2A (Agent-to-Agent) Protocol
    Snowplow
    Reinforcement Learning
    Model Finetuning
    OpenClaw

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
  • Data Engineering Podcast

    Prompt Management, Tracing, and Evals: The New Table Stakes for GenAI Ops

    2026-02-15 | 50 mins.
    Summary
    In this episode of the Data Engineering Podcast, Aman Agarwal, creator of OpenLit, discusses the operational groundwork required to run LLM-powered applications reliably and cost-effectively. He highlights common blind spots that teams face, including opaque model behavior, runaway token costs, and brittle prompt management, and explains how OpenTelemetry-native observability can turn these black-box interactions into stepwise, debuggable traces across models, tools, and data stores. Aman showcases OpenLit's approach to open standards, vendor-neutral integrations, and practical features such as fleet-managed OTEL collectors, zero-code Kubernetes instrumentation, prompt and secret management, and evaluation workflows. They also explore experimentation patterns, routing across models, and closing the loop from evals to prompt/dataset improvements, demonstrating how better visibility reshapes design choices from prototype to production. Aman shares lessons learned building in the open, where OpenLit fits and doesn't, and what's next in context management, security, and ecosystem integrations, providing resources and examples of multi-database observability deployments for listeners.
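
    OpenLit itself instruments applications with zero code changes; the sketch below only illustrates the underlying OpenTelemetry idea of turning each model call into a span carrying token attributes, so it shows up as a debuggable step in a trace. The call_model helper and attribute names are illustrative assumptions, not OpenLit's actual schema.

    from opentelemetry import trace

    tracer = trace.get_tracer("llm-app")

    def call_model(prompt: str) -> dict:
        # Stand-in for a real provider call; returns the completion plus token usage.
        return {"text": "example completion", "prompt_tokens": 42, "completion_tokens": 7}

    def answer(prompt: str) -> str:
        # Each request becomes a span, so latency, token counts, and errors show up
        # as one step in a trace instead of a black-box call. Without an exporter
        # configured, the OpenTelemetry API falls back to a no-op tracer.
        with tracer.start_as_current_span("llm.chat") as span:
            result = call_model(prompt)
            span.set_attribute("llm.prompt_tokens", result["prompt_tokens"])
            span.set_attribute("llm.completion_tokens", result["completion_tokens"])
            return result["text"]

    print(answer("Summarize last week's pipeline failures"))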

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks'—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.
    Your host is Tobias Macey and today I'm interviewing Aman Agarwal about the operational investments that are necessary to ensure you get the most out of your AI models

    Interview

    Introduction
    How did you get involved in the area of AI/data management?
    Can you start by giving your assessment of the main blind spots that are common in the existing AI application patterns?
    As teams adopt agentic architectures, how common is it to fall prey to those same blind spots?
    There are numerous tools/services available now focused on various elements of "LLMOps". What are the major components necessary for a minimum viable operational platform for LLMs?
    There are several areas of overlap, as well as disjoint features, in the ecosystem of tools (both open source and commercial). How do you advise teams to navigate the selection process? (point solutions vs. integrated tools, and handling frameworks with only partial overlap)
    Can you describe what OpenLit is and the story behind it?
    How would you characterize the feature set and focus of OpenLit compared to what you view as the "major players"?
    Once you have invested in a platform like OpenLit, how does that change the overall development workflow for the lifecycle of AI/agentic applications?
    What are the most complex/challenging elements of change management for LLM-powered systems? (e.g. prompt tuning, model changes, data changes, etc.)
    How can the information collected in OpenLit be used to develop a self-improvement flywheel for agentic systems?
    Can you describe the architecture and implementation of OpenLit?
    How have the scope and goals of the project changed since you started working on it?
    Given the foundational aspects of the project that you have built, what are some of the adjacent capabilities that OpenLit is situated to expand into?
    What are the sharp edges and blind spots that are still challenging even when you have OpenLit or similar integrated?
    What are the most interesting, innovative, or unexpected ways that you have seen OpenLit used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on OpenLit?
    When is OpenLit the wrong choice?
    What do you have planned for the future of OpenLit?

    Contact Info

    LinkedIn

    Parting Question

    From your perspective, what is the biggest gap in the tooling or technology for data/AI management today?

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Links

    OpenLit
    Fleet Hub
    OpenTelemetry
    LangFuse
    LangSmith
    TensorZero
    AI Engineering Podcast Episode
    Traceloop
    Helicone
    ClickHouse

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
  • Data Engineering Podcast

    From Legacy to AI-Ready: How MongoDB AMP Accelerates Modernization

    2026-02-08 | 46 mins.
    Summary
    In this episode, Shilpa Kolhar, SVP of Product and Engineering at MongoDB, discusses using MongoDB as a unified foundation for AI-driven and agentic applications. She explains how the Application Modernization Platform (AMP) accelerates the transition from legacy relational systems to a document-first architecture, driven by the need for AI-readiness and speed of change. Shilpa highlights MongoDB's features, such as its native JSON document model, Atlas Vector Search, auto-embeddings, and integrated search, which help eliminate drift and latency across operational data, indexing, and vectors, emphasizing the importance of keeping context, transactions, and embeddings together for real-time AI use cases. She shares best practices for re-architecting legacy systems, including schema validation and versioning patterns to tame schema drift, aggregation pipelines for consistent reads, and pragmatic standardization across services, while also detailing AMP's approach to scoping large estates and the balance of LLM-powered automation with human-in-the-loop governance.
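
    The schema-validation and versioning pattern mentioned above can be read as a minimal pymongo sketch. The collection name, fields, and schemaVersion convention are illustrative assumptions, not MongoDB AMP output.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # illustrative connection string
    db = client["appdb"]

    # Enforce a minimal shape on write and stamp every document with a schema version,
    # so old documents can be detected and migrated instead of silently drifting.
    db.create_collection("customers", validator={
        "$jsonSchema": {
            "bsonType": "object",
            "required": ["name", "email", "schemaVersion"],
            "properties": {
                "name": {"bsonType": "string"},
                "email": {"bsonType": "string"},
                "schemaVersion": {"bsonType": "int"},
            },
        }
    })

    db.customers.insert_one({"name": "Ada", "email": "ada@example.com", "schemaVersion": 2})

    # Reads can normalize older versions in an aggregation pipeline instead of in app code.
    pipeline = [{"$addFields": {"schemaVersion": {"$ifNull": ["$schemaVersion", 1]}}}]
    docs = list(db.customers.aggregate(pipeline))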

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks'—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.
    Your host is Tobias Macey and today I'm interviewing Shilpa Kolhar about using MongoDB as the foundation for AI-driven applications
    Interview
    Introduction
    How did you get involved in the area of data management?
    Can you describe what MongoDB is and the core primitives that it offers?
    The MongoDB engine has gone through substantial evolution since it was first introduced nearly two decades ago. What are some of the most notable features that have been added in recent years?
    You recently launched the MongoDB Application Modernization Platform (AMP). What are the key elements of modernization that it is focused on?
    How do the core primitives of the MongoDB engine align with modernization objectives?
    There is a lot of attention being paid now to AI applications where data is the most critical element for success. What are the features of MongoDB that lend themselves to being the context store for generative AI services?
    Besides the data used for context and grounding, AI applications also want to track user interactions and form short- and long-term memory to improve the system over time. How can MongoDB assist in that work as well?
    While the lack of schema enforcement on write can be beneficial to rapid evolution of software, it can also be a detriment if not managed well. How can MongoDB help in avoiding schema drift over time that leads to old data being incompatible with current code?
    What are the most interesting, innovative, or unexpected ways that you have seen MongoDB used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on MongoDB and application modernization?
    When is MongoDB/AMP the wrong choice?
    What do you have planned for the future of AMP?
    Contact Info
    LinkedIn
    Parting Question
    From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Closing Announcements
    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
    Links
    MongoDB
    MongoDB AMP
    Google Gemini
    Voyage AI
    Qdrant
    ChromaDB
    Weaviate
    Pinecone
    MongoDB Autoembedding
    Retool
    ODM == Object Document Mapper
    RAG == Retrieval Augmented Generation
    Agentic Memory
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
  • Data Engineering Podcast

    Branches, Diffs, and SQL: How Dolt Powers Agentic Workflows

    2026-02-01 | 56 mins.
    Summary
    In this episode Tim Sehn, founder and CEO of DoltHub, talks about Dolt - the world’s first version‑controlled SQL database - and why Git‑style semantics belong at the heart of data systems and AI workflows. Tim explains how Dolt combines a MySQL/Postgres‑compatible interface with a novel storage engine built on a “Prollytree” to enable fast, row‑level branching, merging, and diffs of both schema and data. He digs into real production use cases: powering applications that expose version control to end users, reproducible ML feature stores, managing massive configuration for games, and enabling safe agentic writes via branch‑based review flows. He compares Dolt’s approach to LakeFS, Neon, and PlanetScale, and explores developer workflows unlocked by decentralized clones, full audit logs, and PR‑style data reviews.
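
    The branch-per-agent review flow described above can be sketched over an ordinary MySQL connection, since Dolt speaks the MySQL wire protocol. The procedure and table-function names follow Dolt's documented SQL interface; the table, columns, branch name, and connection details are illustrative assumptions.

    import pymysql

    # Dolt serves the MySQL protocol, so a standard client works; autocommit keeps
    # the SQL transaction state out of the way of the Dolt-level commit below.
    conn = pymysql.connect(host="127.0.0.1", user="root", database="mydb", autocommit=True)
    cur = conn.cursor()

    # The agent writes on its own branch, never on main.
    cur.execute("CALL DOLT_CHECKOUT('-b', 'agent/update-prices')")
    cur.execute("UPDATE products SET price = price * 1.05 WHERE category = 'books'")
    cur.execute("CALL DOLT_COMMIT('-a', '-m', 'Agent: 5% price adjustment for books')")

    # A human reviews the row-level diff against main before anything merges.
    cur.execute("SELECT to_id, from_price, to_price, diff_type "
                "FROM dolt_diff('main', 'agent/update-prices', 'products')")
    for row in cur.fetchall():
        print(row)

    # Only after review does the change land on main.
    cur.execute("CALL DOLT_CHECKOUT('main')")
    cur.execute("CALL DOLT_MERGE('agent/update-prices')")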

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks'—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.
    Your host is Tobias Macey and today I'm interviewing Tim Sehn about Dolt, a version controlled database engine and its applications for agentic workflows

    Interview

    Introduction
    How did you get involved in the area of data management?
    Can you describe what Dolt is and the story behind it?
    What are the key use cases that you are focused on solving by adding version control to the database layer?
    There are numerous projects related to different aspects of versioning in different data contexts (e.g. LakeFS, Datomic, etc.). What are the versioning semantics that you are focused on?
    You position Dolt as "the database for AI". How does data versioning relate to AI use cases?
    What types of AI systems are able to make best use of Dolt's versioning capabilities?
    Can you describe how Dolt and Doltgres are implemented?
    How have the design and scope of the project changed since you first started working on it?
    What are some of the architecture and integration patterns around relational databases that change when you introduce version control semantics as a core primitive?
    What are some anti-patterns that you have seen teams develop around Dolt's versioning functionality?
    What are the most interesting, innovative, or unexpected ways that you have seen Dolt used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on Dolt?
    When is Dolt the wrong choice?
    What do you have planned for the future of Dolt?

    Contact Info

    LinkedIn

    Parting Question

    From your perspective, what is the biggest gap in the tooling or technology for data management today?

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Links

    Dolt
    DoltHub
    Stock Market Data
    LakeFS
    Datomic
    Git
    MySQL
    Prolly Tree
    Neon
    Django
    Feature Store
    MCP Server
    Nessie
    Iceberg
    PlanetScale
    O(N log N) Big O Complexity
    B-Tree
    Git Merge
    Git Rebase
    AST == Abstract Syntax Tree
    Supabase
    CockroachDB
    Document Database
    MongoDB
    Gastown
    Beads

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
  • Data Engineering Podcast

    Logical First, Physical Second: A Pragmatic Path to Trusted Data

    2026-01-25 | 40 mins.
    Summary
    In this episode of the Data Engineering Podcast Jamie Knowles, Product Director for ER/Studio, talks about data architecture and its importance in driving business meaning. He discusses how data architecture should start with business meaning, not just physical schemas, and explores the pitfalls of jumping straight to physical designs. Jamie shares his practical definition of data architecture centered on shared semantic models that anchor transactional, analytical, and event-driven systems. The conversation covers strategies for evolving an architecture in tandem with delivery, including defining core concepts, aligning teams through governance, and treating the model as a living product. He also examines how generative AI can both help and harm data architecture, accelerating first drafts but amplifying risk without a human-approved ontology. Jamie emphasizes the importance of doing the hard work upfront to make meaning explicit, keeping models simple and business-aligned, and using tools and patterns to reuse that meaning everywhere.

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks'—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.
    You’re a developer who wants to innovate—instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps—fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build
    Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
    Your host is Tobias Macey and today I'm interviewing Jamie Knowles about the impact that a well-developed data architecture (or lack thereof) has on data engineering work

    Interview
    Introduction
    How did you get involved in the area of data management?
    Can you start by giving your definition of "data architecture" and what it encompasses?
    How does the nuance change depending on the type of system you are designing? (e.g. data warehouse vs. transactional application database vs. event-driven streaming service)
    In application teams that are large enough, there is typically a software architect, but data architecture work often ends up happening organically through trial and error. Who is the responsible party for designing and enforcing a proper data architecture?
    There have been several generational shifts in approach to data warehouse projects in particular. What are some of the anti-patterns that crop up when there is no one forming a strong opinion on the design/architecture of the warehouse?
    The current stage is largely defined by the ELT pattern. What are some of the ways that workflow can encourage shortcuts?
    Often the need for a proper architecture isn't felt until an organic architecture has developed. What are some of the ways that teams can short-circuit that pain and iterate toward a more sustainable design?
    The common theme in all of the data architecture conversations that I've had is the need for business involvement. There is also a strong push for the business to just want the engineers to deliver data. What are some of the ways that AI utilities can help to accelerate delivery while also capturing business context?
    For teams that are already neck deep in a messy architecture, what are the strategies and tactics that they need to start working toward today to get to a better data architecture?
    What are the most interesting, innovative, or unexpected ways that you have seen teams approach the creation and implementation of their data architecture?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working in data architecture?
    How do you see the introduction of AI at each stage of the data lifecycle changing the ways that teams think about their architectural needs?

    Contact Info
    LinkedIn

    Parting Question
    From your perspective, what is the biggest gap in the tooling or technology for data management today?

    Closing Announcements
    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Links
    Idera
    ER/Studio
    ELT
    RDF == Resource Description Framework
    ORM == Object-Relational Mapping

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


About Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.