
Are AI Benchmarks Telling The Full Story? [SPONSORED] (Andrew Gordon and Nora Petrova - Prolific)
2025-12-20 | 16 mins.
Is a car that wins a Formula 1 race the best choice for your morning commute? Probably not. In this sponsored deep dive with Prolific, we explore why the same logic applies to artificial intelligence. While models are currently shattering records on technical exams, they often fail the most important test of all: **the human experience.**

Why High Benchmark Scores Don't Mean Better AI

Joining us are **Andrew Gordon** (Staff Researcher in Behavioral Science) and **Nora Petrova** (AI Researcher) from **Prolific**. They reveal the hidden flaws in how we currently rank AI and introduce a more rigorous, "humane" way to measure whether these models are actually helpful, safe, and relatable for real people.

---

Key Insights in This Episode:

* *The F1 Car Analogy:* Andrew explains why a model that excels at "Humanity's Last Exam" might be a nightmare for daily use. Technical benchmarks often ignore the nuances of human communication and adaptability.
* *The "Wild West" of AI Safety:* As users turn to AI for sensitive topics like mental health, Nora highlights the alarming lack of oversight and the "thin veneer" of safety training, citing recent controversial incidents like Grok-3's "Mecha Hitler."
* *Fixing the "Leaderboard Illusion":* The team critiques popular rankings like Chatbot Arena, discussing how anonymous, unstratified voting can produce biased results and how companies can "game" the system.
* *The Xbox Secret to AI Ranking:* Discover how Prolific uses *TrueSkill* (the same algorithm Microsoft developed for Xbox Live matchmaking) to create a fairer, more statistically sound leaderboard for LLMs. A toy sketch of this style of ranking appears below.
* *The Personality Gap:* Early data from the **HUMAINE Leaderboard** suggests that while AI is getting smarter, it is actually performing *worse* on metrics like personality, culture, and "sycophancy" (the tendency of models to become annoying people-pleasers).

---

About the HUMAINE Leaderboard

Moving beyond simple "A vs. B" testing, the researchers discuss their new framework, which samples participants based on *census data* (age, ethnicity, political alignment). By using a representative sample of the general public rather than just tech enthusiasts, they are building a standard that reflects the values of the real world.

*Are we building models for benchmarks, or are we building them for humans? It's time to change the scoreboard.*
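To make the ranking mechanics concrete, here is a minimal sketch of TrueSkill-style pairwise updates using the open-source Python `trueskill` package (pip install trueskill). The model names and votes are invented for illustration; this shows the general technique, not Prolific's actual pipeline.

```python
import trueskill

# Pairwise human preference votes, TrueSkill-style. Model names and votes
# below are invented; a "draw" would mean both answers were judged equally good.
env = trueskill.TrueSkill(draw_probability=0.10)
ratings = {m: env.create_rating() for m in ["model-a", "model-b", "model-c"]}

votes = [("model-a", "model-b"), ("model-c", "model-a"), ("model-a", "model-b")]
for winner, loser in votes:
    ratings[winner], ratings[loser] = env.rate_1vs1(ratings[winner], ratings[loser])

# Rank by the conservative estimate (mu - 3*sigma), which penalises models
# whose skill is still uncertain.
for name, r in sorted(ratings.items(), key=lambda kv: env.expose(kv[1]), reverse=True):
    print(f"{name}: mu={r.mu:.2f} sigma={r.sigma:.2f} score={env.expose(r):.2f}")

# "Efficient sampling": quality_1vs1 is high when a matchup is evenly poised,
# i.e. when one more human vote would be most informative.
print(f"match quality a vs c: {env.quality_1vs1(ratings['model-a'], ratings['model-c']):.2f}")
```

Because the conservative score subtracts three standard deviations, a new or rarely compared model ranks low until enough votes shrink its uncertainty, and routing the next human comparison to the highest-quality matchup is one way to spend annotation budget efficiently.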
ReScript link:
https://app.rescript.info/public/share/IDqwjY9Q43S22qSgL5EkWGFymJwZ3SVxvrfpgHZLXQc

---

TIMESTAMPS:
00:00:00 Introduction & The Benchmarking Problem
00:01:58 The Fractured State of AI Evaluation
00:03:54 AI Safety & Interpretability
00:05:45 Bias in Chatbot Arena
00:06:45 Prolific's Three Pillars Approach
00:09:01 TrueSkill Ranking & Efficient Sampling
00:12:04 Census-Based Representative Sampling
00:13:00 Key Findings: Culture, Personality & Sycophancy

---

REFERENCES:

Paper:
[00:00:15] MMLU
https://arxiv.org/abs/2009.03300
[00:05:10] Constitutional AI
https://arxiv.org/abs/2212.08073
[00:06:45] The Leaderboard Illusion
https://arxiv.org/abs/2504.20879
[00:09:41] HUMAINE Framework Paper
https://huggingface.co/blog/ProlificAI/humaine-framework

Company:
[00:00:30] Prolific
https://www.prolific.com
[00:01:45] Chatbot Arena
https://lmarena.ai/

Person:
[00:00:35] Andrew Gordon
https://www.linkedin.com/in/andrew-gordon-03879919a/
[00:00:45] Nora Petrova
https://www.linkedin.com/in/nora-petrova/

Algorithm:
[00:09:01] Microsoft TrueSkill
https://www.microsoft.com/en-us/research/project/trueskill-ranking-system/

Leaderboard:
[00:09:21] Prolific HUMAINE Leaderboard
https://www.prolific.com/humaine
[00:09:31] HUMAINE HuggingFace Space
https://huggingface.co/spaces/ProlificAI/humaine-leaderboard
[00:10:21] Prolific AI Leaderboard Portal
https://www.prolific.com/leaderboard

Dataset:
[00:09:51] Prolific Social Reasoning RLHF Dataset
https://huggingface.co/datasets/ProlificAI/social-reasoning-rlhf

Organization:
[00:10:31] MLCommons
https://mlcommons.org/

The Mathematical Foundations of Intelligence [Professor Yi Ma]
2025-12-13 | 1h 39 mins.
What if everything we think we know about AI understanding is wrong? Is compression the key to intelligence? Or is there something more: a leap from memorization to true abstraction? In this fascinating conversation, we sit down with **Professor Yi Ma**, a world-renowned expert in deep learning, IEEE/ACM Fellow, and author of the groundbreaking new book *Learning Deep Representations of Data Distributions*. Professor Ma challenges our assumptions about what large language models actually do, reveals why 3D reconstruction isn't the same as understanding, and presents a unified mathematical theory of intelligence built on just two principles: **parsimony** and **self-consistency**.

**SPONSOR MESSAGES START**

Prolific - Quality data. From real people. For faster breakthroughs.
https://www.prolific.com/?utm_source=mlst

cyber•Fund is a founder-led investment firm accelerating the cybernetic economy.
https://cyber.fund/?utm_source=mlst
Hiring an SF VC Principal: https://talent.cyber.fund/companies/cyber-fund-2/jobs/57674170-ai-investment-principal#content?utm_source=mlst
Submit investment deck: https://cyber.fund/contact?utm_source=mlst

**END**

Key Insights:

**LLMs Don't Understand, They Memorize**
Language models process text (*already* compressed human knowledge) using the same mechanism we use to learn from raw data.

**The Illusion of 3D Vision**
Sora, NeRFs, and similar systems can reconstruct 3D scenes yet still fail miserably at basic spatial reasoning.

**"All Roads Lead to Rome"**
Why adding noise is *necessary* for discovering structure.

**Why Gradient Descent Actually Works**
Natural optimization landscapes are surprisingly smooth, a "blessing of dimensionality."

**Transformers from First Principles**
Transformer architectures can be mathematically derived from compression principles (a toy sketch of the coding-rate objective follows the references below).

INTERACTIVE AI TRANSCRIPT PLAYER w/ REFS (ReScript):
https://app.rescript.info/public/share/Z-dMPiUhXaeMEcdeU6Bz84GOVsvdcfxU_8Ptu6CTKMQ

About Professor Yi Ma

Yi Ma is the inaugural director of the School of Computing and Data Science at the University of Hong Kong and a visiting professor at UC Berkeley.
https://people.eecs.berkeley.edu/~yima/
https://scholar.google.com/citations?user=XqLiBQMAAAAJ&hl=en
https://x.com/YiMaTweets

**Slides from this conversation:**
https://www.dropbox.com/scl/fi/sbhbyievw7idup8j06mlr/slides.pdf?rlkey=7ptovemezo8bj8tkhfi393fh9&dl=0

**Related Talks by Professor Ma:**
- Pursuing the Nature of Intelligence (ICLR): https://www.youtube.com/watch?v=LT-F0xSNSjo
- Earlier talk at Berkeley: https://www.youtube.com/watch?v=TihaCUjyRLM

TIMESTAMPS:
00:00:00 Introduction
00:02:08 The First Principles Book & Research Vision
00:05:21 Two Pillars: Parsimony & Consistency
00:09:50 Evolution vs. Learning: The Compression Mechanism
00:14:36 LLMs: Memorization Masquerading as Understanding
00:19:55 The Leap to Abstraction: Empirical vs. Scientific
00:27:30 Platonism, Deduction & The ARC Challenge
00:35:57 Specialization & The Cybernetic Legacy
00:41:23 Deriving Maximum Rate Reduction
00:48:21 The Illusion of 3D Understanding: Sora & NeRF
00:54:26 All Roads Lead to Rome: The Role of Noise
01:00:14 Benign Non-Convexity: Why Optimization Works
01:06:35 Double Descent & The Myth of Overfitting
01:14:26 Self-Consistency: Closed-Loop Learning
01:21:03 Deriving Transformers from First Principles
01:30:11 Verification & The Kevin Murphy Question
01:34:11 CRATE vs. ViT: White-Box AI & Conclusion
REFERENCES:

Book:
[00:03:04] Learning Deep Representations of Data Distributions
https://ma-lab-berkeley.github.io/deep-representation-learning-book/
[00:18:38] A Brief History of Intelligence
https://www.amazon.co.uk/BRIEF-HISTORY-INTELLIGEN-HB-Evolution/dp/0008560099
[00:38:14] Cybernetics
https://mitpress.mit.edu/9780262730099/cybernetics/

Book (Yi Ma):
[00:03:14] 3-D Vision book
https://link.springer.com/book/10.1007/978-0-387-21779-6

<TRUNC> refs on ReScript link/YT
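For readers who want to see the "parsimony" pillar in symbols, below is a from-scratch NumPy sketch of the coding-rate function behind Ma et al.'s Maximal Coding Rate Reduction (MCR²), the quantity discussed around 00:41:23. The data, dimensions, and epsilon are made up for illustration; this is a toy of the objective's definition, not code from the book or the talk.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z): bits needed to encode the columns of Z (d x n) up to distortion eps."""
    d, n = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z @ Z.T)[1]

def rate_reduction(Z, labels, eps=0.5):
    """Delta R = R(whole set) - class-weighted sum of per-class rates."""
    d, n = Z.shape
    r_whole = coding_rate(Z, eps)
    r_classes = sum(
        (np.sum(labels == c) / n) * coding_rate(Z[:, labels == c], eps)
        for c in np.unique(labels)
    )
    return r_whole - r_classes  # large when classes are compact but mutually spread out

rng = np.random.default_rng(0)
# Toy data: two classes concentrated near two different directions in R^8.
A = rng.normal(size=(8, 50)); A[0] += 5.0
B = rng.normal(size=(8, 50)); B[1] += 5.0
Z = np.concatenate([A, B], axis=1)
Z /= np.linalg.norm(Z, axis=0, keepdims=True)  # MCR^2 assumes normalised features
labels = np.array([0] * 50 + [1] * 50)
print(f"rate reduction: {rate_reduction(Z, labels):.3f}")
```

The rate reduction is maximised when each class compresses into a low-dimensional subspace while the classes jointly span as much of the space as possible, which is the compression-as-parsimony principle in miniature.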

Pedro Domingos: Tensor Logic Unifies AI Paradigms
2025-12-08 | 1h 27 mins.
Pedro Domingos, author of the bestselling book "The Master Algorithm," introduces his latest work: Tensor Logic, a new programming language he believes could become the fundamental language for artificial intelligence.

Think of it like this: physics found its language in calculus. Circuit design found its language in Boolean logic. Pedro argues that AI has been missing its language, until now.

**SPONSOR MESSAGES START**

Build your ideas with AI Studio from Google - http://ai.studio/build

Prolific - Quality data. From real people. For faster breakthroughs.
https://www.prolific.com/?utm_source=mlst

cyber•Fund is a founder-led investment firm accelerating the cybernetic economy.
https://cyber.fund/?utm_source=mlst
Hiring an SF VC Principal: https://talent.cyber.fund/companies/cyber-fund-2/jobs/57674170-ai-investment-principal#content?utm_source=mlst
Submit investment deck: https://cyber.fund/contact?utm_source=mlst

**END**

Current AI is split between two worlds that don't play well together:

* Deep learning (neural networks, transformers, ChatGPT): great at learning from data, terrible at logical reasoning
* Symbolic AI (logic programming, expert systems): great at logical reasoning, terrible at learning from messy real-world data

Tensor Logic unifies both. It's a single language where you can:

* Write logical rules that the system can actually learn and modify
* Do transparent, verifiable reasoning (no hallucinations)
* Mix "fuzzy" analogical thinking with rock-solid deduction

(A toy einsum rendering of the core idea follows the references below.)

INTERACTIVE TRANSCRIPT:
https://app.rescript.info/public/share/NP4vZQ-GTETeN_roB2vg64vbEcN7isjJtz4C86WSOhw

TOC:
00:00:00 - Introduction
00:04:41 - What is Tensor Logic?
00:09:59 - Tensor Logic vs PyTorch & Einsum
00:17:50 - The Master Algorithm Connection
00:20:41 - Predicate Invention & Learning New Concepts
00:31:22 - Symmetries in AI & Physics
00:35:30 - Computational Reducibility & The Universe
00:43:34 - Technical Details: RNN Implementation
00:45:35 - Turing Completeness Debate
00:56:45 - Transformers vs Turing Machines
01:02:32 - Reasoning in Embedding Space
01:11:46 - Solving Hallucination with Deductive Modes
01:16:17 - Adoption Strategy & Migration Path
01:21:50 - AI Education & Abstraction
01:24:50 - The Trillion-Dollar Waste

REFS:

Tensor Logic: The Language of AI [Pedro Domingos]
https://arxiv.org/abs/2510.12269
The Master Algorithm [Pedro Domingos]
https://www.amazon.co.uk/Master-Algorithm-Ultimate-Learning-Machine/dp/0241004543
Einsum is All You Need [Tim Rocktäschel]
https://rockt.ai/2018/04/30/einsum
https://www.youtube.com/watch?v=6DrCq8Ry2cw
Autoregressive Large Language Models are Computationally Universal [Dale Schuurmans et al - GDM]
https://arxiv.org/abs/2410.03170
Memory Augmented Large Language Models are Computationally Universal [Dale Schuurmans]
https://arxiv.org/pdf/2301.04589
On the Computational Power of Neural Nets [Siegelmann, 1995]
https://binds.cs.umass.edu/papers/1995_Siegelmann_JComSysSci.pdf
Sébastien Bubeck
https://www.reddit.com/r/OpenAI/comments/1oacp38/openai_researcher_sebastian_bubeck_falsely_claims/
I Am a Strange Loop [Douglas Hofstadter]
https://www.amazon.co.uk/Am-Strange-Loop-Douglas-Hofstadter/dp/0465030793
Stephen Wolfram
https://www.youtube.com/watch?v=dkpDjd2nHgo
The Complex World: An Introduction to the Foundations of Complexity Science [David C. Krakauer]
https://www.amazon.co.uk/Complex-World-Introduction-Foundations-Complexity/dp/1947864629
Geometric Deep Learning
https://www.youtube.com/watch?v=bIZB1hIJ4u8
Andrew Wilson (NYU)
https://www.youtube.com/watch?v=M-jTeBCEGHc
Yi Ma
https://www.patreon.com/posts/yi-ma-scientific-141953348
The Road to Reality [Roger Penrose]
https://www.amazon.co.uk/Road-Reality-Complete-Guide-Universe/dp/0099440687
Artificial Intelligence: A Modern Approach [Russell and Norvig]
https://www.amazon.co.uk/Artificial-Intelligence-Modern-Approach-Global/dp/1292153962
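Here is a toy NumPy rendering of the paper's central move, written for these notes rather than taken from the paper: a Datalog rule such as `Grandparent(x,z) <- Parent(x,y), Parent(y,z)` becomes a single einsum (multiply on the shared index y, sum it out) followed by a threshold to keep the result Boolean. The family facts are invented; relaxing the threshold and letting the tensors take learned real values is what opens the door to the "fuzzy" analogical reasoning discussed in the episode.

```python
import numpy as np

people = ["alice", "bob", "carol", "dave"]
parent = np.zeros((4, 4))  # parent[x, y] = 1  iff  x is a parent of y
parent[0, 1] = 1           # alice -> bob
parent[1, 2] = 1           # bob -> carol
parent[2, 3] = 1           # carol -> dave

# Join Parent(x,y) with Parent(y,z) on y, project y out, then step() to
# stay Boolean: one logical rule as one einsum plus a nonlinearity.
grandparent = (np.einsum("xy,yz->xz", parent, parent) > 0).astype(float)

for x in range(4):
    for z in range(4):
        if grandparent[x, z]:
            print(f"grandparent({people[x]}, {people[z]})")
# -> grandparent(alice, carol), grandparent(bob, dave)
```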

He Co-Invented the Transformer. Now: Continuous Thought Machines - Llion Jones and Luke Darlow [Sakana AI]
2025-11-23 | 1h 12 mins.
The Transformer architecture (which powers ChatGPT and nearly all modern AI) might be trapping the industry in a local minimum, preventing us from finding true intelligent reasoning, according to the person who co-invented it. Llion Jones and Luke Darlow, key figures at the research lab Sakana AI, join the show to make this provocative argument and to introduce new research that might lead the way forward.

**SPONSOR MESSAGES START**

Build your ideas with AI Studio from Google - http://ai.studio/build

Tufa AI Labs is hiring ML Research Engineers
https://tufalabs.ai/

cyber•Fund is a founder-led investment firm accelerating the cybernetic economy.
https://cyber.fund/?utm_source=mlst
Hiring an SF VC Principal: https://talent.cyber.fund/companies/cyber-fund-2/jobs/57674170-ai-investment-principal#content?utm_source=mlst
Submit investment deck: https://cyber.fund/contact?utm_source=mlst

**END**

The "Spiral" Problem: Llion uses a striking visual analogy to explain what current AI is missing. If you ask a standard neural network to understand a spiral shape, it solves it by drawing tiny straight lines that just happen to look like a spiral. It "fakes" the shape without understanding the concept of spiraling.

Introducing the Continuous Thought Machine (CTM): Luke Darlow dives deep into their solution, a biology-inspired model that fundamentally changes how AI processes information.

The Maze Analogy: Luke explains that standard AI tries to solve a maze by staring at the whole image and guessing the entire path instantly. Their new machine "walks" through the maze step by step.

Thinking Time: This allows the AI to "ponder." If a problem is hard, the model can naturally spend more time thinking about it before answering, effectively allowing it to correct its own mistakes and backtrack, something current language models struggle to do genuinely. (A toy sketch of this adaptive-computation idea follows the references below.)

https://sakana.ai/
https://x.com/YesThisIsLion
https://x.com/LearningLukeD

TRANSCRIPT:
https://app.rescript.info/public/share/crjzQ-Jo2FQsJc97xsBdfzfOIeMONpg0TFBuCgV2Fu8

TOC:
00:00:00 - Stepping Back from Transformers
00:00:43 - Introduction to Continuous Thought Machines (CTM)
00:01:09 - The Changing Atmosphere of AI Research
00:04:13 - Sakana's Philosophy: Research Freedom
00:07:45 - The Local Minimum of Large Language Models
00:18:30 - Representation Problems: The Spiral Example
00:29:12 - Technical Deep Dive: CTM Architecture
00:36:00 - Adaptive Computation & Maze Solving
00:47:15 - Model Calibration & Uncertainty
01:00:43 - Sudoku Bench: Measuring True Reasoning

REFS:

Why Greatness Cannot Be Planned [Kenneth Stanley]
https://www.amazon.co.uk/Why-Greatness-Cannot-Planned-Objective/dp/3319155237
https://www.youtube.com/watch?v=lhYGXYeMq_E
The Hardware Lottery [Sara Hooker]
https://arxiv.org/abs/2009.06489
https://www.youtube.com/watch?v=sQFxbQ7ade0
Continuous Thought Machines [Luke Darlow et al / Sakana]
https://arxiv.org/abs/2505.05522
https://sakana.ai/ctm/
LSTM: The Comeback Story? [Prof. Sepp Hochreiter]
https://www.youtube.com/watch?v=8u2pW2zZLCs
Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis [Kumar/Stanley]
https://arxiv.org/pdf/2505.11581
A Spline Theory of Deep Networks [Randall Balestriero]
https://proceedings.mlr.press/v80/balestriero18b/balestriero18b.pdf
https://www.youtube.com/watch?v=86ib0sfdFtw
https://www.youtube.com/watch?v=l3O2J3LMxqI
On the Biology of a Large Language Model [Anthropic, Jack Lindsey et al]
https://transformer-circuits.pub/2025/attribution-graphs/biology.html
The ARC Prize 2024 Winning Algorithm [Daniel Franzen and Jan Disselhoff, "The ARChitects"]
https://www.youtube.com/watch?v=mTX_sAq--zY
Neural Turing Machines [Graves]
https://arxiv.org/pdf/1410.5401
Adaptive Computation Time for Recurrent Neural Networks [Graves]
https://arxiv.org/abs/1603.08983
Sudoku Bench [Sakana]
https://pub.sakana.ai/sudoku/
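To make "thinking time" concrete, below is a minimal NumPy sketch in the spirit of Graves' Adaptive Computation Time (linked above), which the episode cites as background. It is not the CTM architecture itself: the recurrent weights, halting head, and thresholds are all invented for illustration, and the answer-averaging is a simplification of Graves' scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.3, size=(16, 16))  # toy recurrent "thought" weights (hypothetical)
w_halt = rng.normal(scale=0.3, size=16)   # toy halting head (hypothetical)

def ponder(x, max_ticks=20, eps=0.01):
    """Run internal ticks until accumulated halting probability is ~1."""
    state, total_p = x, 0.0
    states, probs = [], []
    while total_p < 1.0 - eps and len(states) < max_ticks:
        state = np.tanh(W @ state)                 # one internal thought tick
        p = 1.0 / (1.0 + np.exp(-w_halt @ state))  # halting probability for this tick
        total_p += p
        states.append(state)
        probs.append(p)
    # The answer is a halting-weighted average of per-tick states, so later
    # ticks can revise (effectively backtrack on) earlier ones.
    w = np.array(probs) / np.sum(probs)
    return (w[:, None] * np.array(states)).sum(axis=0), len(states)

answer, ticks = ponder(rng.normal(size=16))
print(f"pondered for {ticks} ticks")
```

In Graves' formulation the final tick receives a remainder weight so the halting probabilities sum exactly to one, and a small "ponder cost" is added to the loss so the model learns to stop early on easy inputs and think longer on hard ones.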

Why Humans Are Still Powering AI [Sponsored]
2025-11-03 | 24 mins.
Ever wonder where AI models actually get their "intelligence"? We reveal the dirty secret of Silicon Valley: behind every impressive AI system are thousands of real humans providing crucial data, feedback, and expertise.

Guest: Phelim Bradley, CEO and Co-founder of Prolific

Phelim Bradley runs Prolific, a platform that connects AI companies with verified human experts who help train and evaluate their models. Think of it as a sophisticated marketplace matching the right human expertise to the right AI task, whether that's doctors evaluating medical chatbots or coders reviewing AI-generated software.

Prolific: https://prolific.com/?utm_source=mlst
https://uk.linkedin.com/in/phelim-bradley-84300826

The discussion dives into:

**The human data pipeline**: How AI companies rely on human intelligence to train, refine, and validate their models, something rarely discussed openly

**Quality over quantity**: Why paying humans well and treating them as partners (not commodities) produces better AI training data

**The matching challenge**: How Prolific solves the complex problem of finding the right expert for each specific task, similar to matching Uber drivers to riders but with deep expertise requirements

**Future of work**: What it means when human expertise becomes an on-demand service, and why this might actually create more opportunities rather than fewer

**Geopolitical implications**: Why the centralization of AI development in US tech companies should concern Europe and the UK



Machine Learning Street Talk (MLST)